Wednesday, November 28, 13:30-14:30 in Celestijnenlaan 200A room 00.144, 3001 Leuven-Heverlee
How to Extract Relevant Information from Structured Data for Learning
By Ulrich Rueckert (Techn. Universitaet Muenchen)
We present first steps towards a general framework for propositionalization and relational learning based on sequences of queries and models, and the information effectively needed to generate both. From an abstract point of view, we only consider sequences of queries sent to a database, and sequences of consecutive models that combine those queries in a decision function. On a more detailed and procedural level, we consider how the queries in a sequence are actually generated and, in particular, which information is taken into account in generating them. In this way, the framework can address the question of how well different learning approaches use the information provided to them.

We apply the framework to investigate methods for extracting relevant and non-redundant information from structured data. Choosing a suitable feature representation for structured data is a non-trivial task due to the vast number of potential candidates. Ideally, one would like to pick a small but informative set of structural features, each providing complementary information about the instances. We frame the search for a suitable feature set as a combinatorial optimization problem. For this purpose, we define a scoring function that favors features that are as dissimilar as possible to all other features. The score is used in a stochastic local search (SLS) procedure to maximize the diversity of a feature set. In experiments on small molecule data, we investigate the effectiveness of the approach with two different linear classification schemes.
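
To make the feature selection idea concrete, the following is a minimal Python sketch of a diversity score combined with a stochastic local search. It is an illustration under stated assumptions, not the method presented in the talk: the abstract does not specify the scoring function, so mean pairwise Jaccard dissimilarity is swapped in here, and the feature names, the toy occurrence data, and the functions `jaccard`, `diversity` and `sls_select` are all hypothetical.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of instance ids."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def diversity(selected, occurrences):
    """Mean pairwise dissimilarity (1 - Jaccard) of the selected features.

    Assumption: this stands in for the scoring function of the talk,
    which the abstract does not define.
    """
    pairs = list(combinations(sorted(selected), 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - jaccard(occurrences[f], occurrences[g]) for f, g in pairs)
    return total / len(pairs)


def sls_select(occurrences, k, steps=200, noise=0.2, seed=0):
    """Search for a size-k feature set with high diversity.

    Each step swaps one selected feature for one unselected feature:
    with probability `noise` the swap is random, otherwise it is the
    swap with the best resulting score. The best set seen is returned.
    """
    rng = random.Random(seed)
    names = sorted(occurrences)
    current = set(rng.sample(names, k))
    best, best_score = set(current), diversity(current, occurrences)
    for _ in range(steps):
        inside = sorted(current)
        outside = [f for f in names if f not in current]
        if not outside:
            break
        if rng.random() < noise:
            # Random-walk step: helps escape local optima.
            drop, add = rng.choice(inside), rng.choice(outside)
        else:
            # Greedy step: evaluate every possible swap, take the best.
            drop, add = max(
                ((d, a) for d in inside for a in outside),
                key=lambda s: diversity((current - {s[0]}) | {s[1]}, occurrences),
            )
        current = (current - {drop}) | {add}
        score = diversity(current, occurrences)
        if score > best_score:
            best, best_score = set(current), score
    return best, best_score


# Hypothetical toy data: each structural feature maps to the ids of the
# molecules in which it occurs. "ring_dup" is fully redundant with "ring".
occurrences = {
    "ring": {0, 1, 2, 3},
    "ring_dup": {0, 1, 2, 3},
    "nitro": {4, 5},
    "hydroxyl": {1, 6},
    "halogen": {7, 8, 9},
}

best, best_score = sls_select(occurrences, k=3)
print(sorted(best), best_score)
```

On this toy data, any set containing both `ring` and `ring_dup` scores poorly because the two features occur in exactly the same molecules, so the search moves towards sets whose features cover disjoint groups of instances. The selected features could then be used as inputs to a linear classifier, which is the setting the abstract evaluates.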