CW 528

Leander Schietgat, Celine Vens, Jan Struyf, Hendrik Blockeel, Dragi Kocev, Sašo Džeroski
Predicting gene function in S. cerevisiae and A. thaliana using hierarchical multi-label decision tree ensembles

Abstract

Motivation: S. cerevisiae and A. thaliana are two well-studied organisms in biology. Although their genomes were completed in 1996 and 2000, respectively, the functions of 30% to 40% of their open reading frames (ORFs) remain unclassified. Various machine learning methods have been proposed to annotate ORFs automatically. However, it is unclear which method is preferable in terms of predictive performance, efficiency, interpretability, and usability. Moreover, different evaluation measures for predictive performance have been used in the literature, each capturing only a limited aspect of a method's performance.
Results: We study the usefulness of decision-tree-based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning decision trees that predict ORF functions automatically. We present new results obtained with this algorithm, showing that its trees exhibit clearly better predictive performance than the trees found by previously described methods, while being equally interpretable. The predictive accuracy of our trees, however, is still below that of some recently proposed statistical learning methods. Ensembles of such trees, on the other hand, give even better predictive results, comparable with those of state-of-the-art methods (sometimes better, sometimes worse), while the ensemble method scales much better and is easier to use. We conclude that decision-tree-based methods are currently the most efficient, easy-to-use, and flexible approach to ORF function prediction; they are flexible in the sense that they cover the spectrum from maximally interpretable to maximally accurate models. Our evaluation uses precision-recall curves, which we argue are a better evaluation criterion than those used previously. This evaluation method can be seen as an additional contribution to the field.
Availability: The software is freely available at http://www.cs.kuleuven.be/~dtai/clus/.
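
To make the precision-recall evaluation mentioned in the abstract concrete, the following is a minimal sketch of one common way to build such a curve for multi-label predictions: pool all (ORF, class) pairs and sweep a threshold over the predicted scores. The function names, the pooling scheme, and the trapezoidal area estimate are illustrative assumptions; they are not taken from the report or from the Clus software, and the report's exact averaging may differ.

```python
from typing import List, Sequence, Tuple


def pooled_pr_curve(
    scores: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
) -> List[Tuple[float, float]]:
    """Return (recall, precision) points, one per distinct score threshold.

    scores[i][c] is the predicted score of class c for instance i and
    labels[i][c] is 1 if instance i truly has class c, else 0.
    All (instance, class) pairs are pooled (an assumed averaging scheme).
    """
    pairs = sorted(
        (
            (s, y)
            for row_s, row_y in zip(scores, labels)
            for s, y in zip(row_s, row_y)
        ),
        key=lambda pair: -pair[0],  # highest score first
    )
    total_pos = sum(y for _, y in pairs)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    points: List[Tuple[float, float]] = []
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every pair tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def area_under_pr(points: Sequence[Tuple[float, float]]) -> float:
    """Rough trapezoidal area under the curve.

    Linear interpolation in precision-recall space is only an
    approximation and tends to be optimistic.
    """
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area


if __name__ == "__main__":
    # Two hypothetical ORFs, two hypothetical function classes.
    toy_scores = [[0.9, 0.2], [0.8, 0.6]]
    toy_labels = [[1, 0], [0, 1]]
    curve = pooled_pr_curve(toy_scores, toy_labels)
    print(curve)
    print(area_under_pr(curve))
```

Unlike a single accuracy figure at a fixed threshold, the curve shows the trade-off between precision and recall across all thresholds, which matters when most classes are rare.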
