HMC Software and Datasets

This page contains supporting materials for the paper L. Schietgat, C. Vens, J. Struyf, H. Blockeel, D. Kocev, S. Džeroski, "Predicting gene function using hierarchical multi-label decision tree ensembles", BMC Bioinformatics 2010, 11:2.

Software

The Clus-HMC and Clus-HMC-Ens algorithms are implemented in the Clus system.

Datasets

The datasets used in our experimental comparison come from the field of functional genomics and were kindly provided by Amanda Clare, Zafer Barutcuoglu, Tim Hughes and Fritz Roth. Datasets D1-D18 originate from the organisms S. cerevisiae and A. thaliana and have annotations from the MIPS Functional Catalogue (FunCat) and the molecular function branch of the Gene Ontology. The original versions of D1-D18 can be found here. Dataset D19 originates from the organism M. musculus and has annotations from the three branches of the Gene Ontology. The original data can be found here.

The datasets are stored in Weka's ARFF format and are ready to be used with Clus. Each dataset comes as three ARFF files: train, valid, and test. In our article, valid was used to tune the F-test stopping criterion. The final model, constructed on the union of train and valid, was then evaluated on test.
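The union of train and valid mentioned above can be built by concatenating the data rows of the two files under a single header. The following Python sketch is one possible way to do this and is not part of Clus. It assumes that both files declare the same attributes in the same order, and it treats every line as opaque text, so it does not depend on the attribute types used in the files. The function names and the file names in the usage note are hypothetical.

```python
def split_arff(text):
    """Split ARFF text into (header_lines, data_lines).

    The header runs up to and including the @DATA line. Blank lines and
    comment lines (starting with '%') after @DATA are dropped.
    """
    header, data = [], []
    in_data = False
    for line in text.splitlines():
        if in_data:
            stripped = line.strip()
            if stripped and not stripped.startswith("%"):
                data.append(line)
        else:
            header.append(line)
            if line.strip().lower() == "@data":
                in_data = True
    if not in_data:
        raise ValueError("no @DATA section found")
    return header, data


def attribute_lines(header):
    """Return the @ATTRIBUTE declarations of a header, whitespace-trimmed."""
    return [
        line.strip()
        for line in header
        if line.strip().lower().startswith("@attribute")
    ]


def merge_arff(train_text, valid_text):
    """Return ARFF text with the train header and the rows of both files.

    Assumption: the two files must declare identical attributes; otherwise
    a ValueError is raised instead of silently producing a broken file.
    """
    train_header, train_rows = split_arff(train_text)
    valid_header, valid_rows = split_arff(valid_text)
    if attribute_lines(train_header) != attribute_lines(valid_header):
        raise ValueError("train and valid declare different attributes")
    return "\n".join(train_header + train_rows + valid_rows) + "\n"
```

A possible use, with placeholder file names: read the train and valid files into strings with open(...).read(), pass both to merge_arff, and write the result to a new file such as trainvalid.arff, which can then serve as the training set for the final model.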

S. cerevisiae datasets

FunCat annotated datasets          Gene Ontology annotated datasets

A. thaliana datasets

FunCat annotated datasets          Gene Ontology annotated datasets

M. musculus dataset

Several of these datasets suffer from non-unique feature representations, which makes the learning task more difficult. More information about this issue can be found in Pliakos et al., "Representational power of gene features for function prediction", 2015.

Parameter settings for Clus-HMC(-Ens)

Data files for figures in the paper

Questions?

Please direct questions about Clus-HMC(-Ens) to Leander Schietgat, Celine Vens, and Jan Struyf.