HMC Software and Datasets

This page contains supporting materials for the paper L. Schietgat, C. Vens, J. Struyf, H. Blockeel, D. Kocev, S. Džeroski, "Predicting gene function using hierarchical multi-label decision tree ensembles", BMC Bioinformatics 2010, 11:2.

Software

The Clus-HMC and Clus-HMC-Ens algorithms are implemented in the Clus system.

Datasets

The datasets used in our experimental comparison come from the field of functional genomics and were kindly provided by Amanda Clare, Zafer Barutcuoglu, Tim Hughes, and Fritz Roth. The original versions of D1-D18 can be found here. These datasets originate from the organisms S. cerevisiae and A. thaliana and have annotations from the MIPS Functional Catalogue (FunCat) and the molecular function branch of the Gene Ontology (GO). Dataset D19 originates from the organism M. musculus and has annotations from all three branches of the Gene Ontology. The original data can be found here.

The datasets are stored in Weka's ARFF format and are ready to be used with Clus. For each dataset, there are three ARFF files: train, valid, and test. In our article, the valid file was used to tune the F-test stopping criterion. The final model, constructed on the union of train and valid, was then evaluated on test.
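The union of train and valid mentioned above can be sketched as follows. This is only an illustration, not the procedure used in the article: it assumes both files declare identical attributes, and the function names `split_arff` and `merge_arff` are hypothetical helpers that are not part of Clus or Weka.

```python
def split_arff(text):
    """Split ARFF text into (header_lines, data_rows) at the @DATA marker."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "@data":
            header = lines[: i + 1]
            # Keep only non-empty rows that are not ARFF comments (%).
            rows = [
                row for row in lines[i + 1:]
                if row.strip() and not row.lstrip().startswith("%")
            ]
            return header, rows
    raise ValueError("no @DATA section found")


def attribute_declarations(header):
    """Return the @ATTRIBUTE lines of a header, stripped of whitespace."""
    return [
        line.strip() for line in header
        if line.strip().lower().startswith("@attribute")
    ]


def merge_arff(train_text, valid_text):
    """Return ARFF text with the train header and the rows of both files.

    Assumption: the two files describe the same attributes in the same
    order; otherwise the rows cannot simply be concatenated.
    """
    train_header, train_rows = split_arff(train_text)
    valid_header, valid_rows = split_arff(valid_text)
    if attribute_declarations(train_header) != attribute_declarations(valid_header):
        raise ValueError("attribute declarations differ between the two files")
    return "\n".join(train_header + train_rows + valid_rows) + "\n"
```

In practice one would read the two files from disk, pass their contents to `merge_arff`, and write the result to a new ARFF file that serves as the training set for the final model.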

S. cerevisiae datasets

FunCat annotated datasets

Gene Ontology annotated datasets

D0(GO)

A. thaliana datasets

FunCat annotated datasets

Gene Ontology annotated datasets

M. musculus dataset

D19(GO)

Several of these datasets suffer from non-unique feature representations, making the learning task more difficult. More information about this issue can be found in Pliakos et al., Representational power of gene features for function prediction, 2015.

Parameter settings for Clus-HMC(-Ens)

Example settings files to be used with Clus-HMC(-Ens):
- FunCat annotated datasets
- GO annotated datasets
To run Clus-HMC-Ens instead of Clus-HMC, add the command-line option "-forest" when running Clus.
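As a rough sketch of how the "-forest" option might be applied, the helper below assembles a command line for either variant. Only the "-forest" option comes from this page; the `java -jar Clus.jar` invocation form, the jar location, and the settings file name `funcat.s` are assumptions that should be checked against the Clus documentation for the installed version.

```python
import shlex


def build_clus_command(settings_file, ensemble=False, jar_path="Clus.jar"):
    """Build the argument list for a Clus run.

    Assumptions: Clus is started via `java -jar <jar_path>` and takes the
    settings file as its last argument. `jar_path` and `settings_file` are
    placeholders chosen for illustration.
    """
    command = ["java", "-jar", jar_path]
    if ensemble:
        # "-forest" switches from a single tree (Clus-HMC) to an ensemble
        # (Clus-HMC-Ens), as stated on this page.
        command.append("-forest")
    command.append(settings_file)
    return command


# Print the command for an ensemble run with a hypothetical settings file.
print(shlex.join(build_clus_command("funcat.s", ensemble=True)))
```

The resulting list can be passed to `subprocess.run` once the jar path and settings file point at real files.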

Data files for figures in the paper

Pooled AUPRC comparison (Fig. 3): csv file
Average AUPRC comparison (Fig. 7): csv file
Average precision at C4.5H/M's recall (Fig. 8): csv file
AUROC comparison (Fig. 12): csv file

Questions?

Please direct questions about Clus-HMC(-Ens) to Leander Schietgat, Celine Vens, and Jan Struyf.