StReBio'09

KDD-09 workshop on statistical relational mining and learning in bioinformatics

Paris, Sunday June 28th, 2009

Objectives

Bioinformatics is an application domain where information is naturally represented in terms of relations between heterogeneous objects. Modern experimentation and data acquisition techniques allow the study of complex interactions in biological systems. This raises interesting challenges for machine learning and data mining researchers, as the amount of data is huge, some information cannot be observed, and measurements may be noisy.

The StReBio workshop aims to bring together researchers from the fields of statistical relational learning and bioinformatics. Our main goals are to provide a common venue for both communities, and to stimulate the development of new statistical relational learning techniques tailored to biological applications.

Programme

09:00-10:00 Session 1
Invited Talk: David Balding
Simultaneous Analysis of all SNPs in Genome-Wide and Resequencing Association Studies
10:00-10:30 Coffee Break
10:30-12:30 Session 2
Problem Statement: H.P. Shanahan
Can we improve on the identification of Transcription Factor Binding Sites Using String Kernels?

Jan Ramon, Fabrizio Costa
Handling Missing Values and Censored Data in PCA of Pharmacological Matrices

Huma Lodhi, Stephen Muggleton, Mike J E Sternberg
Multi-Class Protein Fold Recognition using Large Margin Logic based Divide and Conquer Learning
12:30-14:00 Lunch
14:00-15:30 Session 3
Vlado Kešelj, Haibin Liu, Norbert Zeh, Christian Blouin, Chris Whidden
Finding Optimal Parameters for Edit-Distance Based Sequence Classification is NP-Hard

W. Hämäläinen
Lift-based Search for Significant Dependencies in Dense Data Sets

Tammy M. K. Cheng, Yu-En Lu, Pietro Lió
Identification of Structurally Important Amino Acids in Proteins by Graph-Theoretic Measures
15:30-16:00 Coffee Break
16:00-17:30 Session 4
Zoran Obradovic, Uros Midic, A. Keith Dunker
Protein Sequence Alignment and Intrinsic Disorder: A Substitution Matrix for an Extended Alphabet

Rabie Saidi, Mondher Maddouri, Engelbert M. Nguifo
Comparing Graph-based Representations of Protein for Mining Purposes

Jorge M. Arevalillo, Hilario Navarro
Using Random Forests to Uncover Bivariate Interactions in High Dimensional Small Data Sets

Joint Events with KDD Conference
18:00-18:15 Opening Remarks
18:15-18:45 Award Presentations
18:45-19:30 Innovation Award Talk

Invited speaker

We are happy to announce that David Balding (professor of statistical genetics, Imperial College London) will give an invited talk.

Simultaneous Analysis of all SNPs in Genome-Wide and Resequencing Association Studies

Clive Hoggart (1), John Whittaker (2), Maria De Iorio (1) and David Balding (1)

1. Department of Epidemiology and Public Health, Imperial College London
2. Non-communicable Disease Epidemiology Unit, London School of Hygiene and Tropical Medicine

Testing one marker at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNP markers from a genome-wide study, to identify the subset that best predicts disease outcome, is now feasible thanks to developments in stochastic search methods. We employ an approach that maximises a posterior density (or penalised likelihood), where every SNP can be considered for additive, dominant and recessive contributions to disease risk. Posterior mode estimates are obtained for regression coefficients that are each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate is interpreted as corresponding to a significant SNP. We investigate two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derive an explicit approximation for type-I error that avoids the need to employ permutation procedures. As well as genome-wide analyses, our method is well suited to fine-mapping with very dense SNP sets obtained from resequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes and covariate adjustment, and can be extended to search for interactions and for prediction of disease status. We demonstrate the method using simulated case-control data sets of up to 500K SNPs, a real genome-wide data set of 300K SNPs, and a sequence-based data set, each of which can be analysed in a few hours on a desktop workstation. The talk is based on PLoS Genetics 4(7): e1000130. doi:10.1371/journal.pgen.1000130 and recent work.
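
To make the idea of a "prior with a sharp mode at zero" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Laplace prior (equivalently an L1 penalty) as a stand-in for the normal-exponential-gamma prior named in the abstract, additive genotype coding only (0, 1 or 2 copies of the minor allele), and a deterministic proximal-gradient search in place of the search method used in the talk. The function names, the toy data and the penalty value are all hypothetical.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value towards zero; return exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_sparse_logistic(genotypes, status, penalty, step=0.1, iterations=2000):
    """Posterior-mode (L1-penalised) logistic regression fitted jointly over all SNPs.

    genotypes: one row per individual, one column per SNP, coded 0/1/2.
    status:    1 for cases, 0 for controls.
    penalty:   strength of the L1 penalty (assumed Laplace prior scale).
    Returns (intercept, coefficients); the intercept is not penalised.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    intercept = 0.0
    coefs = [0.0] * m
    for _ in range(iterations):
        # Residual = predicted risk minus observed status, per individual.
        residuals = [
            sigmoid(intercept + sum(c * g for c, g in zip(coefs, row))) - y
            for row, y in zip(genotypes, status)
        ]
        grad_intercept = sum(residuals) / n
        grads = [
            sum(r * row[j] for r, row in zip(residuals, genotypes)) / n
            for j in range(m)
        ]
        # Plain gradient step for the intercept, then gradient step plus
        # soft-thresholding for each SNP coefficient.
        intercept -= step * grad_intercept
        coefs = [
            soft_threshold(c - step * g, step * penalty)
            for c, g in zip(coefs, grads)
        ]
    return intercept, coefs


# Toy data (hypothetical): SNP 0 is enriched in cases, SNPs 1 and 2 are balanced.
status = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = [
    [2, 0, 1],
    [2, 1, 1],
    [1, 2, 0],
    [2, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 2, 0],
    [0, 1, 0],
]

_, coefficients = fit_sparse_logistic(genotypes, status, penalty=0.05)
selected = [j for j, c in enumerate(coefficients) if c != 0.0]
print("SNPs with non-zero coefficient:", selected)
```

In this sketch a SNP counts as selected when its coefficient is not exactly zero, mirroring the abstract's reading of a non-zero estimate as a significant SNP. Raising the hypothetical `penalty` argument shrinks more coefficients to exactly zero; it plays the role that the sharpness of the prior plays in the method described above.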

Organisation

Chairs: Jan Ramon, Fabrizio Costa, Christophe Costa Florencio, Joost Kok

Call for contributions

Contribution types

We invite contributions of the following types:
  • Regular papers, describing work in the area of the workshop.
  • Open problem papers, describing challenges and open problems.
  • Challenge solution papers, describing solutions of open problems presented at StReBio'08.
    Here is a list of problems.

Topics of interest

The workshop welcomes presentations describing new methods, problem settings, applications and models that exploit structured data in biology. Methods include, but are not restricted to:
  • Statistical Relational Learning
  • Relational Probabilistic Models
  • Multi-relational Data Mining
  • Graph Methods
The data, structures or models considered can include, but are not limited to:
  • Sequences (DNA, RNA, protein)
  • Pathways (chemical, metabolic, mutation, interaction pathways)
  • 2D and 3D structures of proteins and RNA
  • Chemical structures (e.g. QSAR, especially regarding interaction of compounds with proteins)
  • Evolutionary relations (phylogeny, homology relations)
  • Ontology integration (gene, enzyme, protein function ontologies)
  • Large networks (regulatory, co-expression, interaction, metabolic, ...)
  • Concept graphs (heterogeneous graphs linking information on articles, authors and biological entities such as compounds, proteins, genes, ...)

Submission Instructions

  • Length of paper: short papers are encouraged; the limit is FOUR (4) pages in ACM style
  • Format: PDF
  • Templates: download ACM style from here
  • Submission: send the PDF file by email to strebio09@cs.kuleuven.be

Registration

As explained on the KDD registration webpage, one can either register for the full ACM-SIGKDD conference (including workshops, tutorials and the full technical program), or only for workshops/tutorials/evenings (excluding the technical program of the main conference).

Important Dates

  • Submission: Apr 27 (extended)
  • Notification: May 15
  • Camera-ready copy: May 22
  • Workshop: June 28

Past editions