Abstract: Direct Coupling Analysis (DCA) is a powerful tool for finding pairwise dependencies in large biological data sets. It amounts to inferring the coefficients of a probabilistic model in an exponential family and then using the largest inferred coefficients as predictors of the dependencies of interest. The main computational bottleneck is the inference. As described recently by Jukka Corander in this seminar series, DCA has been done on bacterial whole-genome data, at the price of significant compute time and investment in code optimization.
We have investigated whether DCA can be sped up by first filtering the data on correlations, an approach we call Correlation-Compressed Direct Coupling Analysis (CC-DCA). The computational bottleneck then moves from DCA to the more standard task of finding the subset of most strongly correlated vectors in a large data set. I will describe the results obtained so far and outline what it would take to do CC-DCA on whole-genome data in humans and other higher organisms.
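To make the filtering idea concrete, here is a minimal sketch of a correlation pre-filter. It is an illustration only, not the procedure from the paper: the function name `top_correlated_columns`, the use of absolute Pearson correlation, and the `n_pairs` cutoff are all assumptions. It computes the full correlation matrix, which is feasible only for modest numbers of columns; the point of the talk is precisely that this step becomes the bottleneck at genome scale.

```python
import numpy as np

def top_correlated_columns(data, n_pairs):
    """Hypothetical pre-filter: keep the columns that take part in the
    n_pairs most strongly correlated column pairs (by |Pearson r|)."""
    data = np.asarray(data, dtype=float)
    n_samples = data.shape[0]

    # Standardize each column; constant columns get zero correlation
    # with everything because their centered values are all zero.
    centered = data - data.mean(axis=0)
    std = data.std(axis=0)
    z = centered / np.where(std > 0, std, 1.0)

    # Pearson correlation matrix between columns.
    corr = (z.T @ z) / n_samples

    # Rank each unordered pair (i < j) by absolute correlation.
    i_idx, j_idx = np.triu_indices(corr.shape[0], k=1)
    strength = np.abs(corr[i_idx, j_idx])
    order = np.argsort(strength)[::-1][:n_pairs]

    pairs = [(int(i_idx[k]), int(j_idx[k])) for k in order]
    kept = sorted({col for pair in pairs for col in pair})
    return pairs, kept
```

In a pipeline of this kind, only the columns in `kept` would be passed on to the (much more expensive) DCA inference step, which is not shown here.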
This is joint work with Chen-Yi Gao and Hai-Jun Zhou, available as arXiv:1710.04819.
Speaker: Erik Aurell
Affiliation: Professor of Biological Physics, KTH Royal Institute of Technology
Place of Seminar: University of Helsinki