Statistical Inference
The CMB group develops statistical methods for the analysis of biological data from three application areas: "stem cells", "regulatory networks" and "metabolomics". Our modelling approaches range from differential equation models (ODE, SDE, PDE) to stochastic mixture models. Our main inference tools are Bayesian parameter estimation, machine learning techniques such as independent component analysis and principal component analysis, and functional data analysis techniques such as spline approximation.
Bayesian Estimation of Latent Causes in Gene Regulatory Processes
Main contacts at CMB: Sabine Hug, Ivan Kondofersky, Christiane Fuchs, Fabian Theis

- We consider the dynamics of molecular systems, typically modeled through networks like the continuous time recurrent neural network (CTRNN). Such systems mainly consist of components for which the interaction structure is known apart from some constants. These components might be observed or unobserved.
- Moreover, we assume that the dynamics is further influenced by latent causes which follow some unknown dynamics and act on the system in an unknown manner. These latent causes might be linear or non-linear.
- The dynamics of the whole system is described in terms of ordinary differential equations and different noise models.
- The aim of our analysis is to estimate both the unknown parameters and the latent causes in the dynamic model. This is done by combining penalized spline approximation, Bayesian parameter estimation, model selection methods and blind source separation.
- External information on the dynamics of the latent causes and on the model parameters is incorporated through prior distributions.
- Further information is provided on the LatentCauses Project website.
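The forward model described in this list can be illustrated with a minimal sketch. It assumes one common CTRNN form, a logistic activation, a single scalar latent cause h(t) with per-node gains b_i, and plain forward-Euler integration; all of these, and every name in the code, are illustrative assumptions rather than the group's implementation. The sketch only simulates the system and does not perform the estimation step.

```python
import math


def sigmoid(z):
    """Logistic activation used in the CTRNN right-hand side."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_ctrnn(weights, taus, latent_gain, latent_cause, x0, dt=0.01, n_steps=1000):
    """Forward-Euler integration of
        tau_i * dx_i/dt = -x_i + sum_j w_ij * sigmoid(x_j) + b_i * h(t),
    where h(t) is a hypothetical scalar latent cause and b_i its gain on node i.
    Returns the list of states at times 0, dt, ..., n_steps * dt.
    """
    n = len(x0)
    x = list(x0)
    path = [list(x)]
    for step in range(n_steps):
        h = latent_cause(step * dt)           # latent cause evaluated at the current time
        act = [sigmoid(v) for v in x]         # node activations
        x = [
            x[i]
            + dt * (-x[i] + sum(weights[i][j] * act[j] for j in range(n)) + latent_gain[i] * h) / taus[i]
            for i in range(n)
        ]
        path.append(list(x))
    return path


if __name__ == "__main__":
    W = [[0.0, -2.0], [1.5, 0.0]]             # made-up interaction weights
    trajectory = simulate_ctrnn(W, [1.0, 0.5], [1.0, 0.0], math.sin, [0.0, 0.0])
    print(trajectory[-1])
```

In an inference setting, the weights, time constants and gains would be the unknown parameters, and h(t) would itself be estimated, for instance through a spline representation.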
References
- F. Blöchl and F. Theis. Estimating hidden influences in metabolic and gene regulatory networks. In Proc. ICA 2009, volume 5441 of LNCS, pages 387-394. Springer, Paraty, Brazil, 2009. [ .pdf ]
- S. Hug and F. Theis. Bayesian inference of latent causes in gene regulatory dynamics. In Proc. LVA/ICA 2012, volume 7191 of LNCS, pages 520-527. Springer, Tel Aviv, Israel, 2012. [ .pdf ]
Estimation of Single-cell Heterogeneities from Cell Populations
Main contacts at CMB: Christiane Fuchs, Fabian Theis

- Even when appearing perfectly homogeneous on a morphological basis, tissues can be substantially heterogeneous in single-cell molecular expression. As such heterogeneities might govern the regulation of cell fate, one is interested in quantifying the heterogeneities in a given tissue.
- Unfortunately, measuring gene expression in single cells to examine whether it follows a heterogeneous distribution is very expensive; instead, small numbers of cells are randomly selected, and the average expression level of each such subpopulation is measured.
- We investigate how heterogeneities can be detected from such data by application of statistical methods, and how the proportions, mean values and standard deviations of the groups of differently expressed cells can be estimated.
- Application to measurements from human breast epithelial cells revealed the functional relevance of the heterogeneous expression of a particular gene.
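To make the idea of recovering single-cell heterogeneity from pooled measurements concrete, here is a deliberately simplified sketch. It assumes two groups of cells with Gaussian expression, a common and known standard deviation, known group means, and estimates only the fraction of "high" cells by a grid search over the likelihood. These simplifications and all function names are assumptions made for illustration; the method in the reference below may differ in its distributional assumptions and estimates more quantities.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def simulate_pool_averages(n_pools, n_cells, p_high, mu_high, mu_low, sd, rng):
    """Average expression of n_cells randomly drawn cells, repeated n_pools times."""
    averages = []
    for _ in range(n_pools):
        cells = [
            rng.gauss(mu_high if rng.random() < p_high else mu_low, sd)
            for _ in range(n_cells)
        ]
        averages.append(sum(cells) / n_cells)
    return averages


def pool_log_likelihood(averages, n_cells, p_high, mu_high, mu_low, sd):
    """Log-likelihood of pool averages as a binomial mixture over the number k of 'high' cells."""
    sd_avg = sd / math.sqrt(n_cells)          # sd of an average of n_cells cells
    total = 0.0
    for a in averages:
        density = 0.0
        for k in range(n_cells + 1):
            weight = math.comb(n_cells, k) * p_high ** k * (1.0 - p_high) ** (n_cells - k)
            mean_k = (k * mu_high + (n_cells - k) * mu_low) / n_cells
            density += weight * normal_pdf(a, mean_k, sd_avg)
        total += math.log(density)
    return total


def estimate_fraction(averages, n_cells, mu_high, mu_low, sd, grid_size=20):
    """Grid-search maximum-likelihood estimate of the fraction of 'high' cells."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return max(grid, key=lambda p: pool_log_likelihood(averages, n_cells, p, mu_high, mu_low, sd))


if __name__ == "__main__":
    rng = random.Random(1)
    pools = simulate_pool_averages(500, 10, 0.2, 5.0, 1.0, 0.5, rng)
    print(estimate_fraction(pools, 10, 5.0, 1.0, 0.5))
```

The key point is that a pool average is itself a mixture: the number of "high" cells in a pool is binomial, and each possible count shifts the mean of the average.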
References
- S. Bajikar, C. Fuchs, A. Kowarsch, F. Theis and K. Janes. Inferring single-cell gene expression frequencies from stochastic transcriptional profiles, submitted.
Bayesian Blind Source Separation for Microarray Data
Main contacts at CMB: Katrin Illner, Christiane Fuchs, Fabian Theis

- In many bioinformatic applications we deal with heterogeneous large scale data like microRNA, mRNA or protein level data, and in most cases we already know some interactions and dependencies between these variables.
- In the proposed work we focus on this aspect and introduce a version of blind source separation, i.e. we assume that our measurements (observations) are linear mixtures of some underlying biological processes (sources), where we explicitly include the dependency structure as prior knowledge.
- More precisely, we generalize an idea from the context of time series data. Instead of considering the time-delayed correlation, we introduce a graph-delayed correlation for networks and assume that different sources have vanishing graph-delayed correlation. We define our model in a probabilistic way and formulate an expectation-maximization algorithm to obtain maximum a posteriori estimates of the parameters and hidden sources.
- This work refines the ideas of the GraDe algorithm, which was also developed in our group.
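The step from time-delayed to graph-delayed correlation can be sketched as follows. The definition used here, a covariance between the value of one signal at a node and the value of another signal at that node's graph successors, is an assumption made by analogy with the time-lagged case; the exact definition and normalisation used in GraDe or in the probabilistic model above may differ. On a directed chain graph, the quantity reduces to the ordinary lag-one cross-covariance.

```python
def graph_delayed_covariance(x, y, adjacency):
    """Assumed definition: (1/n) * sum_{i,j} A[i][j] * (x_i - mean(x)) * (y_j - mean(y)).

    x and y hold one value per graph node; adjacency[i][j] is the weight of edge i -> j.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += adjacency[i][j] * (x[i] - mean_x) * (y[j] - mean_y)
    return total / n


def chain_adjacency(n):
    """Directed path 0 -> 1 -> ... -> n-1; here the 'graph delay' is a time lag of one."""
    return [[1 if j == i + 1 else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    print(graph_delayed_covariance(signal, signal, chain_adjacency(4)))
```

Under the separation assumption stated above, this quantity would vanish for two different sources and be non-zero for a source paired with itself.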
References
- K. Illner, C. Fuchs and F. Theis. Blind source separation using latent Gaussian graphical models. Proceedings of the Ninth International Workshop on Computational Systems Biology, WCSB 2012, June 4-6, 2012, Ulm, Germany. [ .pdf ]
Bayesian Inference and Model Selection for ODE Models
Main contact at CMB: Sabine Hug

- We consider high-dimensional dynamical models which can appropriately be described in terms of ordinary differential equations. These models contain a typically large number of unknown parameters which we aim to infer.
- However, parameter estimation is challenging because of the high dimensionality of the joint distribution and the resulting very low effective sample size of a standard MCMC sampler. Hence, we develop a novel independence MCMC sampler based on a vine copula decomposition of the proposal function.
- Model selection is performed by computation of Bayes factors through thermodynamic integration.
- We applied our methods to infer the processing of zirconium in the human body. Our analysis validated a previously proposed new model over an established one. Furthermore, the Bayesian approach allows for retrospective dosimetry and predictions in a significantly more reliable fashion than previously possible.
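Thermodynamic integration, mentioned above for computing Bayes factors, rests on the identity log Z = ∫₀¹ E_t[log p(y | θ)] dt, where the expectation is taken under the power posterior proportional to p(y | θ)^t p(θ). The following sketch applies it to a toy conjugate model (Gaussian data with known standard deviation and a Gaussian prior on the mean) instead of an ODE model, because the power posterior is then Gaussian and can be sampled exactly, with no MCMC needed, and the result can be checked against the closed-form evidence. The model, the cubic temperature ladder and all names are illustrative assumptions.

```python
import math
import random


def log_likelihood(theta, data, sigma):
    """log p(y | theta) for independent y_i ~ N(theta, sigma^2)."""
    sq = sum((y - theta) ** 2 for y in data)
    return -0.5 * len(data) * math.log(2.0 * math.pi * sigma ** 2) - sq / (2.0 * sigma ** 2)


def sample_power_posterior(t, data, sigma, tau, rng):
    """Exact draw from p(y | theta)^t * N(theta; 0, tau^2), which is Gaussian in this toy model."""
    precision = t * len(data) / sigma ** 2 + 1.0 / tau ** 2
    mean = (t * sum(data) / sigma ** 2) / precision
    return rng.gauss(mean, math.sqrt(1.0 / precision))


def log_evidence_ti(data, sigma, tau, n_temps=40, n_draws=2000, seed=0):
    """Thermodynamic integration: trapezoidal rule over a temperature ladder on [0, 1]."""
    rng = random.Random(seed)
    temps = [(k / n_temps) ** 3 for k in range(n_temps + 1)]  # denser near t = 0
    means = []
    for t in temps:
        draws = (sample_power_posterior(t, data, sigma, tau, rng) for _ in range(n_draws))
        means.append(sum(log_likelihood(th, data, sigma) for th in draws) / n_draws)
    return sum(0.5 * (temps[k + 1] - temps[k]) * (means[k] + means[k + 1]) for k in range(n_temps))


def log_evidence_exact(data, sigma, tau):
    """Closed-form log marginal likelihood of the same model, for comparison."""
    n = len(data)
    s1 = sum(data)
    s2 = sum(y * y for y in data)
    quad = (s2 - tau ** 2 * s1 ** 2 / (sigma ** 2 + n * tau ** 2)) / sigma ** 2
    log_det = n * math.log(sigma ** 2) + math.log(1.0 + n * tau ** 2 / sigma ** 2)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


if __name__ == "__main__":
    y = [1.2, 0.8, 1.5, 0.9, 1.1]  # made-up observations
    print(log_evidence_ti(y, 1.0, 1.0), log_evidence_exact(y, 1.0, 1.0))
```

A log Bayes factor between two models is then the difference of their log evidences. For ODE models, the exact Gaussian draws would be replaced by MCMC samples from each power posterior.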
References
- D. Schmidl, S. Hug, W. Li, M. Greiter and F. Theis. Bayesian model selection validates a biokinetic model for zirconium processing in humans. BMC Systems Biology 2012, 6:95. [ .pdf ]
Modeling and Bayesian Inference for Diffusions
Main contact at CMB: Christiane Fuchs
- Diffusion processes are a promising instrument to realistically model the time-continuous evolution of phenomena in biology as they combine the advantages of probabilistic models and differential equation models. However, both the correct approximation of such dynamics in terms of diffusion processes and the statistical inference for diffusions prove to be challenging in practice.
- In the CMB group we are actively involved in diffusion modelling and the development of Bayesian estimation techniques for diffusions [Dargatz 2010]. The application of diffusion processes to fluorescence microscopy data and single-cell data yields promising results and shows the potential of this approach.
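As a small illustration of how a diffusion combines a differential equation with randomness, the sketch below simulates a stochastic differential equation dX = drift(X) dt + diffusion(X) dW with the Euler-Maruyama scheme. The Ornstein-Uhlenbeck example and its parameter values are made up for illustration and are not taken from the group's applications; inference for diffusions, as discussed above, is a separate and harder problem than this forward simulation.

```python
import math
import random


def euler_maruyama(drift, diffusion, x0, dt, n_steps, rng):
    """Simulate dX = drift(X) dt + diffusion(X) dW on a time grid of width dt."""
    x = x0
    path = [x]
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_steps):
        dw = rng.gauss(0.0, sqrt_dt)          # Brownian increment, distributed N(0, dt)
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path


if __name__ == "__main__":
    # Ornstein-Uhlenbeck process: mean reversion towards 2.0 with constant noise level 0.5.
    rng = random.Random(42)
    path = euler_maruyama(lambda x: 1.0 * (2.0 - x), lambda x: 0.5, 0.0, 0.01, 1000, rng)
    print(path[-1])
```

Setting the diffusion coefficient to zero recovers the forward-Euler solution of the corresponding ordinary differential equation, which makes the link between the two model classes explicit.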

References
- C. Fuchs (2013). Inference for Diffusion Processes with Applications in Life Sciences. Springer, Heidelberg.
Mixture Models
Main contact at CMB: Fabian Theis
- A key aspect in the successful application of mixture models to data from bioinformatics is to control for robustness of the algorithm with respect to missing data and outliers. We have described and analysed a blind source separation method for this in [Theis et al., Proc. ICA 2010] by applying the concept of scatter matrices. In a parallel project, we have studied how the application of Bayesian matrix factorization methods may yield an improvement over frequentist approaches; indeed it turned out that the resulting methods are more robust against outliers, which allowed the analysis and interpretation of a genome-scale principal component analysis [Theis et al., Mol Biol and Evol, 2011].
- We have applied these mixture model and factorization techniques to the efficient clustering of large-scale heterogeneous interaction networks, derived either from experimental data or from text mining in collaboration with the BIS group, IBIS. Results on clustering a manually annotated gene-complex graph show significant homogeneity between gene clusters and the corresponding complex clusters [Hartsperger et al., BMC Bioinformatics, 2010]. We expect this tool to be valuable in the preprocessing of large-scale interaction data in order to build smaller-scale models.
References
- F. Theis, N. Latif, P. Wong and D. Frishman. Complex principal component and correlation structure of 16 yeast genomic variables. Molecular Biology and Evolution, accepted, 2011. doi:10.1093/molbev/msr077. [ DOI | PubMed | .pdf ]
- F. Theis, N. Müller, C. Plant and C. Böhm. Robust second-order source separation identifies experimental responses in biomedical imaging. In Proc. ICA 2010, volume 6365 of LNCS, pages 466-473. Springer, St. Malo, France, 2010. [ .pdf ]
- M. Hartsperger, F. Blöchl, V. Stümpflen and F. Theis. Structuring heterogeneous biological information using fuzzy clustering of k-partite graphs. BMC Bioinformatics. 2010, 11:522. [ .pdf ] [ webpage ]
