Machine Learning

Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at a cellular resolution. However, noise due to amplification and dropout may obstruct analyses, so scalable denoising methods for increasingly large but sparse scRNA-seq data are needed. We propose a deep count autoencoder network (DCA) to denoise scRNA-seq datasets. DCA takes the count distribution, overdispersion and sparsity of the data into account using a zero-inflated negative binomial noise model, and nonlinear gene-gene or gene-dispersion interactions are captured. Our method scales linearly with the number of cells and can therefore be applied to datasets of millions of cells. We demonstrate that DCA denoising improves a diverse set of typical scRNA-seq data analyses using simulated and real datasets. DCA outperforms existing methods for data imputation in quality and speed, enhancing biological discovery.
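To make the zero-inflated negative binomial (ZINB) noise model mentioned above concrete, here is a minimal sketch of its log-likelihood for a single count. The function names and the parameterization (mean `mu`, dispersion `theta`, dropout probability `pi`) are assumptions chosen for illustration and are not taken from the DCA code base; in an autoencoder setting these parameters would be predicted per gene and cell by the network.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu > 0 and dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a ZINB: with probability pi the
    count is a structural zero, otherwise it is drawn from the NB."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from dropout or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts sharing one parameter set."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts)
```

A denoising model of this kind would minimize `zinb_nll` (with per-entry parameters) instead of a squared error, which is what lets it treat overdispersion and excess zeros explicitly.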

Cellular development has traditionally been described as a series of transitions between discrete cell states, such as the sequence of double negative, double positive and single positive stages in T-cell development. Recent advances in single cell transcriptomics suggest an alternative description of development, in which cells follow continuous transcriptomic trajectories. A cell's state along such a trajectory can be captured with pseudotemporal ordering, which however is not able to predict development of the system in real time. We present pseudodynamics, a mathematical framework that integrates time-series and genetic knock-out information with such transcriptome-based descriptions in order to describe and analyze the real-time evolution of the system. Pseudodynamics models the distribution of a cell population across a continuous cell state coordinate over time based on a stochastic differential equation along developmental trajectories and random switching between trajectories in branching regions. To illustrate feasibility, we use pseudodynamics to estimate cell-state-dependent growth and differentiation of thymic T-cell development. The model approximates a developmental potential function (Waddington's landscape) and suggests that thymic T-cell development is biphasic and not strictly deterministic before beta-selection. Pseudodynamics generalizes classical discrete population models to continuous states and thus opens possibilities such as probabilistic model selection to single cell genomics.
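As a rough illustration of what it means to model a population distribution over a continuous cell state coordinate, the sketch below evolves a density `u(s, t)` under drift, diffusion and state-dependent growth using an explicit finite-volume scheme. The equation form, the constant drift `v` and diffusion `D`, the no-flux boundaries and all function names are assumptions made for this example; it covers a single trajectory without branching and is not the pseudodynamics implementation.

```python
def step_density(u, v, D, g, ds, dt):
    """One explicit time step for a density u on a 1D cell-state grid.

    v: drift speed (assumed constant, v >= 0), D: diffusion constant,
    g: list of growth rates per grid cell, ds: grid spacing, dt: time step.
    """
    n = len(u)
    # flux[j] is the flux across the interface between grid cells j-1 and j.
    # flux[0] and flux[n] stay zero, i.e. no cells leave the state space.
    flux = [0.0] * (n + 1)
    for i in range(n - 1):
        # Upwind drift plus central-difference diffusion.
        flux[i + 1] = v * u[i] - D * (u[i + 1] - u[i]) / ds
    return [
        u[i] + dt * (-(flux[i + 1] - flux[i]) / ds + g[i] * u[i])
        for i in range(n)
    ]


def simulate(u0, v, D, g, ds, dt, steps):
    """Apply step_density repeatedly and return the final density."""
    u = list(u0)
    for _ in range(steps):
        u = step_density(u, v, D, g, ds, dt)
    return u
```

With zero growth the scheme conserves the total number of cells, and a positive `g` in some region increases it there, which is the kind of cell-state-dependent growth effect the framework estimates from time-series data. The explicit scheme is only stable for small time steps, roughly `dt * (v / ds + 2 * D / ds**2) <= 1`.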

Single-cell RNA-seq allows quantification of biological heterogeneity across both discrete cell types and continuous cell differentiation transitions. We present approximate graph abstraction (AGA), an algorithm that reconciles the computational analysis strategies of clustering and trajectory inference by explaining cell-to-cell variation both in terms of discrete and continuous latent variables. This makes it possible to generate cellular maps of differentiation manifolds with complex topologies, efficiently and robustly across different datasets. Approximate graph abstraction quantifies the connectivity of partitions of a neighborhood graph of single cells, thereby generating a much simpler abstracted graph whose nodes label the partitions. Together with a random walk-based distance measure, this generates a topology-preserving map of single cells, a partial coordinatization of data useful for exploring and explaining its variation. We use the abstracted graph to assess which subsets of data are better explained by discrete clusters than by a continuous variable, to trace gene expression changes along aggregated single-cell paths through data and to infer abstracted trees that best explain the global topology of data. We demonstrate the power of the method by reconstructing differentiation processes with high numbers of branchings from single-cell gene expression datasets and by identifying biological trajectories from single-cell imaging data using a deep-learning based distance metric. Along with the method, we introduce measures for the connectivity of graph partitions, generalize random-walk based distance measures to disconnected graphs and introduce a path-based measure for topological similarity between graphs. Graph abstraction is computationally efficient and provides speedups of at least 30 times when compared to algorithms for the inference of lineage trees.
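The idea of collapsing a neighborhood graph of single cells into an abstracted graph over partitions can be sketched as follows. The connectivity score used here, the number of edges between two partitions divided by the number of possible cell pairs, is a deliberately simple stand-in and an assumption of this example; the connectivity measures introduced with AGA are more refined, and the function name is hypothetical.

```python
from collections import Counter


def abstract_graph(edges, labels):
    """Collapse a cell-level graph into a partition-level graph.

    edges: iterable of (cell_a, cell_b) pairs from a neighborhood graph.
    labels: dict mapping each cell to its partition (cluster) label.
    Returns a dict mapping sorted label pairs to a connectivity score in [0, 1].
    """
    sizes = Counter(labels.values())
    between = Counter()
    for a, b in edges:
        la, lb = labels[a], labels[b]
        if la != lb:
            # Count only edges that cross a partition boundary.
            between[tuple(sorted((la, lb)))] += 1
    return {
        pair: count / (sizes[pair[0]] * sizes[pair[1]])
        for pair, count in between.items()
    }
```

Partition pairs that do not appear in the result share no edges at all, so they would be candidates for being described as discrete, disconnected groups, while strongly connected pairs suggest a continuous transition.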


Application: Cell type atlas and lineage tree of a whole complex animal, Science (2018)

Batch Effect Quantification

The new test metric kBET allows assessment of batch-correction methods for single-cell RNA-seq

Batch effects in single-cell RNA-seq have a critical impact on the analysis as they may cover up biological signals. We have developed kBET (k-nearest neighbour batch effect test) to quantify any type of batch effect present in single-cell RNA-seq data. Using this tool, we have studied common normalisation and batch-correction approaches for their ability to remove batch effects and to preserve biological signals.
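The principle of a k-nearest-neighbour batch effect test can be sketched in a few lines: for each cell, compare the batch composition of its neighbourhood with the global composition using a chi-squared test, and report the fraction of cells where the test rejects. This sketch is restricted to two batches (labelled 0 and 1) so that the p-value can be computed with the standard library; the function names, the exhaustive neighbour search, testing every cell and excluding the cell itself from its neighbourhood are all assumptions of this example, not a description of the kBET package.

```python
import math


def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (brute force)."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return [j for _, j in dists[:k]]


def local_pvalue(points, batches, i, k):
    """Chi-squared test (1 degree of freedom) of the batch mix around cell i
    against the global batch mix. Assumes both batches are present."""
    frac1 = sum(batches) / len(batches)
    n1 = sum(batches[j] for j in knn_indices(points, i, k))
    n0 = k - n1
    e1, e0 = k * frac1, k * (1.0 - frac1)
    stat = (n1 - e1) ** 2 / e1 + (n0 - e0) ** 2 / e0
    # Survival function of the chi-squared distribution with 1 d.o.f.
    return math.erfc(math.sqrt(stat / 2.0))


def rejection_rate(points, batches, k, alpha=0.05):
    """Fraction of cells whose neighbourhood batch mix differs significantly
    from the global mix; values near 0 indicate well-mixed batches."""
    rejected = sum(
        1 for i in range(len(points))
        if local_pvalue(points, batches, i, k) < alpha
    )
    return rejected / len(points)
```

Applied before and after a batch-correction step, a drop in the rejection rate would indicate that the correction mixed the batches better, which is how such a metric can be used to compare correction methods.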

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. It is currently (December 2017) the only package that can tackle the recently exploding dataset sizes without subsampling, scaling to more than one million cells.

Wolf, Angerer & Theis, Genome Biology (2018) 19:15

We showed how to reconstruct continuous biological processes using deep learning for the examples of cell cycle and disease progression in diabetic retinopathy.

Eulenberg, Köhler, et al., Nature Communications (2017)

Inference of pseudotime for differentiating cells

  • Haghverdi, L.; Büttner, M.; Wolf, F.A.; Buettner, F.; Theis, F.J.: Diffusion pseudotime robustly reconstructs lineage branching. Nature Methods (2016)

Computational analysis of cell-to-cell heterogeneity in single-cell RNA-Sequencing data reveals hidden subpopulation of cells

Our single-cell latent variable model (scLVM) allows the identification of otherwise undetectable subpopulations of cells.

Publication: Nat. Biotechnol. 33, 155-160 (2015)

Linking of cellular morphology to function

A method based on imaging flow cytometry data and simple machine learning, which is currently being extended to deep learning:

  • Blasi, T., Hennig, H., Summers, H. D., Theis, F. J., Cerveira, J., Patterson, J. O., et al. (2015). Label-free cell cycle analysis for high-throughput imaging flow cytometry. Nature Communications 7, 1–9.

Diffusion maps and Destiny

  • Diffusion maps are a spectral method for non-linear dimension reduction, which destiny adapts for the visualization of single-cell expression data. These adaptations include a single-cell-specific noise model for missing and censored values (Haghverdi et al., Bioinformatics (2015)), an efficient nearest-neighbour approximation for the processing of hundreds of thousands of cells, and functionality for projecting new data onto existing diffusion maps. More information.
  • R/Bioconductor package
  • Paper
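The basic diffusion map construction described above can be sketched as follows: build a Gaussian kernel between cells, normalize it into a random-walk operator, and use its leading non-trivial eigenvector as a low-dimensional coordinate. This is a minimal Python sketch under stated assumptions (a fixed kernel width `sigma`, a dense kernel, power iteration with deflation, a hypothetical function name); it does not reproduce destiny, which is an R package and adds the noise model and nearest-neighbour approximation mentioned above.

```python
import math


def diffusion_component(points, sigma=1.0, iterations=200):
    """First non-trivial diffusion component for a small set of points."""
    n = len(points)
    # Gaussian kernel between all pairs of points.
    K = [
        [math.exp(-math.dist(p, q) ** 2 / (2.0 * sigma ** 2)) for q in points]
        for p in points
    ]
    deg = [sum(row) for row in K]
    # Symmetric normalization; shares eigenvalues with the random-walk matrix.
    S = [
        [K[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    # The top eigenvector (eigenvalue 1) is known in closed form.
    total = sum(deg)
    v1 = [math.sqrt(d / total) for d in deg]
    # Power iteration, projecting out v1 to converge to the second eigenvector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        v = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    # Map back to an eigenvector of the random-walk transition matrix.
    return [v[i] / math.sqrt(deg[i]) for i in range(n)]
```

For two well-separated groups of cells, the returned component takes opposite signs in the two groups; along a continuous trajectory it varies smoothly, which is why such components are useful for visualizing differentiation data.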

Application: Moignard, V., et al. (2015). Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nature Biotechnology 33(3), 269–276.

Gaussian Process Latent Variable Models for characterising heterogeneities of blood stem/progenitor cells

Approximate inference for performing GO enrichment studies based on cross-omics data

We use approximate inference methods (expectation propagation) to perform GO enrichment studies. We take information on different species into account simultaneously:

  • DNA methylation and gene expression
  • protein and gene expression
  • microRNA and gene expression

Software can be found here

Stochastic Profiling

Project Website

  • S.S. Bajikar*, C. Fuchs*, A. Roller, F.J. Theis°, K.A. Janes°: Parameterizing cell-to-cell regulatory heterogeneities via stochastic transcriptional profiles. PNAS 2014, 111(5), E626-635