Development of computational methods

Scanpy: large-scale single-cell gene expression data analysis

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. It is currently (December 2017) the only package that can tackle the recently exploding dataset sizes without subsampling, scaling to more than one million cells.

As more analysis tools are becoming available, it is becoming increasingly difficult to navigate this landscape and produce an up-to-date workflow to analyse one’s data. We detail the steps of a typical single-cell RNA-seq analysis, including pre-processing (quality control, normalization, data correction, feature selection, and dimensionality reduction) and cell- and gene-level downstream analysis. We have integrated these best-practice recommendations into a workflow, which we apply to a public dataset to further illustrate how these steps work in practice.
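The pre-processing steps listed above can be sketched on a toy count matrix. This is a minimal NumPy illustration, not the Scanpy implementation: the function name `preprocess` and all thresholds are hypothetical, data correction (for example batch-effect removal) is left out, and feature selection is reduced to a plain variance ranking.

```python
import numpy as np

def preprocess(counts, min_genes=2, target_sum=1e4, n_top_genes=3, n_components=2):
    """Toy pre-processing of a cells x genes count matrix.

    All defaults are hypothetical values chosen for illustration only.
    Assumes min_genes >= 1, so every retained cell has a non-zero total count.
    """
    counts = np.asarray(counts, dtype=float)

    # Quality control: keep cells that express at least `min_genes` genes.
    keep = (counts > 0).sum(axis=1) >= min_genes
    counts = counts[keep]

    # Normalization: scale each cell to the same total count, then log-transform.
    totals = counts.sum(axis=1, keepdims=True)
    logged = np.log1p(counts / totals * target_sum)

    # Feature selection: keep the genes with the highest variance across cells.
    top = np.sort(np.argsort(logged.var(axis=0))[::-1][:n_top_genes])
    selected = logged[:, top]

    # Dimensionality reduction: PCA via SVD of the centered matrix.
    centered = selected - selected.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    embedding = u[:, :n_components] * s[:n_components]
    return keep, top, embedding

# Usage on simulated counts (20 cells, 6 genes); cell 0 is empty and fails QC.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(20, 6))
counts[0] = 0
keep, top, embedding = preprocess(counts)
```

The returned `embedding` has one row per retained cell and could serve as input for downstream clustering or trajectory inference.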

Related Publications:


Generalizing RNA velocity to transient cell states through dynamical modeling

scVelo is a method that generalizes RNA velocity to systems with transient cell states by solving the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. scVelo identifies regimes of regulatory changes such as stages of cell fate commitment and, therein, systematically detects putative driver genes. Because tools like scVelo allow biologists to predict the future state of individual cells, they may help pave the way toward personalized treatments. For instance, scVelo has recently been used to analyze developing neutrophils in severe COVID-19 patients (Wilk et al. 2020).
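To make the splicing kinetics concrete, the sketch below integrates the two rate equations commonly used in RNA velocity models, du/dt = alpha - beta*u for unspliced and ds/dt = beta*u - gamma*s for spliced mRNA. It only simulates the dynamics for fixed, hypothetical rate constants; it does not reproduce scVelo's likelihood-based parameter inference, and the function names are illustrative.

```python
def simulate_splicing(alpha, beta, gamma, t_end=20.0, dt=0.01):
    """Forward-Euler integration of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s.

    alpha: transcription rate, beta: splicing rate, gamma: degradation rate.
    Returns a list of (time, unspliced, spliced) tuples, starting from u = s = 0.
    """
    u, s = 0.0, 0.0
    trajectory = [(0.0, u, s)]
    steps = int(round(t_end / dt))
    for i in range(1, steps + 1):
        du = alpha - beta * u
        ds = beta * u - gamma * s
        u += du * dt
        s += ds * dt
        trajectory.append((i * dt, u, s))
    return trajectory

def velocity(u, s, beta, gamma):
    """RNA velocity: the time derivative of the spliced mRNA abundance."""
    return beta * u - gamma * s

# Usage with hypothetical rates: induction from zero towards the steady state
# u = alpha / beta and s = alpha / gamma, where the velocity vanishes.
traj = simulate_splicing(alpha=2.0, beta=1.0, gamma=0.5, t_end=40.0)
t_final, u_final, s_final = traj[-1]
```

During induction the velocity is positive (spliced mRNA is still accumulating) and it approaches zero as the gene reaches steady state, which is why transient states carry directional information.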





CellRank - Probabilistic Fate Mapping using RNA Velocity

CellRank is a toolkit to uncover cellular dynamics based on scRNA-seq data with RNA velocity annotations. CellRank models cellular dynamics as a Markov chain, where transition probabilities are computed based on RNA velocity and transcriptomic similarity, taking into account uncertainty in the velocities and the stochastic nature of cell fate decisions.
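The idea of combining velocity and similarity into transition probabilities can be illustrated as follows. This is a simplified sketch under stated assumptions, not CellRank's implementation: it treats all other cells as candidate targets instead of using a nearest-neighbor graph, it ignores velocity uncertainty, and the function name and the parameters `softmax_scale` and `weight_velocity` are hypothetical.

```python
import numpy as np

def transition_matrix(X, V, softmax_scale=4.0, weight_velocity=0.8):
    """Row-stochastic transition matrix from cell positions X and velocities V.

    X, V: arrays of shape (n_cells, n_dims); requires at least two distinct cells.
    """
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)

    # Displacement from every cell i to every cell j: D[i, j] = X[j] - X[i].
    D = X[None, :, :] - X[:, None, :]
    dist = np.linalg.norm(D, axis=2)

    # Cosine similarity between the velocity of cell i and each displacement.
    denom = dist * np.linalg.norm(V, axis=1, keepdims=True)
    cos = np.einsum("ijk,ik->ij", D, V) / np.where(denom > 0, denom, 1.0)

    # Velocity kernel: softmax over target cells, self-transitions excluded.
    logits = softmax_scale * cos
    np.fill_diagonal(logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    T_vel = np.exp(logits)
    T_vel /= T_vel.sum(axis=1, keepdims=True)

    # Similarity kernel: Gaussian weights on distance, row-normalized.
    bandwidth = np.median(dist[dist > 0])
    T_sim = np.exp(-((dist / bandwidth) ** 2))
    np.fill_diagonal(T_sim, 0.0)
    T_sim /= T_sim.sum(axis=1, keepdims=True)

    # A convex combination of two row-stochastic matrices is row-stochastic.
    return weight_velocity * T_vel + (1.0 - weight_velocity) * T_sim

# Usage: four cells on a line, all velocities pointing in the +x direction.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
V = np.array([[1.0, 0.0]] * 4)
T = transition_matrix(X, V)
```

In this toy example, a cell is more likely to move to its right-hand neighbor than to its left-hand neighbor, because the velocity term favors targets that lie in the direction of the velocity vector.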


  • Lange et al. (bioRxiv, 2020 Preprint)


Scaling single-cell analysis to large cell numbers and deep-learning-based latent space estimation

Our software framework Scanpy comprehensively analyzes large-scale single-cell transcriptome datasets and is used by many labs as well as by the Human Cell Atlas, with more than 200k downloads on GitHub. It was the first framework to allow 'big data' analysis tasks, with a broad range of machine-learning and statistical methods, on single-cell gene expression data of millions of cells. In addition, we proposed a neural network model called “deep count autoencoder”, which extends autoencoders to noise models common in single-cell genomics, and applied it to denoising. Because it relies on efficient GPU-based parameter optimization, as is common for neural networks, it scales significantly better than existing approaches. We have recently extended these frameworks to large-scale data integration tasks. Leveraging advances in machine learning, we proposed single-cell architectural surgery (scArches), a deep learning strategy for mapping query datasets onto an integrated reference. It uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building, and the contextualization of new datasets with existing references without sharing raw data. Furthermore, to facilitate data sharing in analyses and collaborations, we developed Sfaira, a model and data repository that gives users access to single-cell datasets and pre-trained models via streamlined data loaders.
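As an illustration of what a count-specific noise model looks like, the sketch below evaluates the negative log-likelihood of a zero-inflated negative binomial (ZINB) distribution, a common choice for single-cell counts. Using the ZINB here is an assumption made for this example, since the text above does not name the noise model. In an autoencoder setting, the mean `mu`, dispersion `theta` and dropout probability `pi` would be network outputs and the loss would be this quantity summed over cells and genes; the function names are hypothetical.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial distribution.

    mu > 0 is the mean and theta > 0 the dispersion (variance = mu + mu**2 / theta).
    """
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a zero-inflated negative binomial.

    pi (0 <= pi < 1) is the probability of an additional "dropout" zero.
    """
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        likelihood = pi + (1.0 - pi) * nb
    else:
        likelihood = (1.0 - pi) * nb
    return -math.log(likelihood)

# Usage with hypothetical parameters: the probabilities over all counts sum to
# one, and the mean of the zero-inflated distribution is (1 - pi) * mu.
probs = [math.exp(-zinb_nll(x, mu=3.0, theta=2.0, pi=0.3)) for x in range(200)]
total = sum(probs)
mean = sum(x * p for x, p in enumerate(probs))
```

Minimizing this loss instead of a squared error lets the model treat excess zeros and overdispersion as properties of the measurement rather than as signal.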



AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution

AutoGeneS performs cell-type deconvolution of bulk RNA-seq data using an automatic gene selection technique that optimizes multiple criteria, such as minimizing the correlation and maximizing the distance between cell types. Using cellular deconvolution with AutoGeneS, we assessed the expression patterns of genes required for SARS-CoV-2 entry into cells, and their regulation by genetic, epigenetic and environmental factors, throughout the respiratory tract, using samples collected from the upper (nasal) and lower (bronchi) airways.
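The two selection criteria named above can be written down for a small example. The sketch scores every 3-gene subset of a toy cell-type-by-gene centroid matrix and keeps the Pareto-optimal subsets, that is, those for which no other subset has both lower correlation and larger distance. Exhaustive enumeration is used here purely for illustration and only works for tiny gene sets; it is not the search strategy of AutoGeneS, and all names and data are hypothetical.

```python
import numpy as np
from itertools import combinations

def objectives(centroids, genes):
    """Return (total pairwise correlation, total pairwise distance) between
    cell-type profiles restricted to the selected genes."""
    sub = centroids[:, genes]
    n_types = sub.shape[0]
    corr = np.corrcoef(sub)  # rows are cell types
    total_corr = corr[np.triu_indices(n_types, k=1)].sum()
    total_dist = sum(
        np.linalg.norm(sub[a] - sub[b]) for a, b in combinations(range(n_types), 2)
    )
    return float(total_corr), float(total_dist)

def pareto_front(scores):
    """Indices of non-dominated entries; lower correlation and higher distance are better."""
    front = []
    for i, (c_i, d_i) in enumerate(scores):
        dominated = any(
            c_j <= c_i and d_j >= d_i and (c_j < c_i or d_j > d_i)
            for j, (c_j, d_j) in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Usage: 3 cell types, 6 genes, simulated mean expression profiles.
rng = np.random.default_rng(1)
centroids = rng.gamma(2.0, 1.0, size=(3, 6))
subsets = list(combinations(range(6), 3))
scores = [objectives(centroids, list(g)) for g in subsets]
front = pareto_front(scores)
pareto_subsets = [subsets[i] for i in front]
```

Because the two objectives generally conflict, the result is a set of trade-off solutions rather than a single best gene set; one of them would then be chosen as the signature for deconvolution.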



  • Aliee et al. (bioRxiv, 2020 Preprint)
  • Aliee et al. (medRxiv, 2020 Preprint)




Estimating and visualizing lineages in single cell genomics


To understand heterogeneity in single-cell genomics, linear or nonlinear dimension reduction methods are commonly used for an unbiased visualization of potential subgroups in the data. To this end, we have used both Bayesian methods (Gaussian-Process Latent Variable Modeling) and nearest-neighbor graph-based methods (diffusion maps). We implemented diffusion maps in a toolbox (Destiny) to allow for fast visualization by other labs. Adaptations of the model include a single-cell-specific noise model for missing and censored values, an efficient nearest-neighbor approximation for processing hundreds of thousands of cells, and functionality for projecting new data onto existing diffusion maps.

Based on the idea of diffusion transitions to close-by cells, we developed a framework for pseudotemporal ordering of scRNA-seq data, including in the context of branching, which we showed to outperform existing methods such as Monocle and Wanderlust. More recently, we proposed partition-based graph abstraction (PAGA), which extends these approaches to deep and complex lineage estimation by summarizing large-scale scRNA-seq datasets using coarse-grained neighborhood graphs. Together with the Rajewsky lab, we were thus able to estimate a lineage tree for a whole organism (planaria). Furthermore, we applied these methods to gain new biological insights, such as a detailed understanding of intestinal stem cell (ISC) self-renewal and differentiation, with the aim of better treating chronic intestinal diseases, by integrating time-resolved lineage labelling with genome-wide and targeted single-cell gene expression analysis.

To integrate previous mechanistic models and to model population dynamics in a time-resolved manner, we developed the 'Pseudodynamics' model. Pseudodynamics models population distribution shifts across trajectories to quantify selection pressure, population expansion, and developmental potentials. Our most recent work, scVelo, enables us to fit a dynamic model per gene (often with several million parameters) to estimate not only the position in gene expression space but also the direction of the cell trajectory, extending the popular RNA velocity concept.
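The diffusion map idea used in this section can be sketched in a few lines. The example below is a basic textbook variant under stated assumptions: a dense Gaussian kernel with a fixed, hypothetical bandwidth `sigma`, no density normalization, no censoring or missing-value noise model, and no nearest-neighbor approximation, so it is not the Destiny implementation.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, n_components=2):
    """Basic diffusion map of the rows of X.

    Returns the leading non-trivial eigenvalues and eigenvectors of the
    transition matrix T = D^-1 K built from a Gaussian kernel K.
    """
    X = np.asarray(X, dtype=float)

    # Gaussian kernel on pairwise squared Euclidean distances.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dist / (2.0 * sigma ** 2))

    # The symmetric matrix D^-1/2 K D^-1/2 has the same eigenvalues as T
    # and can be decomposed with a stable symmetric eigensolver.
    d = K.sum(axis=1)
    M = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Convert to right eigenvectors of T; skip the first, constant one
    # (eigenvalue 1), which carries no structure.
    psi = eigvecs / np.sqrt(d)[:, None]
    return eigvals[1:n_components + 1], psi[:, 1:n_components + 1]

# Usage: 30 cells placed along a one-dimensional trajectory in 2D.
t = np.linspace(0.0, 1.0, 30)
X = np.column_stack([t, 2.0 * t])
eigvals, components = diffusion_map(X, sigma=0.2)
first_component = components[:, 0]
```

For cells arranged along a single trajectory, the first diffusion component varies monotonically along it (up to sign), which is the property that diffusion-based pseudotemporal ordering builds on.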