scRNA

Python framework for single-cell RNA-seq clustering with special focus on transfer learning.

scRNA - Transfer learning for clustering single-cell RNA-Seq data

This package contains methods for generating artificial data, clustering, and transferring knowledge from a source to a target dataset.

This software package is developed by Nico Goernitz, Bettina Mieth, Marina Vidovic, and Alex Gutteridge.

Publication

The Python framework and this website are part of a publication currently under peer review at Nature Scientific Reports. Links to the published paper and online supplementary material will be included here once available.

Abstract: In many research areas scientists are interested in clustering objects within small datasets while making use of prior knowledge from large reference datasets. We propose a method to apply the machine learning concept of transfer learning to unsupervised clustering problems and show its effectiveness in the field of single-cell RNA sequencing (scRNA-Seq). The goal of scRNA-Seq experiments is often the definition and cataloguing of cell types from the transcriptional output of individual cells. To improve the clustering of small disease- or tissue-specific datasets, for which the identification of rare cell types is often problematic, we propose a transfer learning method to utilize large and well-annotated reference datasets, such as those produced by the Human Cell Atlas. Our approach modifies the dataset of interest while incorporating key information from the larger reference dataset via Non-negative Matrix Factorization (NMF). The modified dataset is subsequently provided to a clustering algorithm. We empirically evaluate the benefits of our approach on simulated scRNA-Seq data as well as on publicly available datasets. Finally, we present results for the analysis of a recently published small dataset and find improved clustering when transferring knowledge from a large reference dataset.
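The core idea of the abstract (learn an NMF dictionary on the large source dataset, re-express the small target dataset through it, then hand the modified data to a clustering algorithm) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation shipped with this package: it uses plain multiplicative-update NMF with NumPy, and the function names, the `mix` parameter, and the random toy matrices are all hypothetical.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative X (genes x cells) as W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps  # gene dictionary (genes x k)
    H = rng.random((k, X.shape[1])) + eps  # cell loadings (k x cells)
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def fit_to_dictionary(X, W, n_iter=200, seed=0, eps=1e-9):
    """Keep the dictionary W fixed and learn only the cell loadings H for X."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return H


def transfer(X_target, W_source, mix=0.5):
    """Blend the target data with its reconstruction from the source dictionary.

    `mix` is a hypothetical knob: 0.0 returns the target unchanged,
    1.0 returns only the source-informed reconstruction.
    """
    H_target = fit_to_dictionary(X_target, W_source)
    return mix * (W_source @ H_target) + (1.0 - mix) * X_target


# Toy data standing in for real expression matrices (assumption).
rng = np.random.default_rng(1)
X_source = rng.random((30, 200))  # 30 genes x 200 source cells
X_target = rng.random((30, 20))   # the same 30 genes x 20 target cells

W_source, _ = nmf(X_source, k=4)
X_modified = transfer(X_target, W_source, mix=0.5)
print(X_modified.shape)  # (30, 20)
```

The modified matrix `X_modified` keeps the shape of the target data, so any clustering algorithm that accepts the original target matrix can be applied to it directly.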

News

Getting started

Installation

We assume that Python is installed and that the pip command is callable from the command line. If starting from scratch, we recommend installing the Anaconda open data science platform (with Python 3), which comes with many of the most useful packages for scientific computing.

The scRNA software package can be installed with the command pip install git+https://github.com/nicococo/scRNA.git. After successful completion, three command line scripts will be available (macOS and Linux only):

Example

Step 1: Install the package with pip install git+https://github.com/nicococo/scRNA.git

Step 2: Check that the scripts are available.
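One way to run this check from Python is to look each script up on the PATH. The sketch below is only an illustration: the helper name `missing_scripts` is hypothetical, and the list contains just the one script name that appears in this text (scRNA-generate-data.sh), so the remaining script names would have to be added by hand.

```python
import shutil


def missing_scripts(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Only this script name is mentioned in the text; extend the list as needed.
expected = ["scRNA-generate-data.sh"]

missing = missing_scripts(expected)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed scripts are on PATH.")
```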

Step 3: Create the directory /foo and change into it. Generate some artificial data by calling scRNA-generate-data.sh without arguments, so that only default parameters are used.
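The same step can be scripted from Python. This is a hedged sketch rather than part of the package: the helper `run_in_directory` is hypothetical, a relative directory `foo` stands in for /foo, and the script is called without arguments because the text only describes the default-parameter call.

```python
import shutil
import subprocess
from pathlib import Path


def run_in_directory(command, workdir):
    """Create workdir if needed, run command inside it, and return the exit code."""
    path = Path(workdir)
    path.mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(command, cwd=path, check=False)
    return completed.returncode


SCRIPT = "scRNA-generate-data.sh"
WORKDIR = "foo"  # relative stand-in for the /foo directory of Step 3

if shutil.which(SCRIPT) is None:
    print(f"{SCRIPT} is not on PATH; install the package first.")
else:
    # No arguments are passed, so the script uses its default parameters.
    exit_code = run_in_directory([SCRIPT], WORKDIR)
    print("exit code:", exit_code)
```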

Generate artificial data

This will result in a number of files:

Step 4: Run NMF on the source data using the provided gene IDs and source data. That is, we want to turn off the cell and gene filters as well as the log transformation. You can provide source labels to be used as a starting point for NMF. If you do not, those labels will be generated via NMF clustering. Potential problems:

Cluster the source data

This will result in a number of files:

Step 5: Now it is time to cluster the target data and transfer knowledge from the source model to the target data. To do so, we need to choose one of the source data models generated in Step 4. In this example, we pick the model with 8 clusters (src_c8.npz).
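Before passing a source model on, it can help to look at what the .npz archive contains. The sketch below assumes only that src_c8.npz is a standard NumPy archive; the helper name `describe_npz` is hypothetical, and no array names are assumed because the text does not list them. If the archive stores pickled objects, `np.load` refuses to read them by default and reading that entry raises an error.

```python
from pathlib import Path

import numpy as np


def describe_npz(path):
    """Map each array name stored in an .npz archive to its shape."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


model_path = Path("src_c8.npz")  # the model file named in Step 5
if model_path.exists():
    for name, shape in describe_npz(model_path).items():
        print(name, shape)
else:
    print(f"{model_path} not found; run Step 4 first.")
```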

Cluster the target data

This results in a number of files (one set for each value in the cluster range).

In addition, there is a summarizing .png figure of all accuracy measures (accs) and a t-SNE plot with the real target labels, if they were provided.

Cluster the target data

The command line output shows a number of results: unsupervised and supervised accuracy measures. If no ground truth labels are given, the supervised measures remain 0.
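To illustrate what a supervised measure looks like and why it stays at 0 without ground truth, here is a small standard-library sketch. The text does not name the measures the scripts report, so the adjusted Rand index is used purely as an assumption (it is one common supervised clustering score); `supervised_score` is a hypothetical wrapper that mimics the "remains 0" behaviour described above.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs_total = comb(len(labels_true), 2)
    if pairs_total == 0:
        return 1.0  # fewer than two cells: nothing to disagree on
    # Count the pairs of cells that share a cluster in both, or in one, labeling.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / pairs_total
    maximum = 0.5 * (same_true + same_pred)
    if maximum == expected:
        return 1.0  # both partitions are trivial and therefore identical
    return (same_both - expected) / (maximum - expected)


def supervised_score(labels_pred, labels_true=None):
    """Return 0.0 when no ground truth labels are available."""
    if labels_true is None:
        return 0.0
    return adjusted_rand_index(labels_true, labels_pred)


print(supervised_score([1, 1, 0, 0], labels_true=[0, 0, 1, 1]))  # 1.0
print(supervised_score([1, 1, 0, 0]))  # 0.0
```

The first call scores 1.0 because the two labelings describe the same partition with the cluster names swapped, and the index ignores cluster names.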

Example application

Using Jupyter notebooks, we showcase the main workflow as well as the abilities of the application. The main features are

The Jupyter notebook can be accessed under https://github.com/nicococo/scRNA/blob/master/notebooks/example.ipynb

Replicating experiments

In the course of our research (Mieth et al., see references below), we investigated the performance of the proposed method in comparison with the most important baseline methods: first in a simulation study on generated data, second on subsampled real data (Tasic et al.), and finally on two independent real datasets (Hockley et al. and Usoskin et al.). We have also shown that batch effect removal approaches (Butler et al.) and imputation methods (Van Dijk et al.) can be used to further improve clustering results when applying our method.

To fully reproduce the experiments of our study you can find the corresponding scripts at the following links:

To produce the figures of the paper, go to:

To evaluate the robustness experiments that produce Table 1 of the paper, go to:

Parameter Selection

All pre-processing parameters of the experiments presented in the paper can be found in the corresponding scripts (above) and in the online supplementary material (for the generated datasets in Supplementary Section 2.1, for the Tasic data in Supplementary Section 3.1, and for the Hockley and Usoskin datasets in Supplementary Section 4.1). Details on all other parameters of the respective datasets can also be found in the scripts or in the corresponding sections of the online supplementary material (Supplementary Sections 2.2, 3.2 and 4.2, respectively).

Data availability

The datasets analyzed during the current study are available in the following GEO repositories: