Chapter 2 Preprocess

In this chapter, we deal with empty droplets, ambient RNAs and doublets, especially in snRNA-seq setting.

2.1 Empty droplets

We have no prior knowledge about whether a particular library (i.e., cell barcode) corresponds to cell-containing or empty droplets. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced, resulting in non-zero counts for libraries that do not contain any cell. The aim of empty droplets detection in most cases is to retains nucleus that would have been discarded by other threshold on total counts. Therefore, could use looser thresholds after this step.

2.1.1 Statistical Modeling

Cell Ranger v3.0 and DropletUtils implemented utils to distinguish empty droplets. The algorithm has three key steps:

filter out genes with counts of zero for all barcodes
estimate the profile of the ambient RNA pool (as gene count across barcodes with total counts < 100 in DropletUtils or estimate expected number of recovered cells by OrdMag alg. in Cellranger)
test each barcode for deviations from this profile using a Dirichlet-multinomial model of UMI count sampling
- Barcodes with significant deviations are considered to be genuine cells
- Use a knee point filter to ensure that barcodes with large total counts are always retained

Cellranger uses an expected-cells and DropUtils uses by.rank as initial “cells”. But with the sample initial top number of cells, two implementations give slightly different retained cells. And I don’t get why this happens.

In cellranger 7.0, it estimates expected number of recovered cells by OrdMag alg. But I didn’t get to use this method.
Cellranger doesn’t explicitly tell what FDR cutoff it uses. But retained number of nucleus fell into the range between 0.001 to 0.05 of DropUtils.
DropUtils uses barcodes with total counts < 100 as initial empty drops, but this failed with the tested datasets, evident (e.g.) as in FDR distribution:

2.1.2 Deep Learning

Cellbender provides a deep generative model of background-contaminated counts. The model is used for learning the background RNA profile, distinguishing cell-containing droplets from empty ones, and retrieving background-free gene expression profiles.
The model schema and network representation can be found in Figure 1g,h
This method has been benchmarked with UMI counts threshold and DecontX but not DropletUtils

2.1.3 Could Cellbender retain more good nucleus?

YES
I summarized the general stats from the above mentioned methods with three in-house 10x snRNA-seq data:

Thresholds filtering is necessary and the most useful data clean method.
It doesn’t matter whether by detecting outliers or just setting hard cut-offs.

2.2 Ambient contamination

I tested the combinations of the aforementioned empty droplets and ambient RNA detection methods (See diagram) with our in-house data and I’m convinced that Cellbender and cluster-based methods are of advantage.

In snRNA-seq, mtRNAs are natural controllers for ambient RNAs and thus can be leveraged in benchmarking.