snRNA-seq based biomedical research

Case-control studies are commonly adopted in biomedical research to discover risk factors associated with diseases. The design is especially suitable for pioneering studies of diseases whose mechanisms are not yet clear enough to design randomized trials, for rare diseases, and for biomedical laboratories that cannot recruit enough cases to conduct cohort or longitudinal studies [levin2005].

The advent of omics-based strategies has pushed the investigation of risk factors to a more precise and personalized molecular level. The transcriptome, for example, reflects the functional states of tissues and cells that are linked to disease, and helps to evaluate which expressed genes are specifically associated with a phenotype in key tissues. These strategies depend on the accuracy and resolution of both data generation and data analysis, i.e., on new sequencing technologies and customized analysis approaches.

The confounded composition of cell types in bulk sequencing data has long hindered precise identification of cellular and molecular targets associated with disease phenotypes. Tissues catalogued in the Human Protein Atlas comprise on average over ten cell types [Alam2021]. In case-control studies, it is essential to distinguish disease-affected cells from healthy functional cells, or from immune populations that modulate disease progression. The advent of single-cell RNA sequencing (scRNA-seq) and single-nucleus RNA sequencing (snRNA-seq) since the first study in 2009 [Tang2009] has transformed this landscape. These technologies quantify transcript abundance at single-cell or single-nucleus resolution, capturing signatures of cell identity, functional states, individual variation, and responses to treatment, pathology, or experimental perturbation.

Given ergodic principles [Wergeland1958], high-coverage sampling across many cells at a single timepoint can approximate the distribution of transcriptomic states over short timescales, enabling pseudotemporal ordering [Trapnell2014]. This temporal resolution is further refined by incorporating transcriptional [LaManno2018] or post-transcriptional kinetics [Xu2023]. Compared to scRNA-seq, snRNA-seq does not require intact cells during dissociation, making it particularly suited to archival biobank specimens in retrospective case-control designs.

Data modalities are also closely tied to sequencing platforms and technologies. The scRNA-seq revolution began with single-cell cDNA amplification and sequencing on the SOLiD platform [Tang2009], and a variety of single-cell isolation, library construction, and sequencing technologies are now available or under active development. The most commonly used workflow is 3' whole-transcriptome gene expression profiling, exemplified by 10x Genomics Chromium. This platform employs microfluidic technology for single-cell isolation [Marcus2006], encapsulating barcoded primer beads within droplets to generate sequencing libraries.

To account for amplification duplicates, unique molecular identifiers (UMIs) are incorporated into each molecule during initial library preparation. The resulting libraries are compatible with next-generation sequencing on Illumina platforms, which output FASTQ files containing nucleotide sequences and associated quality scores [Cock2010]. Reads from Illumina sequencers typically include a unique instrument identifier in the read header. Subsequently, reads are aligned to a reference genome and, guided by gene annotation files, those mapping confidently to the transcriptome and uniquely to a single gene are identified. The distinct UMIs of these reads are then counted per cell barcode and gene to generate a UMI count matrix for downstream analysis.
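
The counting step can be illustrated with a minimal Python sketch. It assumes reads have already been aligned and assigned to a gene, and that each read is reduced to a (cell barcode, UMI, gene) record. The records, barcodes, gene names, and the `count_umis` helper are hypothetical. Production pipelines also correct sequencing errors in barcodes and UMIs, which is omitted here.

```python
from collections import defaultdict

# Hypothetical aligned-read records: (cell_barcode, umi, gene).
# In a real pipeline these would be parsed from the aligner's tagged output.
reads = [
    ("AAAC", "TTGCA", "GeneA"),
    ("AAAC", "TTGCA", "GeneA"),  # amplification duplicate: same barcode, UMI, gene
    ("AAAC", "GGATC", "GeneA"),
    ("AAAC", "CCTAG", "GeneB"),
    ("CCGT", "TTGCA", "GeneA"),  # same UMI in a different cell is a separate molecule
]


def count_umis(records):
    """Return {(cell_barcode, gene): number of distinct UMIs}."""
    seen = defaultdict(set)
    for barcode, umi, gene in records:
        # A set keeps each UMI once, so duplicated reads collapse to one molecule.
        seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


counts = count_umis(reads)
for (barcode, gene), n in sorted(counts.items()):
    print(barcode, gene, n)
```

Each value in `counts` is one entry of the UMI count matrix, indexed by cell barcode and gene.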

Despite remarkable progress in sequencing technology, which now produces omics data from single cells or nuclei to build atlases of the common characteristics of cell types in an organism, case-control studies that leverage barcode-level resolution remain challenging, with fewer than 100 research articles published since 2018 (Fig. 1a). The Human Cell Atlas [regev2017] provides 111 datasets covering 80 diseases. If the scope is extended to include treatment studies, 155 NIH BioProjects are registered with scRNA-seq or snRNA-seq performed on patient samples or patient-derived cell systems (Fig. 1) (source data). Such studies are costly in both the library preparation and sequencing steps (Table 1). Given a limited budget, the number of samples, the number of cells per sample, and the number of reads per cell must therefore be chosen carefully to reach the statistical power required by the specific aim of the study [lafzi2018]. Moreover, the medical insights from these high-cost studies are limited by the correctness of the data analysis in decomposing the variance due to individual or batch differences, and by how deeply the analysis exploits the rich sampling of nuclei or cells. Numerous computational strategies have been established to address these difficulties, from study design and noise removal to probing biological questions such as differential expression or expression quantitative trait loci (eQTL).

Statistical distributions of count data.

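Sequencing a cDNA library to depth $N$ can be conceptualized as repeatedly sampling $N$ molecules, where each RNA species is drawn with probability $p$ proportional to its abundance. The draws are independent, so the count $n$ of a given species is an integer-valued binomial variable. When $N$ is large and $p$ is small, this distribution converges toward a Poisson distribution with mean $Np$. However, a Poisson distribution forces the variance to equal the mean, whereas biological and technical heterogeneity typically make the variance exceed the mean. The negative binomial (NB) distribution, which has a separate dispersion parameter, is therefore more commonly employed to model scRNA-seq data. In its classical formulation, the NB distribution gives the probability of observing a given number of unsuccessful sampling attempts before a fixed number of successful detections of a particular RNA species. For count data it is usually reparameterized by a mean and a dispersion, which is equivalent to a Poisson distribution whose rate varies from cell to cell following a gamma distribution. In either view, the modeled count $n$ corresponds to the entry in the count matrix.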
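
Both points can be checked numerically. The sketch below uses only the Python standard library. The depth, abundance, mean, and dispersion values are assumptions chosen for illustration, and the gamma-Poisson mixture is one standard way to generate NB-distributed counts, not a method taken from the cited studies.

```python
import math
import random


def binom_pmf(k, n, p):
    """Probability of k successes in n independent draws."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)


# Assumed sequencing depth and relative abundance of one RNA species.
N, p = 100_000, 2e-5
lam = N * p  # expected count

# Largest pointwise gap between the binomial and Poisson probabilities.
max_gap = max(abs(binom_pmf(k, N, p) - poisson_pmf(k, lam)) for k in range(15))


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small means."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


# Gamma-Poisson mixture: each cell gets its own rate, which yields NB counts.
rng = random.Random(0)
mu, theta = 2.0, 0.5  # assumed mean and NB size (inverse dispersion)
counts = [
    sample_poisson(rng.gammavariate(theta, mu / theta), rng)  # gamma mean = mu
    for _ in range(20_000)
]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)

print(f"max binomial-Poisson gap: {max_gap:.2e}")
print(f"mixture mean: {mean:.2f}, variance: {var:.2f}")
```

Under this parameterization the theoretical variance is $\mu + \mu^2/\theta$, so the simulated variance should sit well above the mean, whereas a pure Poisson sample would have the two roughly equal.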

Goodness-of-fit evaluations reveal the applicability of these distributions. In 12 of 18 examined 10x Genomics datasets, the NB distribution outperformed the Poisson, zero-inflated Poisson, and zero-inflated negative binomial (ZINB) distributions [Vieth2017powsimR]. However, in certain UMI-based datasets, a Poisson distribution may provide an adequate fit [Ziegenhain2017]. Notably, scRNA-seq data often exhibit zero inflation [Qiu2020], whereas UMI-based quantification typically attenuates this feature [Svensson2021]. In some cases, the ZINB model improves goodness-of-fit by only 1.6% [Vieth2017powsimR]. Researchers should therefore carefully examine the distribution of gene expression values in their specific datasets, accounting for dropout events, before selecting a distribution for modeling and constructing background controls. Furthermore, when choosing normalization methods, one must consider the raw data distribution to ensure that it matches the assumptions required for downstream analyses.
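
As a rough illustration of such a check, the sketch below compares a Poisson and an NB fit for a single gene. The counts and the `compare_fits` helper are hypothetical. The NB size parameter is estimated by the method of moments rather than by maximum likelihood, so the reported AIC is only approximate, and this is not the procedure used in the cited benchmarks.

```python
import math


def poisson_loglik(counts, mu):
    return sum(-mu + c * math.log(mu) - math.lgamma(c + 1) for c in counts)


def nb_loglik(counts, mu, theta):
    """NB parameterized by mean mu and size theta (variance = mu + mu^2/theta)."""
    return sum(
        math.lgamma(c + theta) - math.lgamma(theta) - math.lgamma(c + 1)
        + theta * math.log(theta / (theta + mu))
        + c * math.log(mu / (theta + mu))
        for c in counts
    )


def compare_fits(counts):
    n = len(counts)
    mu = sum(counts) / n
    var = sum((c - mu) ** 2 for c in counts) / (n - 1)
    fit = {
        "mean": mu,
        "var": var,
        "observed_zero": counts.count(0) / n,
        "expected_zero_poisson": math.exp(-mu),
        "aic_poisson": 2 * 1 - 2 * poisson_loglik(counts, mu),
    }
    if var > mu:  # NB moment estimate only exists for overdispersed data
        theta = mu * mu / (var - mu)
        fit["theta"] = theta
        fit["expected_zero_nb"] = (theta / (theta + mu)) ** theta
        fit["aic_nb"] = 2 * 2 - 2 * nb_loglik(counts, mu, theta)
    return fit


# Hypothetical UMI counts for one gene across 20 cells.
gene_counts = [0, 0, 0, 1, 0, 2, 0, 0, 5, 0, 1, 0, 0, 9, 0, 0, 3, 0, 0, 1]
fit = compare_fits(gene_counts)
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

If the observed zero fraction is close to the NB expectation, as it is for these illustrative counts, an explicit zero-inflation component may add little. A lower AIC indicates the better-supported model.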
