Probabilistic Mapping of Single-Cell Transcriptomes to Bulk Tissue Profiles
Challenge: Integrating unlabeled single-cell RNA-seq (scRNA-seq) data from mixed human tissues with labeled bulk RNA-seq from those tissues requires methods that can probabilistically assign cells or cell types to their tissue of origin. Since the tissues are closely related (sharing many cell types), the mapping must allow overlapping cell-type profiles – i.e. the same cell type can exist in multiple tissues – and return probability or fraction assignments rather than hard labels. Below we outline computational strategies and tools (favoring Python implementations or custom pipelines) for tackling this integration, including deconvolution approaches, spatial mapping algorithms, and latent manifold alignment techniques. We highlight methods suited for tissues with similar cell compositions and note those providing probabilistic outputs (e.g. cell type fractions or posterior probabilities).
Deconvolution of Bulk RNA-seq Using Single-Cell References
One common strategy is cell-type deconvolution of bulk tissue RNA-seq using single-cell data as a reference. These methods treat bulk expression as a mixture of cell-type-specific signals and aim to infer the proportion of each cell type in each tissue sample. The output is essentially a probability distribution of cell types per tissue (and by extension, one can infer the probability a given cell type or cell belongs to a particular tissue). Modern deconvolution tools often account for cross-tissue differences in gene expression and provide uncertainty estimates. Notable methods include:
-
CIBERSORTx: An extension of the CIBERSORT framework, designed to use scRNA-seq signatures for bulk deconvolution. CIBERSORTx can estimate both cell type abundances and cell-type-specific gene expression profiles from bulk data. It uses machine learning (support vector regression) to fit bulk expression as a combination of reference profiles. In “High-Resolution” mode, CIBERSORTx even imputes each cell type’s expression within the bulk tissue. This tool yields fractional abundances (0–100%) for each cell type per sample, naturally handling shared cell types by assigning nonzero fractions across multiple tissues.
-
MuSiC (Multi-Subject Single-Cell Deconvolution): An R package focusing on scenarios with multiple related tissue samples. MuSiC leverages a multi-subject single-cell reference, weighting genes by their consistency across donors, to improve deconvolution in complex tissues. Importantly, MuSiC was shown to outperform other methods “especially for tissues with closely related cell types.” This makes it well-suited for overlapping cell profiles (e.g. placental regions or ocular tissues with similar cell constituents). While MuSiC is R-based, the underlying concept (a weighted linear model that borrows strength from multiple samples) could be implemented in Python (e.g. using mixed-effects models or by computing gene weights and applying non-negative least squares).
-
BayesPrism: A Bayesian deconvolution tool that explicitly models differences between the single-cell reference and bulk data. BayesPrism treats the scRNA-seq reference as a prior and computes a posterior distribution over cell type fractions for each bulk sample. By marginalizing over uncertainty in gene expression for each cell type, it can accommodate the scenario where a given cell type’s expression in bulk may differ from the reference (due to tissue-specific effects). The output is a posterior probability for each cell type fraction (with credible intervals), providing a probabilistic assignment. BayesPrism’s Bayesian framework naturally handles shared cell types across tissues by allowing that cell type’s fraction to be non-zero in multiple samples, with uncertainty estimates reflecting the confidence. (BayesPrism was published in Nature Cancer 2022, and although its implementation is R-based, its approach could inspire a Python/Stan/Pyro equivalent.)
-
Scaden (Single-Cell Assisted Deconvolutional Network): A deep learning approach implemented in Python. Scaden uses an ensemble of neural networks trained on single-cell data to predict cell type composition in bulk samples. The networks learn a representation of gene expression that is robust to technical noise and dropout, and output the fraction of each cell type. As the authors describe, “Scaden, a deep neural network for cell deconvolution, is trained on scRNA-seq data… We demonstrate that Scaden outperforms existing deconvolution algorithms in both precision and robustness, across tissues and species.” Because it is a data-driven model, Scaden can implicitly capture subtle gene expression differences of a cell type in different tissues (if such patterns are present in the training data). It produces a vector of cell-type proportions (summing to 1) for each bulk sample. Scaden is available as a Python package, making it easy to integrate into custom pipelines.
-
TAPE (Tissue-Adaptive Autoencoder): A recently proposed deep learning method that connects bulk and single-cell data via an interpretable autoencoder framework. TAPE uses a neural network to decompose bulk expression into cell-type contributions while simultaneously adjusting each cell type’s gene expression to be “tissue-adaptive.” As described in its introduction: “TAPE can predict cell-type fractions and cell-type-specific gene expression tissue-adaptively. Compared with popular methods… TAPE has better overall performance… and is more robust among different cell types.” TAPE is therefore designed for scenarios where cell types are shared across tissues but may exhibit gene expression shifts per tissue. It yields both the proportion of each cell type per sample and an adjusted expression profile for each cell type in that tissue. As a deep learning method (its paper is from 2022), TAPE is likely distributed as Python code, though this should be confirmed against its repository.
-
Basic Linear Deconvolution (Custom NNLS): In a custom Python pipeline, one could perform a simple non-negative least squares (NNLS) regression for each bulk sample to estimate cell type proportions. This involves taking average expression profiles for each cell type (e.g. from clustering the scRNA-seq by cell type) and solving for the weights that best reconstruct the bulk tissue profile (constraining weights ≥ 0). This is a simplified version of what methods like CIBERSORT and MuSiC formalize, and it can be implemented with sklearn.linear_model (LinearRegression with positive=True, which constrains the coefficients to be non-negative), with scipy.optimize.nnls, or with CVXOPT for more general constrained optimization. While basic, it provides a baseline set of weights that, once normalized to sum to 1, can be read as cell-type fractions per tissue. However, it may require careful normalization and perhaps feature selection (e.g. using known marker genes) to perform well. More advanced matrix factorization techniques can augment this; for example, integrative NMF can be used to learn latent expression programs common to single-cell and bulk data.
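The NNLS baseline described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not a substitute for the published tools: the function name nnls_deconvolve, the simulated signature matrix, and the 2% noise level are all assumptions made for the example. Only scipy.optimize.nnls is a real library call.

```python
import numpy as np
from scipy.optimize import nnls


def nnls_deconvolve(signatures, bulk):
    """Estimate cell-type fractions for each bulk sample (hypothetical helper).

    signatures: (n_genes, n_cell_types) mean expression per cell type
    bulk:       (n_genes, n_samples) bulk expression, same gene order
    Returns an (n_samples, n_cell_types) array whose rows sum to 1.
    """
    n_samples = bulk.shape[1]
    fractions = np.zeros((n_samples, signatures.shape[1]))
    for j in range(n_samples):
        weights, _ = nnls(signatures, bulk[:, j])  # weights are >= 0
        total = weights.sum()
        # Normalize to fractions; keep zeros if nothing was fitted.
        fractions[j] = weights / total if total > 0 else weights
    return fractions


# Synthetic check: 200 genes, 4 cell types, 3 tissues with known mixtures.
rng = np.random.default_rng(0)
signatures = rng.gamma(shape=2.0, scale=50.0, size=(200, 4))
true_fractions = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.20, 0.30, 0.40],
])
bulk = signatures @ true_fractions.T
bulk = bulk * rng.normal(1.0, 0.02, size=bulk.shape)  # 2% multiplicative noise

estimated = nnls_deconvolve(signatures, bulk)
print(np.round(estimated, 2))
```

On real data the signature matrix would come from per-cell-type averages of the scRNA-seq reference, and both matrices would need to be on a comparable scale (for example, library-size-normalized counts on a linear scale rather than log values) for the linear mixing assumption to hold.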
Example workflow: The SPOTlight method (designed for spatial transcriptomics) illustrates a deconvolution pipeline in which single-cell data is used to derive cell-type-specific profiles (via seeded NMF), and each mixed sample/spot is then decomposed into those profiles with non-negative regression. This yields the proportion of each cell type in the mixture (typically shown as one pie chart per spot), analogous to mapping single-cell types to bulk tissue samples probabilistically.
Each of the above deconvolution methods inherently supports overlapping cell type distributions. If a cell type is present in all three tissues, these algorithms will simply estimate an appropriate fraction in each tissue rather than forcing it to belong to only one. The probabilistic nature comes either from explicit probability modeling (e.g. BayesPrism’s posterior) or from the continuous fraction output. Fractions can be loosely interpreted as assignment probabilities, although for most methods they are point estimates rather than calibrated posteriors.
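To turn per-tissue cell-type fractions into "probability that a cell of type k comes from tissue t", one option is Bayes' rule: P(tissue | type) is proportional to P(type | tissue) times P(tissue). The sketch below assumes the fractions are already estimated and that a prior over tissues is supplied (uniform by default). The function name tissue_posterior and the example numbers are hypothetical.

```python
import numpy as np


def tissue_posterior(fractions, tissue_prior=None):
    """Invert cell-type fractions into P(tissue | cell type) via Bayes' rule.

    fractions:    (n_tissues, n_cell_types), rows sum to 1, i.e. P(type | tissue)
    tissue_prior: (n_tissues,) P(tissue); uniform if None (an assumption)
    Returns an (n_cell_types, n_tissues) array; rows sum to 1 for every
    cell type that is present in at least one tissue.
    """
    fractions = np.asarray(fractions, dtype=float)
    n_tissues = fractions.shape[0]
    if tissue_prior is None:
        tissue_prior = np.full(n_tissues, 1.0 / n_tissues)
    prior = np.asarray(tissue_prior, dtype=float)
    joint = fractions * prior[:, None]            # P(tissue, type)
    evidence = joint.sum(axis=0, keepdims=True)   # P(type)
    with np.errstate(invalid="ignore", divide="ignore"):
        posterior = np.where(evidence > 0, joint / evidence, 0.0)
    return posterior.T


# Three tissues (rows) sharing the same three cell types (columns).
fractions = np.array([
    [0.50, 0.30, 0.20],
    [0.25, 0.25, 0.50],
    [0.25, 0.45, 0.30],
])
posterior = tissue_posterior(fractions)
print(np.round(posterior, 3))
```

With a uniform prior, the first cell type (column 0) is assigned to the three tissues with probabilities 0.5, 0.25 and 0.25. A non-uniform prior would be appropriate if the tissues contributed very different numbers of cells to the single-cell pool.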
Single-Cell to Bulk Alignment and Spatial Mapping Approaches
Another class of methods comes from the integration of single-cell data with spatial transcriptomics or other bulk-like data. These methods map individual cells or cell types to spatial locations or bulk samples, often probabilistically (e.g. a probability that a given cell is in tissue A vs B vs C). We can leverage such algorithms by treating each bulk tissue sample as a “location” that needs to be deconvolved or populated with cells. Key examples:
-
Tangram: Tangram is a Python tool (built on PyTorch) originally developed to map single-cell RNA-seq data onto spatial transcriptomics profiles. It uses gradient-based optimization to learn a mapping that maximizes the similarity (cosine similarity in the published method) between the measured bulk/spatial expression and the aggregated expression of the assigned single cells. In practice, Tangram takes as input a set of single-cell expression profiles and (spatial) bulk expression data, and returns a mapping of each cell (or each cell type) to locations in the tissue. If we apply it to three tissue bulk datasets, treating the tissue bulk profiles as three “spots,” Tangram would assign each single cell a probability of originating from each tissue such that, when aggregated, the single-cell data recapitulates the bulk expression pattern. The output can be a probabilistic map (each cell gets a weight for each tissue) or a hard assignment if one chooses the best match. Tangram has been shown to produce “consistent spatial maps of cell types” and can deconvolve coarse expression data to single-cell resolution. Notably, Tangram can handle shared cell types because it does not require cell types to be unique to one domain; instead it finds a many-to-many mapping that best explains the data. One caveat: with only a handful of bulk “spots,” many different mappings can reconstruct the bulk profiles equally well, so the per-cell probabilities should be interpreted cautiously.
-
cell2location: This is a principled Bayesian model (available in Python, built on Pyro and scvi-tools) originally designed for spatial transcriptomics integration. cell2location models the observed bulk (or spatial spot) mRNA counts as a combination of contributions from various cell types, using a hierarchical model to account for technical differences between technologies. As the documentation states, “Cell2location estimates which combination of cell types in which abundance could have given the mRNA counts in the spatial data, while modeling technical effects.” In practice, one first uses scRNA-seq to learn reference expression signatures for each cell type, then cell2location finds the cell type abundance per location (tissue) that best explains the bulk. The result is a posterior distribution over cell type abundances per tissue, i.e. a probabilistic assignment of cell types to tissues with uncertainties. Cell2location is well-suited for overlapping cell types: it uses hierarchical pooling across locations to improve detection of even small contributions, and it does not force exclusivity (all tissues can have the same cell type in different proportions). It has been applied to complex organs and can resolve fine-grained cell types by “borrowing statistical strength” across related samples. For our use case, one could treat the bulk replicates as “spatial spots” without coordinates; cell2location will still compute cell type abundances for each tissue sample. Because the model is Bayesian, posterior quantiles provide credible intervals for those abundances.
-
Spatial Deconvolution Tools (adaptable to bulk): Many other methods developed for spatial transcriptomics deconvolution can be conceptually applied to bulk data as well. For example, Stereoscope (Andersson et al. 2020) uses probabilistic modeling (a negative binomial model of counts) to infer cell type proportions per spatial spot using scRNA-seq references; it could similarly output cell type proportions per bulk sample. CellDART (Nucleic Acids Res. 2022) trains neural networks on scRNA data to predict cell type mix and uses domain adaptation to apply them to spatial data. In their words, “CellDART… estimates the spatial distribution of cells defined by single-cell data using domain adaptation of neural networks.” One could imagine applying CellDART by treating each bulk sample as a “pseudo-spatial spot”: the model would yield the distribution of cell types for each tissue. SPOTlight (the example workflow described earlier) uses a seeded NMF approach to decompose each spatial spot into cell type topics; it can likewise be used on bulk samples (which are just spots with no spatial coordinate). All these methods output cell type composition vectors for each sample, which is effectively the probabilistic assignment sought (they inherently allow shared cell types across all spots/samples).
-
CytoSPACE: A newer method (Nat. Biotech. 2023) that performs high-resolution alignment of single cells to spatial profiles. CytoSPACE’s optimization approach places individual cells into spatial locations such that aggregated expression matches the observed profiles. If repurposed for bulk, CytoSPACE could, for example, assign each single cell to one of the three tissues (or even to specific bulk replicates) in an optimal way. Its native output is a hard assignment of cells to locations, so probabilities or frequencies would have to be derived, for instance by repeating the assignment on resampled sets of cells and counting how often each cell lands in each tissue. CytoSPACE was shown to handle scenarios with noise and overlapping cell types well. While primarily designed for spatial coordinates, ignoring the spatial aspect still leaves an expression-matching assignment, which is relevant here.
Bottom line: The above mapping approaches treat the bulk data as a target to which single-cell data must align. They yield either a cell-level assignment or a cell-type fraction per sample. These methods do well in complex scenarios because they use all genes and often incorporate regularization or prior information to handle technical differences. For tissues with overlapping cell type profiles, methods like Tangram and cell2location are advantageous: they do not assume exclusivity of cell types to any one tissue and can even highlight subtle tissue-specific biases in cell state. Because the output is probabilistic (e.g. cell2location’s posterior, or Tangram’s probability map of cells to tissues), you can say, for example, “Cell X has a 60% probability of originating in tissue A, 30% in B, and 10% in C” if its expression pattern is ambiguous, which is exactly the flexibility we need.
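The idea of treating each bulk tissue profile as a "spot" and learning a soft cell-to-tissue mapping can be illustrated with a small self-contained sketch. It is written in the spirit of Tangram's objective (a softmax-normalized mapping matrix scored by cosine similarity) but it is not Tangram's implementation or API: the function names, the equal weighting of the two similarity terms, the use of SciPy's L-BFGS-B with numerical gradients, and the synthetic data are all assumptions chosen to keep the example dependency-light.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax


def cosine_rows(a, b, eps=1e-12):
    """Cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den


def mapping_score(mapping, cells, bulk):
    """Average cosine similarity between observed and reconstructed bulk.

    mapping: (n_cells, n_tissues), rows sum to 1
    cells:   (n_cells, n_genes); bulk: (n_tissues, n_genes)
    """
    predicted = mapping.T @ cells                        # (n_tissues, n_genes)
    per_tissue = cosine_rows(predicted, bulk).mean()     # each tissue across genes
    per_gene = cosine_rows(predicted.T, bulk.T).mean()   # each gene across tissues
    return 0.5 * (per_tissue + per_gene)


def map_cells_to_tissues(cells, bulk, seed=0, maxiter=100):
    """Learn a soft cell-to-tissue mapping by maximizing mapping_score."""
    n_cells, n_tissues = cells.shape[0], bulk.shape[0]
    rng = np.random.default_rng(seed)
    start = rng.normal(scale=0.01, size=n_cells * n_tissues)  # near-uniform start

    def loss(flat_logits):
        mapping = softmax(flat_logits.reshape(n_cells, n_tissues), axis=1)
        return -mapping_score(mapping, cells, bulk)

    result = minimize(loss, start, method="L-BFGS-B", options={"maxiter": maxiter})
    return softmax(result.x.reshape(n_cells, n_tissues), axis=1)


# Synthetic data: 3 tissues, 10 cells each, 30 genes.
rng = np.random.default_rng(1)
n_genes, cells_per_tissue = 30, 10
tissue_programs = rng.gamma(2.0, 2.0, size=(3, n_genes))
origin = np.repeat(np.arange(3), cells_per_tissue)
cells = rng.poisson(tissue_programs[origin]).astype(float)
bulk = np.vstack([cells[origin == t].sum(axis=0) for t in range(3)])

mapping = map_cells_to_tissues(cells, bulk)
score = mapping_score(mapping, cells, bulk)
print(np.round(mapping[:5], 2))
```

Each row of mapping is a probability vector over tissues for one cell. Because three bulk profiles constrain the mapping only weakly, different random starts can give different per-cell weights with similar scores, which is worth checking before interpreting individual cells.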
Latent Space Integration and Domain Adaptation
Beyond explicit deconvolution, one can approach the problem via integrative data analysis techniques that embed single-cell and bulk data into a shared space or otherwise transfer labels across domains. These methods are useful for custom Python pipelines when existing tools are insufficient:
-
Joint Dimensionality Reduction: We can project bulk and single-cell data into a common low-dimensional space (for instance, via PCA, canonical correlation analysis, or nonnegative matrix factorization) and then compare their representations. The idea is that cells and bulk samples that are similar will cluster together in this latent space despite data-type differences. For example, one could concatenate the bulk sample profiles with the single-cell profiles (perhaps after scaling) and perform PCA or use an algorithm like integrative NMF or Multiple Factor Analysis to find components that explain both datasets. Cells can then be assigned to tissues by proximity: e.g., if a single cell’s latent coordinates are closest to those of tissue A’s bulk replicates, we assign a higher probability to tissue A. Anchor-based methods (like Seurat v3’s CCA + mutual nearest neighbors, or Scanorama’s manifold alignment) could also be adapted – these find correspondences between datasets. In Python, one could use Scanorama or Harmony (though originally for scRNA batch integration) to merge bulk and single-cell data by treating bulk samples as “pseudo-cells” – yielding an integrated atlas where each cell is annotated with a batch (tissue) identity probability. While these approaches are not out-of-the-box probabilistic, one can derive probabilities by distance or by using a classifier in the latent space.
-
Adversarial Domain Adaptation: In machine learning, domain adaptation networks can learn a representation that is predictive of tissue label while being invariant to whether the input came from bulk or single-cell. A framework like scAdapt (as noted in a recent overview) uses a “virtual adversarial domain adaptation network for single-cell RNA-seq data classification across platforms”. One could train a neural network classifier to predict the tissue of origin for a given expression profile, using the bulk samples as positive examples for each tissue. To train on single cells (which lack tissue labels), one can use adversarial training: a domain discriminator tries to distinguish single-cell vs bulk, and the encoder is trained to confuse it, thereby forcing the network to learn features common to both domains. At the same time, the bulk samples’ tissue labels supervise the classification. This way, single cells will be mapped to a position in feature space and get a predicted tissue probability, even though they were unlabeled. This custom approach would yield a probability distribution (softmax output) for tissue labels on each single cell. While implementing such a network requires some coding (using PyTorch/TensorFlow), it leverages all data and can capture non-linear patterns. Researchers have started applying deep domain adaptation in single-cell analyses to combat batch effects, so applying it to the single-cell vs bulk modality gap is feasible.
-
Matrix Factorization & Topic Modeling: Another integration strategy is to factorize the bulk expression matrix alongside the single-cell data matrix to find shared metagenes or metacell patterns. For instance, one could perform a coupled matrix factorization where one factor matrix corresponds to gene expression programs (common to both single-cell and bulk) and another factor corresponds to cell or sample loadings. Tools like LIGER (R) implement integrative nonnegative matrix factorization (iNMF) to find shared and dataset-specific factors across scRNA-seq datasets. In Python, one might use libraries for nonnegative matrix factorization or even Latent Dirichlet Allocation (treating cells as “documents,” genes as “words,” and expression programs as “topics”) to identify clusters of cells that align with bulk sample differences. If, for example, factor 1 represents a gene program highly expressed in tissue A, cells with a high loading on factor 1 are more likely to come from tissue A. This is a more unsupervised approach: you might cluster single cells by factor loadings and see that clusters align with tissues. Some spatial methods like SPOTlight effectively use topic modeling (NMF “topics” per cell type and per spot); similar ideas could help map cell types to tissues by finding a low-dimensional representation capturing tissue-specific variance.
-
Variational Bayesian Models: Building on tools like scVI (single-cell variational inference) and totalVI, one can conceive a hierarchical model for bulk and single-cell data. For example, DestVI (Lopez et al., 2022) extends scVI to spatial data by learning cell-type-specific latent profiles and spot-level proportions. A DestVI-like model for our case would learn an embedding for each cell that captures its cell type and another variable for tissue influence, then infer tissue composition. This goes beyond the tool's intended use, but since scvi-tools is written in Python, a skilled user could adapt it (DestVI is already implemented in Python via scvi-tools). The advantage is a fully probabilistic model: DestVI infers cell type proportions per spot and can even model continuums of cell identity. For overlapping cell types, such a model can learn slight gene expression shifts as continuous latent factors rather than forcing discrete differences.
In summary, integration and domain adaptation techniques provide flexible, custom solutions when off-the-shelf tools don’t exactly meet the needs. They might require more work to set up, but they enable probabilistic tissue assignment by either clustering cells with bulk samples or by directly predicting tissue labels with confidence scores. For overlapping cell types, these approaches are effective because they leverage subtle shifts in gene expression or latent dimensions to tease apart which tissue a cell likely came from, without assuming unique cell types for each tissue.
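The "assign by proximity in a shared latent space" idea can be sketched with plain NumPy. The example below centers each modality separately (a crude stand-in for batch correction), derives a PCA basis from the stacked data, and converts squared distances to tissue centroids into probabilities with a softmax. The function name tissue_probabilities, the temperature parameter, and the synthetic data are assumptions; the resulting probabilities are not calibrated and depend on the chosen temperature.

```python
import numpy as np


def tissue_probabilities(cells, bulk, bulk_labels, n_components=5, temperature=1.0):
    """Soft tissue assignment of cells by distance to bulk centroids in PCA space.

    cells: (n_cells, n_genes) and bulk: (n_samples, n_genes), e.g. log-normalized
    bulk_labels: tissue label for each bulk sample
    Returns (tissues, probs) where probs has shape (n_cells, n_tissues).
    """
    labels = np.asarray(bulk_labels)
    tissues = np.unique(labels)
    # Center each modality separately to remove the bulk-vs-single-cell offset.
    cells_c = cells - cells.mean(axis=0)
    bulk_c = bulk - bulk.mean(axis=0)
    # PCA basis from the stacked, centered data (rows of vt are components).
    _, _, vt = np.linalg.svd(np.vstack([cells_c, bulk_c]), full_matrices=False)
    basis = vt[:n_components].T
    z_cells, z_bulk = cells_c @ basis, bulk_c @ basis
    centroids = np.vstack([z_bulk[labels == t].mean(axis=0) for t in tissues])
    # Squared Euclidean distance from every cell to every tissue centroid.
    d2 = ((z_cells[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (temperature * d2.mean())
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return tissues, probs


# Synthetic data: 3 tissues, 30 cells and 2 bulk replicates each, 50 genes.
rng = np.random.default_rng(0)
tissue_means = rng.normal(0.0, 2.0, size=(3, 50))
cell_truth = np.repeat(np.array(["A", "B", "C"]), 30)
cells = np.repeat(tissue_means, 30, axis=0) + rng.normal(0.0, 1.0, size=(90, 50))
bulk_labels = np.repeat(np.array(["A", "B", "C"]), 2)
bulk = np.repeat(tissue_means, 2, axis=0) + rng.normal(0.0, 0.2, size=(6, 50))

tissues, probs = tissue_probabilities(cells, bulk, bulk_labels)
predicted = tissues[probs.argmax(axis=1)]
print("agreement with simulated origin:", (predicted == cell_truth).mean())
```

The simulated tissues here differ strongly, so proximity works well. Real tissues that share most cell types will differ far less, and cell-type identity will usually dominate the leading components, so in practice one would apply this within a cell type or after regressing out cell-type effects.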
Considerations for Overlapping Cell Type Profiles
When tissues share many cell types, a few additional points are important:
-
Use Probabilistic Outputs: Instead of forcing a binary decision, use methods that return a probability or fraction. For example, a classifier’s softmax can indicate that a cell is split, say, 40% vs 60% between two tissues, and BayesPrism’s posterior conveys the uncertainty in each cell-type fraction. This is more informative than a hard label when the evidence is ambiguous. Many tools above inherently do this; e.g., cell2location gives a distribution over cell abundances per tissue spot, which can be normalized to proportions.
-
Incorporate Tissue-Specific Markers if Known: If certain genes are uniquely upregulated in a cell type when it’s in tissue A vs tissue B, including that information can improve mapping. Approaches like CIBERSORTx allow specifying marker genes and can output cell-type-specific expression for each tissue, helping to identify, for instance, “microglia in cornea vs microglia in conjunctiva” differences. Similarly, the TAPE autoencoder effectively learns such tissue-dependent expression changes for each cell type.
-
Multi-Modal Integration: Although outside the scope here, keep in mind that if other data (e.g. spatial location, ATAC-seq, or protein expression) is available, integrative models (like Tangram for histology or multi-omics factorization) can increase confidence in assignments by adding complementary evidence.
-
Validation: It is useful to validate the assignments where possible. For example, if one cell cluster is assigned mostly to “placenta region A” by the algorithm, check whether the literature or small-scale experiments find that cluster’s marker genes in that region. Tools like Tangram validate mappings with held-out genes, and you could do something similar: hold out some tissue-specific marker genes and see if the mapping still recovers them in the correct tissue.
-
Ready-to-use vs Custom: Many of the mentioned advanced methods (cell2location, Tangram, Scaden, etc.) are available as Python packages with documentation or tutorials. For instance, Tangram has a tutorial for mapping single-cell to Visium data, cell2location provides notebooks for integrating human lymph node data, and Scaden has an easy-to-use command-line interface. Leveraging these can save time. On the other hand, custom pipelines (using PyTorch, scvi-tools, scanpy, etc.) offer flexibility: one can combine elements (e.g. use scVI to get a latent space, then KNN-classify bulk samples). The choice depends on the user’s comfort; given the preference for Python over R-based solutions, it is fortunate that several Python implementations exist for the key tasks.
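The held-out gene check mentioned under Validation can be sketched for the simplest case, an NNLS deconvolution. Fractions are fitted on a training subset of genes, then used to predict the held-out genes, and the prediction is compared with the observed bulk values. This mirrors the held-out idea only in spirit and is not Tangram's validation code; the function name heldout_gene_check, the 20% hold-out size, and the synthetic data are assumptions.

```python
import numpy as np
from scipy.optimize import nnls


def heldout_gene_check(signatures, bulk, heldout_fraction=0.2, seed=0):
    """Pearson correlation between predicted and observed held-out genes.

    signatures: (n_genes, n_cell_types); bulk: (n_genes, n_samples)
    Returns one correlation per bulk sample.
    """
    rng = np.random.default_rng(seed)
    n_genes = signatures.shape[0]
    n_heldout = max(2, int(heldout_fraction * n_genes))
    heldout = rng.choice(n_genes, size=n_heldout, replace=False)
    train = np.setdiff1d(np.arange(n_genes), heldout)
    correlations = []
    for j in range(bulk.shape[1]):
        # Fit cell-type weights on training genes only.
        weights, _ = nnls(signatures[train], bulk[train, j])
        # Predict the genes the fit never saw.
        predicted = signatures[heldout] @ weights
        correlations.append(np.corrcoef(predicted, bulk[heldout, j])[0, 1])
    return np.array(correlations)


# Synthetic data: 200 genes, 4 cell types, 3 bulk samples, 5% noise.
rng = np.random.default_rng(42)
signatures = rng.gamma(shape=2.0, scale=50.0, size=(200, 4))
mixtures = rng.dirichlet(np.ones(4), size=3)           # (n_samples, n_cell_types)
bulk = signatures @ mixtures.T
bulk = bulk * rng.normal(1.0, 0.05, size=bulk.shape)

correlations = heldout_gene_check(signatures, bulk)
print(np.round(correlations, 3))
```

A high correlation on absolute expression is easy to achieve because some genes are simply abundant everywhere, so a stricter check would restrict the held-out set to tissue-specific marker genes, as the text suggests.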
References: The methods and tools discussed are supported by recent literature. For example, MuSiC’s efficacy on closely related cell types is noted in its publication, and BayesPrism’s Bayesian framework for bulk–single-cell integration is described by Chu et al. Deep learning approaches like Scaden and TAPE demonstrate how neural networks improve deconvolution robustness and adapt to tissue-specific signals. For spatial-to-single-cell alignment, Tangram and cell2location provide probabilistic mappings that can be applied to bulk data. Domain adaptation techniques are highlighted in emerging studies (e.g., scAdapt in Brief. Bioinf. 2021, CellDART in NAR 2022). By combining insights from these resources, one can assign single-cell-derived cell types to their tissue of origin in a probabilistic manner, even when the tissues share many of the same cell populations. This integrated analysis can help reveal tissue-specific expression programs and differences in cell distribution that are hard to discern from single-cell or bulk data alone.
Sources:
- MuSiC deconvolution for closely related cell types
- CIBERSORTx inferring cell type fractions and specific expression
- BayesPrism Bayesian integration of bulk and single-cell (posterior over fractions)
- Scaden deep learning model for cell composition (robust across tissues)
- TAPE autoencoder for tissue-adaptive deconvolution (predicts fractions & expression)
- Tangram alignment of single cells to spatial (bulk) reference, yielding cell maps
- cell2location Bayesian model mapping cell types to locations (with technical effects)
- CellDART domain adaptation neural network for cell distribution mapping
- Integrative NMF/CCA for multi-dataset alignment (find shared factors across batches)
- scAdapt adversarial domain adaptation for cross-platform cell classification