Single-cell Multimodal Assays
- Single-cell multimodal assays are experimental techniques that simultaneously capture multiple molecular layers, providing a comprehensive view of cellular state and heterogeneity.
- They integrate data from transcriptomics, chromatin accessibility, protein abundance, and spatial contexts using advanced methods like matrix factorization, CCA, and deep generative models.
- These assays enable detailed analysis of cell differentiation, immune profiling, and tumor microenvironments, facilitating biomarker discovery and precision diagnostics.
Single-cell multimodal assays comprise experimental techniques and analytical frameworks that enable the simultaneous measurement and integrated analysis of multiple molecular modalities (such as transcriptome, chromatin accessibility, protein abundance, and spatial information) in individual cells. By capturing orthogonal layers of cellular regulation within the same sample, these assays provide a multidimensional view of cell state, lineage, and microenvironment, facilitating systems-level inference of biological heterogeneity, regulatory circuits, and spatial context. Their proliferation has driven the rapid development of sophisticated integration, alignment, feature selection, and representation learning methods, collectively establishing single-cell multimodal omics as a foundational paradigm in contemporary cell biology, cancer research, immunology, and developmental studies.
1. Molecular Modalities and Experimental Workflows
Single-cell multimodal assays are designed to profile distinct molecular layers, each necessitating specialized chemistry and computational preprocessing.
- Transcriptomics (scRNA-seq) quantifies per-cell gene expression using droplet- or plate-based sequencing platforms.
- Chromatin accessibility (scATAC-seq) measures open chromatin regions, typically via transposase tagging and sequencing.
- Surface protein abundance (CITE-seq, REAP-seq) assays cell-surface markers via oligonucleotide-labeled antibodies.
- Spatial transcriptomics (Xenium, Visium) localizes gene expression within intact tissue slices using in situ barcoding or imaging.
- Multi-omics protocols (scNMT-seq, 10x Multiome) combine two or three modalities (e.g. RNA, ATAC, methylation, or ADT) within the same cell.
Representative protocols include:
- 10x Genomics Multiome ATAC+GEX (nuclei isolation, transposition, dual barcoding, sequencing)
- CITE-seq (cell staining with DNA-antibodies, droplet encapsulation, library preparation for RNA and ADT)
- Xenium spatial transcriptomics (intact tissue imaging, spatial barcoding, per-cell assignment of transcripts and coordinates) (Mehta et al., 2023, Acosta et al., 13 Aug 2025).
The resulting data typically consist of high-dimensional, sparse count matrices, one per modality. Careful quality control (doublet removal, low-complexity filtering), normalization (library-size scaling, variance stabilization), and modality-specific transformations are foundational for downstream integration.
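The normalization step above can be sketched in a few lines. Below is a minimal numpy illustration of library-size scaling followed by a log1p variance-stabilizing transform; the function name and the `target_sum` of 10,000 (counts-per-10k) are illustrative conventions, not a specific package's API.

```python
import numpy as np

def normalize_counts(counts, target_sum=1e4):
    """Library-size normalize a cells x genes count matrix, then log1p.

    counts: dense ndarray of raw UMI counts (cells x genes).
    Returns the log-normalized matrix.
    """
    lib_size = counts.sum(axis=1, keepdims=True)   # per-cell total counts
    lib_size[lib_size == 0] = 1                    # guard against empty cells
    scaled = counts / lib_size * target_sum        # counts-per-10k scaling
    return np.log1p(scaled)                        # variance-stabilizing log transform

# Toy example: 3 cells x 4 genes
X = np.array([[10, 0, 5, 5],
              [2, 2, 2, 2],
              [0, 0, 0, 8]], dtype=float)
Xn = normalize_counts(X)
```

After this transform every cell has the same effective sequencing depth, which is the usual precondition for the integration methods discussed in the next section.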
2. Computational Strategies for Data Integration and Alignment
Given the heterogeneity in measurement distributions, scale, and modality coverage, integration methods seek to embed data into a unified latent space amenable to biological inference. Core algorithmic paradigms include:
- Matrix factorization: Shared-factor models (e.g., MOFA+) posit that all omics layers can be approximated as a product of shared and modality-specific low-dimensional factors, with regularization to separate joint from unique signals.
- CCA and anchor-based alignment: Seurat v3/v4 employs canonical correlation analysis (CCA), mutual nearest-neighbor “anchors,” and iterative batch correction to align modalities and remove platform effects (Stanojevic et al., 2022, Anaissi et al., 1 Jan 2026).
- Graph-based fusion: Methods like citeFUSE and network fusion build cell–cell similarity networks for each modality, followed by iterative diffusion and fusion to reach consensus (Stanojevic et al., 2022).
- Optimal transport: Gromov-Wasserstein frameworks (SCOT, Pamona) minimize the discrepancy in intra-modal and cross-modal distances via a transport plan, generalizing MNN for partial overlap scenarios (Stanojevic et al., 2022).
- Deep generative models: Variational autoencoders (scMVAE, totalVI, BABEL) employ modality-specific encoders/decoders to learn cross-modal translation and imputation in a probabilistic latent space (Stanojevic et al., 2022).
- Graph neural networks and transformers: Heterogeneous graphs (scMoGNN, scMoFormer) and sequence models (scFusionTTT, scMamba) represent cells, features, and modalities as nodes or tokens with explicit message passing, attention, or state-space computation (Tang et al., 2023, Wen et al., 2022, Meng et al., 2024, Yuan et al., 25 Jun 2025).
The selection of integration strategy depends on coverage (paired/unpaired), statistical properties (sparsity, nonlinearity), scalability requirements, and desired interpretability.
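To make the CCA-based alignment paradigm concrete, here is a textbook canonical correlation analysis over two paired modality matrices, implemented in numpy. This is a generic sketch of the idea behind anchor-based alignment, not Seurat's actual implementation; the regularization term `eps` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def cca_embed(X, Y, k=2, eps=1e-6):
    """Project two paired modality matrices (cells x features) onto their
    top-k canonical correlation directions (classical CCA)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])  # regularized covariances
    Syy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Whitening via eigendecomposition of a symmetric PSD matrix
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(K)
    A = inv_sqrt(Sxx) @ U[:, :k]   # canonical weights, modality 1
    B = inv_sqrt(Syy) @ Vt[:k].T   # canonical weights, modality 2
    return Xc @ A, Yc @ B          # aligned low-dimensional embeddings

# Two synthetic modalities sharing a 2-dimensional latent signal
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 2))
X = Z @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(100, 20))
Y = Z @ rng.normal(size=(2, 15)) + 0.1 * rng.normal(size=(100, 15))
Ex, Ey = cca_embed(X, Y, k=2)
```

On this toy example the first canonical pair recovers the shared latent signal, so the two embeddings are strongly correlated cell by cell; anchor-based methods then match mutual nearest neighbors in this aligned space.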
3. Representation Learning and Disentanglement Frameworks
Advanced models explicitly seek to disentangle shared (biology-driven) and modality-specific (assay- or platform-driven) variation:
- Hierarchical VAEs such as CAVACHON introduce layered latent variables — a shared latent z capturing structure common to all modalities and per-modality latents z_m capturing modality-specific components — with a generative model of the form p(x_m | z, z_m). The variational posterior is mean-field Gaussian, and the ELBO penalizes deviations from both shared and specific priors, enabling downstream differential analysis and cluster discovery (Hsieh et al., 2024).
- Unpaired, multimodal β-VAEs (scMRDR) use a single encoder–decoder architecture where each cell is mapped into modality-shared and modality-specific latents. Key loss components include a β-weighted KL divergence, isometric regularization for latent geometry preservation, adversarial modality alignment (a cross-modal discriminator), and masked reconstruction for missing features.
This architecture scales to millions of unpaired cells and multiple omics layers (Sun et al., 28 Oct 2025).
These frameworks excel at capturing high-order relationships, decomposing biological versus technical sources of variation, and supporting modality-specific or joint differential analyses.
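The shared/specific decomposition used by these frameworks can be written as a standard evidence lower bound. The notation below is generic rather than any one model's exact formulation: z denotes the shared latent, z_m the latent specific to modality m, and β weights the KL regularizers as in a β-VAE.

```latex
\mathcal{L} = \sum_{m} \mathbb{E}_{q_\phi(z, z_m \mid x)}\!\left[ \log p_\theta(x_m \mid z, z_m) \right]
 - \beta \,\mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
 - \beta \sum_{m} \mathrm{KL}\!\left( q_\phi(z_m \mid x_m) \,\|\, p(z_m) \right)
```

Maximizing the reconstruction term while penalizing both KL terms pushes jointly explained variation into z and assay-specific variation into the z_m, which is what enables the modality-specific versus joint differential analyses described above.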
4. Feature Selection, Benchmarking, and Preprocessing Pipelines
Efficient computational workflows are essential for handling the volume and heterogeneity of single-cell multimodal data:
- Unsupervised multimodal feature selection (mmDUFS) uses Laplacian-based scoring operators to identify features in both modalities that are smooth on joint or modality-specific manifolds, with differentiable gates separating informative from nuisance features (Yang et al., 2023).
- Preprocessing pipelines benchmarked by Anaissi et al. combine normalization (SCTransform, log-normalization, TF-IDF, Linnorm, CPM, Scran) with batch correction and dimension reduction (PCA, UMAP, PHATE, t-SNE). Benchmarking shows that SCTransform and TF-IDF yield the most robust integration across modalities, Harmony and FastMNN perform best for batch/multimodality correction, and UMAP consistently achieves the highest visualization and clustering scores (Silhouette, ARI, Calinski-Harabasz) (Anaissi et al., 1 Jan 2026).
- Metric selection includes ARI and NMI for cluster agreement, Silhouette for cohesion/separation, LISI for mixing/purity, and Pearson correlation for imputation tasks. Validation is performed on held-out cohorts and imputation of missing modalities (Anaissi et al., 1 Jan 2026, Arriola et al., 2024).
Practical guidelines favor combinatorial pipelines—modality-appropriate normalization, scalable integrators, and robust low-dimensional embedding and clustering.
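Of the normalizations mentioned above, TF-IDF is the one specific to chromatin accessibility. A minimal numpy sketch of one common TF-IDF variant for a cells x peaks matrix follows; several variants exist (log-scaled TF, different IDF smoothing), so this is an illustrative formula rather than any particular package's exact implementation.

```python
import numpy as np

def tfidf(peaks):
    """TF-IDF transform for a cells x peaks accessibility matrix,
    as commonly applied before LSI on scATAC-seq data."""
    # Term frequency: each cell's counts scaled by its total accessibility
    tf = peaks / np.maximum(peaks.sum(axis=1, keepdims=True), 1)
    n_cells = peaks.shape[0]
    df = (peaks > 0).sum(axis=0)                 # cells in which each peak is open
    idf = np.log1p(n_cells / np.maximum(df, 1))  # down-weight ubiquitous peaks
    return tf * idf

P = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 0, 0]], dtype=float)
T = tfidf(P)
```

The effect is that peaks open in nearly every cell (low information) are down-weighted relative to rare, cell-type-specific peaks, which is why TF-IDF pairs well with the dimension-reduction steps listed above.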
5. Foundation Models and High-dimensional Context Integration
Recent advances leverage large foundation models trained on massive paired multimodal data:
- scMamba treats genomic regions as “words” and cells as “sentences” via patch-based tokenization, using Mamba2 state-space blocks (SSD) for efficient long-context encoding. Contrastive and cosine-similarity regularization align modalities without feature selection, resulting in superior omic alignment and biological resolution on datasets up to 377K cells (Yuan et al., 25 Jun 2025).
- CellSymphony and PAST integrate transformer-derived transcriptomic and vision transformer-based morphology embeddings, fused via multimodal transformers or contrastive learning, enabling highly accurate cell-type annotation, spatial niche discovery, gene expression prediction, virtual IHC, and survival modeling directly from histopathology (Acosta et al., 13 Aug 2025, Yang et al., 8 Jul 2025).
- scFusionTTT introduces Test-Time Training (TTT) layers for linear-complexity context modeling, preserving gene/protein order and combining masked autoencoding with cross-modal fusion. The model surpasses prior attention-based approaches in cluster purity, and maintains order-sensitivity—shuffling the input order degrades ARI and NMI by 0.07–0.08 (Meng et al., 2024).
- SC5 generative topic models enable joint analysis and robust cross-modality imputation across cohorts with missing modalities, outperforming regression and prior VAE-based frameworks in ARI/NMI and imputation accuracy (Arriola et al., 2024).
These models have elevated the precision and scalability of multimodal integration, supporting downstream mechanistic discovery, virtual molecular phenotyping, and comprehensive spatial–molecular mapping.
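The contrastive alignment objective shared by several of these foundation models can be sketched compactly. Below is a generic symmetric InfoNCE-style loss over paired cell embeddings from two modalities, written in numpy; it illustrates the spirit of these objectives and is not taken from any of the cited models' code. The temperature value and synthetic embeddings are assumptions.

```python
import numpy as np

def contrastive_alignment_loss(emb_rna, emb_atac, temperature=0.1):
    """Symmetric InfoNCE-style loss over paired cell embeddings:
    each cell's RNA embedding should be most similar to its own ATAC
    embedding (diagonal of the similarity matrix)."""
    # L2-normalize so dot products are cosine similarities
    a = emb_rna / np.linalg.norm(emb_rna, axis=1, keepdims=True)
    b = emb_atac / np.linalg.norm(emb_atac, axis=1, keepdims=True)
    logits = a @ b.T / temperature       # n x n similarity matrix
    idx = np.arange(len(a))              # positive pairs on the diagonal

    def xent(L):
        L = L - L.max(axis=1, keepdims=True)  # numerical stability
        logp = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()         # cross-entropy toward the diagonal

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(1)
Z = rng.normal(size=(32, 8))
loss_aligned = contrastive_alignment_loss(Z, Z + 0.01 * rng.normal(size=Z.shape))
loss_random = contrastive_alignment_loss(Z, rng.normal(size=Z.shape))
```

Well-aligned paired embeddings yield a much lower loss than random pairings, which is the gradient signal these models use to pull the modalities into a common latent space.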
6. Applications and Biological Insights
Single-cell multimodal assays and their associated computational frameworks have yielded deep biological insights:
- Deconvolution of differentiation trajectories (e.g. hematopoiesis, erythroid lineage, immune activation) via integrated RNA–ATAC–protein embedding and trajectory conservation metrics (Mehta et al., 2023, Yuan et al., 25 Jun 2025).
- Cross-modal cell-type annotation, recovering fine immune cell subtypes and rare populations (B-cell subtypes, myoepithelial cells, Naive/MK/E differentiation) (Yuan et al., 25 Jun 2025, Acosta et al., 13 Aug 2025, Yang et al., 8 Jul 2025).
- Spatial niche discovery in cancer microenvironments (B/T-cell enrichment, stroma compartmentalization, glandular gradients), with single-cell resolution in complex tissues (Acosta et al., 13 Aug 2025).
- Modality-specific differential analysis (chromatin-driven, transcript-driven regulation) and imputation of unmeasured features in cross-cohort studies (Hsieh et al., 2024, Arriola et al., 2024).
- Virtual molecular staining and survival prediction from archival pathology, demonstrating clinical utility beyond traditional sequencing (Yang et al., 8 Jul 2025).
The ability to fuse, align, and interrogate multiple omics modalities in individual cells has shifted the landscape of mechanistic cell biology, disease biomarker discovery, and precision diagnostics.
7. Limitations and Future Directions
Despite rapid progress, several computational and experimental challenges remain:
- Extension to >2 or 3 modalities (e.g. RNA, ATAC, protein, spatial, methylation) with missing data per cohort or batch, requiring robust imputation and disentanglement architectures (Sun et al., 28 Oct 2025, Arriola et al., 2024).
- Scalability for datasets with millions of cells and features, where memory and compute constraints invite graph sampling, sparse Laplacians, online training, and optimized foundation models (Yuan et al., 25 Jun 2025, Anaissi et al., 1 Jan 2026).
- Incorporation of biological priors and regulatory networks in integration models for improved interpretability and mechanistic insight (Yuan et al., 25 Jun 2025, Acosta et al., 13 Aug 2025).
- Automated hyperparameter tuning for pipelines with complex dependency structures (patch size, gate noise, regularization, fusion mechanisms) (Meng et al., 2024, Anaissi et al., 1 Jan 2026).
- Application to spatial–temporal data, perturbation-response modeling, and clinical translational tasks (virtual staining, risk stratification) (Yang et al., 8 Jul 2025).
Future research will likely emphasize data-driven graph construction, multimodal pre-training, spatial–molecular–phenotypic fusion, and clinical decision support grounded in single-cell multilineage resolution.
Key references: (Sun et al., 28 Oct 2025, Yuan et al., 25 Jun 2025, Mehta et al., 2023, Hsieh et al., 2024, Yang et al., 2023, Anaissi et al., 1 Jan 2026, Tang et al., 2023, Wen et al., 2022, Stanojevic et al., 2022, Meng et al., 2024, Acosta et al., 13 Aug 2025, Yang et al., 8 Jul 2025, Arriola et al., 2024).