Single-cell Multimodal Omics
- Single-cell multimodal omics is an integrated approach that measures transcriptome, chromatin accessibility, proteome, and spatial context from individual cells to map cellular diversity.
- Recent computational advances employing hierarchical VAEs, graph neural networks, and transformer-based methods have made integration more robust and the analysis of complex datasets more scalable.
- Innovative modeling strategies addressing sparsity, batch effects, and modality-specific noise have improved cell type annotation and regulatory network reconstruction in heterogeneous tissues.
Single-cell multimodal omics refers to the integrated quantification of multiple molecular modalities—such as transcriptome, chromatin accessibility, proteome, and spatial morphology—from the same individual cell. This technological and computational advance enables comprehensive characterization of cellular identity, developmental trajectories, and regulatory mechanisms within heterogeneous tissue contexts. Major challenges include extreme sparsity and dimensionality of data, batch and domain effects, heterogeneity of feature spaces, unreliable cell correspondence, and modality-specific noise profiles. Recent developments unite large-scale foundation models, hierarchical variational autoencoders, graph neural networks, and specialized tokenization strategies to achieve robust, scalable integration and downstream biological discovery.
1. Molecular Modalities and Data Structures
Single-cell multimodal omics encompasses several experimental platforms:
- Dual- and triple-omics (e.g., Multiome, CITE-seq): Simultaneous measurement of scRNA-seq (gene expression), scATAC-seq (chromatin accessibility), and/or ADT (surface protein) from the same cell or nucleus.
- Spatial omics: Integration of single-cell transcriptomics with morphology from histology images (H&E or IF), spatial coordinates, and neighborhood relationships (e.g., Xenium, CosMx platforms) (Acosta et al., 13 Aug 2025, Yang et al., 8 Jul 2025).
- Multi-assay compendia: Large atlases combining scRNA, snRNA, snATAC, spatial transcriptomics, and in some models, text or image-based metadata (Li et al., 30 Sep 2025, Wang et al., 9 Jan 2026).
Feature sets typically differ in resolution (e.g., ∼20,000 genes, ∼100,000 ATAC peaks, 130–200 proteins) and distribution (counts, binary, overdispersed, ordinal/ranked), leading to nontrivial fusion and alignment challenges.
2. Foundational Computational Frameworks
Single-cell multi-omic integration frameworks fall into several technical classes (Stanojevic et al., 2022):
| Class | Core Principle | Example Methods |
|---|---|---|
| Statistical projection | Linear correlation/covariance | CCA, PLS, Seurat v3 MNN, MAESTRO |
| Matrix factorization | Shared latent factors | MOFA+, scAI, LIGER, BREM-SC |
| Network/graph models | Affinity/graph fusion | SNF, Joint Diffusion, WNN (Seurat v4) |
| Manifold alignment | Geometry, optimal transport | MATCHER, MMD-MA, SCOT, Pamona |
| Deep learning frameworks | Generative, adversarial | VAEs (scMVAE, totalVI), AEs (BABEL), GANs |
| Graph neural networks | Message passing | scMoGNN (Wen et al., 2022), MoRE-GNN (Wang et al., 8 Oct 2025) |
| Transformer-based models | Tokenization, cross-attn | scMamba (Yuan et al., 25 Jun 2025), scMoFormer (Tang et al., 2023), Nephrobase Cell+ (Li et al., 30 Sep 2025) |
Variational autoencoder (VAE)-based methods, including β-VAE, hierarchical DAG-guided VAEs (CAVACHON (Hsieh et al., 2024)), and product-of-experts/posterior fusion architectures, have emerged as generalizable, modality-agnostic backbone models. These are often combined with adversarial objectives (a discriminator that encourages domain-invariant latents), contrastive losses for modality alignment, and masked reconstruction losses to handle missing features (Sun et al., 28 Oct 2025, Hsieh et al., 2024).
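To make the product-of-experts idea concrete, the following is a minimal sketch of fusing per-modality diagonal-Gaussian posteriors into one joint posterior. It is not the implementation of any cited method: the function name `product_of_experts`, the use of `None` to mark a missing modality, and the inclusion of a standard-normal prior as an extra expert are all illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def product_of_experts(
    means: Sequence[Optional[Vector]],
    variances: Sequence[Optional[Vector]],
    latent_dim: int,
) -> Tuple[List[float], List[float]]:
    """Fuse diagonal-Gaussian posteriors; a None entry marks a missing modality.

    Hypothetical helper for illustration only.
    """
    # Assumption: a standard-normal prior acts as one more expert
    # (mean 0, precision 1), so the result is defined even if every
    # modality is missing.
    precision = [1.0] * latent_dim
    weighted_mean = [0.0] * latent_dim

    for mu, var in zip(means, variances):
        if mu is None or var is None:
            continue  # skip modalities that were not measured for this cell
        for d in range(latent_dim):
            precision[d] += 1.0 / var[d]       # precisions add
            weighted_mean[d] += mu[d] / var[d]  # precision-weighted means add

    fused_var = [1.0 / p for p in precision]
    fused_mean = [m * v for m, v in zip(weighted_mean, fused_var)]
    return fused_mean, fused_var


# One latent dimension; RNA and ATAC encoders both have unit variance.
both_mean, both_var = product_of_experts([[2.0], [0.0]], [[1.0], [1.0]], 1)
# Same cell with the ATAC modality missing.
rna_only_mean, rna_only_var = product_of_experts([[2.0], None], [[1.0], None], 1)
print(both_mean, both_var, rna_only_mean, rna_only_var)
```

With unit variances, the fused mean is 2/3 when both modalities are present and 1.0 when ATAC is missing. The fused variance shrinks as experts are added (1/3 versus 1/2), which is the sense in which each observed modality sharpens the joint posterior.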
3. Advanced Modeling Strategies and Scalability
Recent innovations address persistent obstacles in multimodal integration:
- Patch-based tokenization: Raw genomic matrices are segmented into contiguous regions (“patches”), each linearly embedded with learnable positional encodings to conserve spatial/genomic context (Yuan et al., 25 Jun 2025).
- State-space duality and TTT layers: Sequence modeling via state-space layers or test-time training (TTT) achieves scalable long-range interaction modeling, essential for high-dimensional genomics where the quadratic cost of standard attention becomes prohibitive (Meng et al., 2024, Yuan et al., 25 Jun 2025).
- Sparse Mixture-of-Experts: For large foundation models, expert routing diversifies representation capacity and prevents mode collapse when integrating disparate assays (Li et al., 30 Sep 2025).
- Contrastive learning and modality alignment: Joint embedding spaces are regularized via InfoNCE or cosine similarity objectives to enforce alignment between modalities (e.g., RNA/ATAC, RNA/protein, gene/image), enabling accurate cell-type annotation, trajectory inference, and regulatory network reconstruction (Yang et al., 8 Jul 2025, Yuan et al., 25 Jun 2025, Wang et al., 9 Jan 2026).
- Graph-based fusion and attention: Heterogeneous graphs encoding cell–feature, feature–feature, and even spatial relationships undergo convolutional and attention-based updates to extract latent structure and unify biological signals from disparate modalities (Wen et al., 2022, Wang et al., 8 Oct 2025, Tang et al., 2023).
- Handling missing modalities and batch effects: Cross-cohort integration via product-of-experts VAE encoders and domain-specific latent shifts enables imputation of wholly unobserved modalities and correction of complex batch/domain effects (Arriola et al., 2024, Li et al., 30 Sep 2025).
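The contrastive alignment objective mentioned in the list above can be sketched as follows. This is a minimal, pure-Python illustration of a symmetric InfoNCE loss over paired embeddings, assuming that row `i` of each input comes from the same cell. The function names, the symmetric (two-direction) form, and the temperature value are assumptions made for illustration, not details taken from the cited models.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_partition = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def info_nce(
    rna: Sequence[Sequence[float]],
    atac: Sequence[Sequence[float]],
    temperature: float = 0.1,  # illustrative value, not from the source
) -> float:
    """Symmetric InfoNCE: each RNA embedding should pick out its own ATAC
    embedding among all ATAC embeddings in the batch, and vice versa."""
    rna_n = [l2_normalize(v) for v in rna]
    atac_n = [l2_normalize(v) for v in atac]
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in atac_n]
        for r in rna_n
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


rna_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(rna_embeddings, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(rna_embeddings, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

The loss is near zero when matched cells are closest to each other in the joint space and grows when a cell's nearest cross-modal neighbor belongs to a different cell. In practice this would be computed on batches of encoder outputs with an autodiff framework rather than on Python lists.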
Scalability benchmarks indicate nearly linear runtime and memory growth with increasing cell count for modern architectures (e.g., scMamba: 377k cells, <6 h, <80 GB GPU; Nephrobase Cell+: 39.5 M profiles, ∼100 B pretraining tokens) (Yuan et al., 25 Jun 2025, Li et al., 30 Sep 2025). OT-based methods falter above ∼30k cells due to quadratic cost in pairwise couplings (Sun et al., 28 Oct 2025).
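A back-of-the-envelope calculation illustrates why quadratic pairwise couplings become limiting. The sketch below assumes a dense coupling matrix stored as 32-bit floats between two equally sized cell sets; actual OT implementations may use different precision, mini-batching, or low-rank approximations, so the numbers are indicative only. The function name is hypothetical.

```python
def dense_coupling_gib(n_source: int, n_target: int, bytes_per_entry: int = 4) -> float:
    """Memory (GiB) for a dense n_source x n_target coupling matrix.

    Assumes float32 entries by default; hypothetical helper for illustration.
    """
    return n_source * n_target * bytes_per_entry / 2**30


# Cell counts taken from the scalability figures quoted above.
for n_cells in (30_000, 377_000):
    print(f"{n_cells:>7} cells per side: {dense_coupling_gib(n_cells, n_cells):8.1f} GiB")
```

Under these assumptions, a 30k-by-30k coupling needs roughly 3.4 GiB, while a 377k-by-377k coupling would need on the order of 530 GiB, about 158 times more for a 12.6-fold increase in cell count.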
4. Quantitative Performance and Biological Insights
Integration efficacy is assessed by multi-metric suites, including ARI, NMI, silhouette, cLISI/iLISI (label/batch mixing), kBET, PCR (batch regression), and biological signal preservation (Sun et al., 28 Oct 2025, Li et al., 30 Sep 2025). Key findings:
- Clustering and cell-type annotation: Organ-specialized models (Nephrobase Cell+) achieve ARI/NMI >0.8 on kidney data and cross-species zero-shot accuracy >90%; scMRDR and scMamba outperform standard methods (Seurat, GLUE, Harmony) on batch correction and biology preservation (Sun et al., 28 Oct 2025, Li et al., 30 Sep 2025, Yuan et al., 25 Jun 2025).
- Trajectory and regulatory inference: Reservoir-based regressors (Echo State Networks), manifold alignment, and disentangled latent models reveal nonlinear co-variation and lineage progressions, enabling both pseudotime mapping and accurate peak-to-gene linkages (Mehta et al., 2023, Mao et al., 2022, Yuan et al., 25 Jun 2025).
- Spatial and image integration: Dual-encoder architectures establish cross-modal representations linking cell morphology to gene/protein expression; models such as PAST enable virtual staining and survival prediction purely from H&E pathology (Yang et al., 8 Jul 2025, Acosta et al., 13 Aug 2025).
- Handling missing data: SC⁵ VAE achieves state-of-the-art imputation, clustering, and classification in cross-cohort, missing-modality contexts (Arriola et al., 2024).
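For readers unfamiliar with the clustering metrics named above, the following is a minimal sketch of the adjusted Rand index (ARI) computed from the contingency counts of two labelings. It follows the standard pair-counting definition. The function name and the toy labels are illustrative, and benchmark suites would normally rely on an established library implementation.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]
) -> float:
    """ARI between two labelings of the same cells (1.0 = identical partitions)."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length with at least 2 cells")

    # Pairs of cells that share a cluster in both, in truth only, in prediction only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / comb(n, 2)  # chance-level agreement
    max_index = 0.5 * (pairs_true + pairs_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (pairs_both - expected) / (max_index - expected)


annotated = ["T cell", "T cell", "B cell", "B cell"]
print(adjusted_rand_index(annotated, [1, 1, 0, 0]))  # same partition, renamed clusters
print(adjusted_rand_index(annotated, [0, 1, 0, 1]))  # clusters cut across cell types
```

Because ARI compares partitions rather than label names, the first call scores a perfect match even though the cluster identifiers differ, while the second scores below zero, meaning agreement is worse than expected by chance.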
5. Interpretability, Flexibility, and Biological Relevance
- Disentanglement and conditional independence: Hierarchical models (CAVACHON) use DAGs to separate common and distinct latent factors, supporting interpretable decomposition of biological signals and explicit modeling of causal/conditional relationships between modalities (Hsieh et al., 2024).
- Feature co-clustering: Information-theoretic approaches (scICML) execute matched co-clustering of features within and across modalities, reflecting true regulatory dependencies and denoising complex, noisy multiome data (Zeng et al., 2022).
- Knowledge-augmented modeling: Integration of open-world biomedical knowledge via LLM-based retrieval-augmented generation (RAG) pipelines enriches cell metadata and improves text/omics alignment, enabling interpretable, robust cell-text retrieval and annotation in real-world, noisy datasets (Wang et al., 9 Jan 2026).
- Modalities beyond genomics: Current frameworks (scMRDR, PAST, Nephrobase Cell+) naturally extend to more than two modalities—epigenome, proteome, spatial context, and even clinical or language data—subject to appropriate encoder/decoder adaptation and regularization (Sun et al., 28 Oct 2025, Yang et al., 8 Jul 2025, Wang et al., 9 Jan 2026).
6. Current Limitations and Future Trajectories
Unresolved issues include robustness of adversarial training (mode collapse, instability), feature aggregation strategies that may lose locus-specific information, trade-offs between scalability and raw feature-level interpretability, and the need for more comprehensive, domain-adaptive training data (especially for spatial and image omics) (Sun et al., 28 Oct 2025, Yang et al., 8 Jul 2025, Li et al., 30 Sep 2025). Proposed future directions include:
- Extension to spatial-dynamic multi-omics and perturbation/CRISPR screens (Sun et al., 28 Oct 2025, Yang et al., 8 Jul 2025).
- Incorporation of knowledge graphs and feature-level regulatory networks into hierarchical frameworks (Hsieh et al., 2024).
- Pretraining on even larger, globally harmonized tissue and assay compendia, including multi-organ atlases and rare cell states (Li et al., 30 Sep 2025, Wang et al., 9 Jan 2026).
- Enhanced interpretability via co-TTT layers, feature attribution, and alignment with experimental/clinical outcomes (Meng et al., 2024, Yang et al., 8 Jul 2025).
- Generalization to truly open-world modalities—text, images, spatial context—with reliability-aware alignment and curriculum learning (Wang et al., 9 Jan 2026).
Single-cell multimodal omics, leveraging scalable, domain-informed, and biologically regularized computational models, is now central to the next generation of cellular and tissue-level regulatory mapping, biomarker discovery, and functional annotation in both benchmark and clinical settings.