RNA-FM Embeddings for RNA Informatics
- RNA-FM embeddings are continuous, high-dimensional representations derived from pre-trained transformer and RNN models that extract fine-grained RNA sequence and structural information.
- They enable alignment-free, framework-agnostic feature extraction and improve downstream tasks such as RNA structure prediction, function annotation, and spatial transcriptomics, with measurable gains in base-pair F1 and 3D RMSD.
- Various architectures—from 12-layer transformer models to 2D RNN-based k-mer embeddings—are tailored to capture both primary sequences and higher-order structural signals, aiding gene clustering and protein-RNA binding predictions.
RNA-FM embeddings are continuous, high-dimensional representations derived from foundation models trained on large-scale RNA sequence data. The term “RNA-FM” generally refers to pre-trained models, especially those based on deep transformer or recurrent neural network (RNN) architectures, that can extract, encode, and transfer fine-grained structural and functional information across diverse RNA sequences and transcriptomics contexts. Embeddings produced by RNA-FMs have demonstrated exceptional capacity for framework-agnostic feature extraction, downstream transferability, and improved performance on tasks such as RNA structure prediction, function annotation, sequence design, and spatial transcriptomics. These methods have considerably influenced RNA informatics by enabling alignment-free, scalable, and structure-aware sequence representations (Chen et al., 2022).
1. Foundational Architectures for RNA-FM Embeddings
Early RNA-FM embeddings were established using sequence-to-vector pipelines, typically leveraging deep bidirectional transformer encoders parameterized in the style of BERT or similar architectures. For instance, the foundational model introduced by Chen et al. utilized a 12-layer bidirectional transformer (hidden size 640, 20 attention heads/layer, layer normalization, and residuals) with a vocabulary including both canonical bases and IUPAC ambiguity codes (Chen et al., 2022). Tokenized RNA sequences (up to length 1,024) are embedded and subjected to self-supervised learning objectives, most commonly masked language modeling (MLM). In this paradigm, the model learns to reconstruct randomly masked nucleotides based solely on non-masked context, thus extracting rich contextualized per-nucleotide 640-dimensional embeddings as well as global mean-pooled sequence embeddings.
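The extraction step described above can be sketched in a minimal, self-contained form. The following toy example (the vocabulary, embedding table, and dimensions are illustrative stand-ins, not the released RNA-FM checkpoint, and the 12 transformer blocks are omitted) shows how a tokenized RNA sequence maps to per-nucleotide 640-dimensional embeddings and a mean-pooled global feature:

```python
import numpy as np

# Toy vocabulary: canonical bases plus a few IUPAC ambiguity codes
# (the real RNA-FM vocabulary is larger; this is an illustrative stand-in).
VOCAB = {tok: i for i, tok in enumerate("ACGUNRY")}
EMBED_DIM = 640   # hidden size reported for the Chen et al. model
MAX_LEN = 1024    # maximum tokenized sequence length

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((len(VOCAB), EMBED_DIM))

def tokenize(seq: str) -> np.ndarray:
    """Map an RNA string to integer token ids (truncated to MAX_LEN)."""
    return np.array([VOCAB[c] for c in seq[:MAX_LEN]])

def embed(seq: str) -> tuple[np.ndarray, np.ndarray]:
    """Return (L x 640 per-nucleotide embeddings, 640-d mean-pooled vector).

    A real RNA-FM would pass the token embeddings through 12 transformer
    blocks; here we stop at the embedding lookup to keep the sketch
    self-contained.
    """
    per_nt = embedding_table[tokenize(seq)]
    return per_nt, per_nt.mean(axis=0)

per_nt, pooled = embed("ACGUACGU")
print(per_nt.shape, pooled.shape)  # (8, 640) (640,)
```

In the actual model, the lookup output would be summed with positional encodings and transformed by the stacked attention blocks before pooling.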
Other frameworks, such as the latent transcriptome model (“RNA-FM” (Trofimov et al., 2018)), approached k-mer-level embedding using a recurrent neural network (bi-LSTM, two layers × 256 hidden units per direction) to transform one-hot-encoded k-mers (k=24) into a compact 2-dimensional representation. This setting enables both the encoding of sequence similarity and the modeling of sample-specific abundance by concatenating low-dimensional k-mer embeddings with learnable patient/sample embeddings.
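The k-mer decomposition feeding this encoder can be sketched as a simple sliding window (k = 24 as in Trofimov et al.; the example read below is arbitrary):

```python
def extract_kmers(seq: str, k: int = 24) -> list[str]:
    """Decompose a read into its overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

read = "ACGTACGTACGTACGTACGTACGT" + "ACGT"  # 28 nt -> 5 overlapping 24-mers
kmers = extract_kmers(read)
print(len(kmers), len(kmers[0]))  # 5 24
```

Each such 24-mer is then one-hot encoded and passed through the bi-LSTM to obtain its 2-dimensional embedding.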
2. Embedding Extraction and Mathematical Formalism
The mathematical mapping for RNA-FM embeddings depends on the model backbone. In transformer-based RNA-FMs (Chen et al., 2022, Si et al., 27 Jan 2026), a tokenized input sequence $x = (x_1, \dots, x_L)$ is projected via a learned embedding matrix $E$ and added to a positional encoding $P$:

$$H^{(0)} = E[x] + P.$$

The stacked transformer blocks then process these representations to generate per-position hidden states:

$$H^{(l)} = \mathrm{TransformerBlock}_l\big(H^{(l-1)}\big), \qquad l = 1, \dots, 12.$$

The final embedding output is the hidden state after the final transformer block, typically $H^{(12)} \in \mathbb{R}^{L \times 640}$. For sequence-wise applications, the mean-pooled embedding across positions, $\bar{h} = \tfrac{1}{L}\sum_{i=1}^{L} H^{(12)}_i$, yields a 640-dimensional global feature.
In latent transcriptome (k-mer) models, the per-k-mer embedding is formalized as

$$z_i = f_\theta(x_i),$$

where $f_\theta$ is a bi-LSTM encoder and $x_i$ is the one-hot-encoded k-mer. To incorporate sample context, each individual $j$ is associated with a learnable embedding $s_j$. These are concatenated and passed to a multilayer perceptron for abundance prediction,

$$\hat{y}_{ij} = \mathrm{MLP}\big([z_i; s_j]\big),$$

and optimized by quadratic loss over observed k-mer counts.
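A forward pass of this abundance-prediction pipeline can be sketched as follows. To stay self-contained, a random linear map stands in for the bi-LSTM encoder $f_\theta$, the sample-embedding size is an assumption, and all weights are untrained; only the data flow (one-hot k-mer → embedding → concatenation with sample embedding → MLP) mirrors the formalism above:

```python
import numpy as np

K = 24            # k-mer length used by the latent transcriptome model
EMBED_DIM = 2     # 2-D k-mer embedding reported in the paper
SAMPLE_DIM = 2    # illustrative sample-embedding size (assumption)

rng = np.random.default_rng(1)

def one_hot_kmer(kmer: str) -> np.ndarray:
    """Flatten a k-mer into a 4*K one-hot vector over {A, C, G, T}."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((K, 4))
    for pos, base in enumerate(kmer):
        x[pos, idx[base]] = 1.0
    return x.ravel()

# Linear stand-in for the bi-LSTM encoder f_theta (illustrative only).
W_enc = rng.standard_normal((4 * K, EMBED_DIM))

# Small MLP head mapping [z_i; s_j] -> predicted abundance.
W1 = rng.standard_normal((EMBED_DIM + SAMPLE_DIM, 8))
W2 = rng.standard_normal(8)

def predict_abundance(kmer: str, sample_emb: np.ndarray) -> float:
    z = one_hot_kmer(kmer) @ W_enc                   # z_i = f_theta(x_i)
    h = np.tanh(np.concatenate([z, sample_emb]) @ W1)
    return float(h @ W2)                             # y_hat_{ij}

sample_j = rng.standard_normal(SAMPLE_DIM)           # learned s_j in training
y_hat = predict_abundance("ACGT" * 6, sample_j)
```

In training, the quadratic loss over observed counts would drive gradients through the MLP, the sample embeddings, and the encoder jointly.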
A summary of the core RNA-FM embedding extraction pipelines is provided below:
| Model | Tokenization | Embedding Dim | Encoder | Embedding Output |
|---|---|---|---|---|
| RNA-FM (Chen) | {A, C, G, U, ...} | 640 | 12L BERT/Transf. | L×640 (per-nucleotide), mean-pool |
| Latent Transcriptome | k-mer ({A,C,G,T}) | 2 | 2L Bi-LSTM (256u) | 2 (per k-mer), +sample embedding |
3. Key Properties and Interpretability of RNA-FM Embeddings
Empirical evaluations have shown RNA-FM embeddings to encode both primary sequence and higher-order structure and function signals. Unsupervised visualizations (e.g., UMAP of per-sequence embeddings) reveal that embeddings cluster distinctly by RNA biotype (housekeeping vs regulatory, small vs long ncRNA), in contrast to random or one-hot encoding (Chen et al., 2022). Evolutionary relationships, such as species pseudotime progression of lncRNAs, are recapitulated in the embedding space without explicit phylogenetic supervision.
At the k-mer level, latent transcriptome embeddings in $\mathbb{R}^2$ cluster k-mers with shared sequence or abundance profiles. Distinct exons manifest as discrete bands; common k-mers from homologous genes (e.g., ZFX/ZFY) cluster tightly; and private k-mers associated with patient-specific SNPs or indels are visually separable without external genome annotation (Trofimov et al., 2018).
4. Downstream Application Domains
RNA-FM embeddings have been applied to a spectrum of tasks including, but not limited to:
- RNA secondary structure prediction: Feature-based models using per-nucleotide RNA-FM embeddings as input (e.g., ResNet-32) surpass state-of-the-art baselines on base-pair precision/recall/F1 (ArchiveII600: P=0.936, R=0.951, F1=0.941) and generalize across multiple datasets (Chen et al., 2022).
- RNA 3D structure/contact regression: Combined with equivariant graph networks, sequence and per-nucleotide embeddings enable end-to-end 3D modeling, achieving atomic RMSD ≈4 Å versus >10 Å with sequence-only input (Chen et al., 2022).
- Expression and ribosome load prediction: CNNs using RNA-FM embeddings improve mean ribosome load regression (R²=0.876 vs 0.860 for one-hot baseline) (Chen et al., 2022).
- Protein–RNA binding: Embeddings used wholesale or as input to PrismNet pipelines increase AUPRC in protein–RNA binding site prediction (mean AUPRC: Seq+RNA-FM=0.824 vs Seq only=0.815) (Chen et al., 2022).
- RNA design and inverse folding: In latent diffusion frameworks, RNA-FM embeddings serve as frozen feature extractors, with further compression into a lower-dimensional latent space for efficient diffusion-based optimization of 3D and structural objectives (Sequence Recovery ↑, MFE, SS, LDDT) (Si et al., 27 Jan 2026).
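The common pattern across these applications is to treat the pre-trained embeddings as frozen features and train only a lightweight task head. The sketch below illustrates this with random stand-in features (in practice these would be mean-pooled RNA-FM outputs), synthetic binary labels, and a minimal logistic-regression head trained by gradient descent; all data here is fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for frozen RNA-FM features: mean-pooled 640-d embeddings for a
# small batch of sequences (in practice produced by the pre-trained model).
n_seqs, dim = 32, 640
features = rng.standard_normal((n_seqs, dim))
labels = rng.integers(0, 2, size=n_seqs)  # e.g. binding vs non-binding

# Minimal logistic-regression head trained on top of the frozen features.
w, b = np.zeros(dim), 0.0
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))   # sigmoid probabilities
    grad = features.T @ (p - labels) / n_seqs        # cross-entropy gradient
    w -= lr * grad
    b -= lr * float(np.mean(p - labels))

preds = (1.0 / (1.0 + np.exp(-(features @ w + b)))) > 0.5
acc = float(np.mean(preds == labels))
```

Because the backbone stays frozen, only the head parameters (`w`, `b`) are optimized, which is what makes RNA-FM features cheap to reuse across the task families listed above.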
5. Comparative and Contextual Models
Multiple groups have expanded on RNA-FM’s core paradigm:
- mRNA2vec (Zhang et al., 2024) combines a teacher-student (data2vec) self-supervised setup with contextual masking and auxiliary tasks (minimum free energy regression and secondary structure classification) to improve mRNA stability and translation efficiency tasks.
- OmniGenome (Yang et al., 2024) extends RNA-FM to explicitly align sequence and secondary structure modalities via structure-contextualized multi-objective pretraining, establishing bidirectional mappings (Seq2Str/Str2Seq) that substantially enhance zero-shot structure prediction and design tasks (e.g., solving 74% of EternaV2 RNA design puzzles).
- SAGE-FM (Zhan et al., 21 Jan 2026) transfers the FM paradigm to spatial transcriptomics, using graph convolutional net (GCN)-derived spatial spot embeddings that preserve tissue structure, enable robust gene recovery, and capture biological heterogeneity in unsupervised clustering.
6. Limitations and Open Directions
RNA-FM embeddings, while effective, present certain limitations:
- Scalability: Some architectures, e.g., latent transcriptome 2D k-mer embeddings, are not tractable for whole-genome-scale data without further filtering or dimensionality reduction (Trofimov et al., 2018).
- Dimensionality Choices: Fixed low-dimensional projections (e.g., 2D for latent transcriptome) may forfeit nuanced structural discrimination in favor of direct visualization; transformer-based embeddings retain high-dimensionality (e.g., 640D), requiring additional architecture-tuning for some tasks (Chen et al., 2022).
- Modal Coverage: Initial RNA-FM models ignore explicit structure annotations; recent designs (OmniGenome) show state-of-the-art results by integrating structure tokens directly (Yang et al., 2024).
- Generalizability: mRNA2vec evaluations note limitations in species diversity during pre-training; broader taxonomic coverage is likely to further enhance generalization (Zhang et al., 2024).
Potential expansions include integration of nucleotide- and codon-level tokenization, more detailed structural labels for auxiliary tasks, and expanded cross-modal objectives to further align functional and structural signals.
7. Summary Table of Key RNA-FM Embedding Models
| Model | Training Corpus | Embedding Dim | Core Architecture | Downstream Task Classes | Key Benchmarks/Results |
|---|---|---|---|---|---|
| RNA-FM (Chen et al.) | 23M+ ncRNAs, unannotated | 640 | 12L Transformer | Structure, function, evolution, expression | F1(base-pair)>0.94, 3D RMSD~4 Å |
| Latent Transcriptome | Raw RNA-seq k-mers | 2 (per k-mer) | 2L Bi-LSTM | Exon/fusion/mutation recovery | R²(coverage)~0.9, gene clustering, fusion det. |
| mRNA2vec | 5′ UTR–CDS mRNAs | 256 | 4L T5-style Transformer | Translation/stability, MFE, SS | Spearman ρ>0.75/0.80 (TE/EL), CDS ρ~0.53 |
| OmniGenome | 1K+ datasets, Seq+Str | 480/720 | 16–32L Transformer, RoPE | Zero-shot structure, design, DNA transfer | 74% EternaV2 (design), macro-F1>0.75 |
| SAGE-FM | 416 Visium spatial slides | 1024 | 5L GCN | Spatial clustering, subtype annotation, L–R | RMSE=0.305, 81% OSCC ann., spatial perturb. |
RNA-FM embeddings have established themselves as the standard feature space for RNA informatics, routinely outperforming or subsuming bespoke, task-limited methods in structure, function, expression, and design-related tasks, and are actively being extended to multimodal and spatial settings (Chen et al., 2022, Yang et al., 2024, Zhang et al., 2024, Zhan et al., 21 Jan 2026).