
RNA Grammar VAE for RNA Design

Updated 19 February 2026
  • RNA Grammar VAE is a generative model that combines a stochastic context-free grammar with a variational autoencoder to ensure valid RNA secondary structures.
  • It employs a stack-based masked decoder and one-hot encoded grammar rules to maintain syntactic conformity while optimizing sequence attributes.
  • The model outperforms traditional methods by using Bayesian optimization in a low-dimensional latent space to meet multi-constraint design objectives.

The RNA Grammar Variational Autoencoder (RGVAE) is a generative model for the design of structurally stable RNA sequences possessing specific target properties. RGVAE integrates a stochastic context-free grammar (SCFG) tailored for RNA secondary structure with a variational autoencoder (VAE) framework. This combination guarantees that generated sequences conform to RNA secondary structure constraints while enabling continuous, low-dimensional optimization of sequence-level and biophysical objectives. The architecture leverages explicit syntactic parsing and latent-variable modeling to significantly improve design efficiency and quality over methods that do not incorporate structured grammar (Zarnaghinaghsh et al., 21 Jul 2025).

1. Stochastic Context-Free Grammar for RNA Structure

RGVAE is grounded in a specialized SCFG, $G=(V,\Sigma,R,S)$, engineered to produce only those RNA sequences that conform to valid, properly nested secondary structures. The grammar components are:

  • Nonterminals: $V = \{S, L, F\}$
    • $S$: Start symbol
    • $L$: Loop (unpaired region)
    • $F$: Paired region
  • Terminals: $\Sigma = \{a, \hat{a}\}$
    • $a$: Any single nucleotide in $\{A, C, G, U\}$
    • $\hat{a}$: The base-pairing partner of $a$
  • Production rules:
    • $S \to LS \;|\; L$
    • $L \to aF\hat{a} \;|\; a$
    • $F \to aF\hat{a} \;|\; LS$

Each rule $r \in R$ carries a probability $p(r)$ such that for any nonterminal $\alpha$, $\sum_{r:\text{lhs}(r)=\alpha} p(r) = 1$. These parameters are trained on large RNA datasets (e.g., 101,754 tRNA sequences from Rfam) using the Inside–Outside (EM) algorithm. Given a string $x \in \Sigma^*$, its parse tree $T$ is a rooted, ordered tree reflecting one valid derivation under $G$, with probability $p(T) = \prod_{r \in T} p(r)$ and marginal $p(x) = \sum_{T \Rightarrow x} p(T)$.
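As a minimal sketch of how such a grammar generates properly nested structures, the following samples one leftmost derivation and accumulates $\log p(T)$ as the sum of the log-probabilities of the rules used. The rule probabilities, the uniform choice of nucleotide for $a$, and the restriction of $\hat{a}$ to Watson-Crick partners are illustrative assumptions, not the values fitted by Inside–Outside.

```python
import math
import random

# Hypothetical rule probabilities (each nonterminal's options sum to 1).
RULES = {
    "S": [(("L", "S"), 0.3), (("L",), 0.7)],
    "L": [(("a", "F", "a_hat"), 0.2), (("a",), 0.8)],
    "F": [(("a", "F", "a_hat"), 0.6), (("L", "S"), 0.4)],
}
# Assumption: a_hat is the Watson-Crick partner of a (wobble pairs ignored).
PARTNER = {"A": "U", "U": "A", "G": "C", "C": "G"}


def sample_derivation(rng, max_steps=100_000):
    """Sample (sequence, dot-bracket structure, log p(T)) from the grammar."""
    stack = ["S"]
    seq, dots, log_p = [], [], 0.0
    steps = 0
    while stack:
        sym = stack.pop()
        if isinstance(sym, tuple):
            # Pending closing base of a pair opened earlier.
            seq.append(sym[1])
            dots.append(")")
            continue
        steps += 1
        if steps > max_steps:
            raise RuntimeError("derivation did not terminate")
        options = RULES[sym]
        rhs, p = rng.choices(options, weights=[w for _, w in options])[0]
        log_p += math.log(p)
        if rhs[0] == "a":
            base = rng.choice("ACGU")
            seq.append(base)
            if len(rhs) == 3:
                # a F a_hat: open a pair, expand F, then close with the partner.
                dots.append("(")
                stack.append(("close", PARTNER[base]))
                stack.append("F")
            else:
                dots.append(".")
        else:
            # Push right-hand side in reverse so the leftmost symbol is expanded first.
            stack.extend(reversed(rhs))
    return "".join(seq), "".join(dots), log_p


if __name__ == "__main__":
    print(sample_derivation(random.Random(0)))
```

Because every opening base is emitted together with a deferred closing partner, any string produced this way is balanced by construction, which is the property the grammar is designed to guarantee.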

2. Variational Autoencoder Framework

The VAE encodes and decodes entire RNA parse trees within a continuous latent space:

  • Input Representation: Each RNA sequence $x$ is parsed into its unique $G$-parse, linearized into a length-$T_\text{max}$ sequence $X = (x_1, \ldots, x_{T_\text{max}})$ of one-hot grammar rule indices.
  • Latent Prior: $p(z) = \mathcal{N}(0, I)$.
  • Encoder: A 1D-CNN (as in Kusner et al.) maps $X$ to $(\mu_\phi(X), \log \sigma^2_\phi(X))$, the mean and log-variance of $q_\phi(z|X)$. Sampling uses the standard reparameterization: $\epsilon \sim \mathcal{N}(0, I)$, $z = \mu_\phi(X) + \sigma_\phi(X)\odot \epsilon$.
  • Decoder: A single-layer LSTM emits logit vectors $f_t \in \mathbb{R}^L$ (here $L$ is the number of grammar rules, not the nonterminal $L$). To enforce grammaticality, a LIFO stack manages recursive expansions:

    1. Pop the top nonterminal $\alpha$.
    2. Apply the binary mask $m_\alpha$, which zeroes out every rule whose left-hand side is not $\alpha$.
    3. Compute the masked probability:

    $$p(x_t = l \mid \alpha, z) = \frac{m_{\alpha,l} \exp(f_{t,l})}{\sum_{j=1}^L m_{\alpha,j} \exp(f_{t,j})}.$$

    4. Sample a rule and push the nonterminals of its right-hand side onto the stack.
  • Objective: Training maximizes the usual ELBO:

$$\mathcal{L}(\phi, \theta; X) = \mathbb{E}_{z \sim q_\phi(z|X)} [\log p_\theta(X|z)] - \mathrm{KL}[q_\phi(z|X)\,\Vert\,p(z)].$$
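The stack-based masked decoding steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the rule indexing, the `decode` helper, and the use of externally supplied logits (standing in for the LSTM outputs $f_t$) are all hypothetical.

```python
import math
import random

# Rules as (left-hand side, right-hand side); the index order is an assumption.
RULES = [
    ("S", ("L", "S")), ("S", ("L",)),
    ("L", ("a", "F", "a_hat")), ("L", ("a",)),
    ("F", ("a", "F", "a_hat")), ("F", ("L", "S")),
]
NONTERMINALS = {"S", "L", "F"}
# m_alpha: 1 for rules rooted at alpha, 0 otherwise.
MASK = {nt: [1.0 if lhs == nt else 0.0 for lhs, _ in RULES] for nt in NONTERMINALS}


def masked_softmax(logits, mask):
    """Softmax restricted to the entries where mask is nonzero."""
    m = max(f for f, k in zip(logits, mask) if k)  # stabilise the exponentials
    w = [math.exp(f - m) if k else 0.0 for f, k in zip(logits, mask)]
    total = sum(w)
    return [x / total for x in w]


def decode(logits_per_step, rng):
    """Return (rule indices, leftover stack); a non-empty stack means T_max was hit."""
    stack = ["S"]
    rules = []
    for f_t in logits_per_step:
        if not stack:
            break
        alpha = stack.pop()                                       # step 1
        probs = masked_softmax(f_t, MASK[alpha])                  # steps 2-3
        l = rng.choices(range(len(RULES)), weights=probs)[0]      # step 4: sample
        rules.append(l)
        # Push right-hand-side nonterminals in reverse so the leftmost is expanded next.
        stack.extend(s for s in reversed(RULES[l][1]) if s in NONTERMINALS)
    return rules, stack


if __name__ == "__main__":
    rng = random.Random(1)
    fake_logits = [[rng.gauss(0, 1) for _ in RULES] for _ in range(200)]
    print(decode(fake_logits, rng)[0][:10])
```

Since masked-out rules receive probability exactly zero, every sampled rule is rooted at the nonterminal just popped, so any decoded rule sequence corresponds to a derivation of the grammar (possibly truncated if the step budget runs out).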

3. Latent Embedding and Optimization

Parse trees are mapped via depth-first traversal and one-hot encoding to $T_\text{max} \times L$ tensors, then through the CNN encoder to latent vectors $z\in\mathbb{R}^{d_z}$. Empirically, $d_z=10$ provided the optimal trade-off between information preservation and regularization. The decoder reconstructs the parse (and thus the sequence) via masked rule sampling; no beam search is performed. This latent space admits efficient optimization for complex RNA design objectives.
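A minimal sketch of the encoding step, assuming a parse tree is stored as `(rule_index, children)` with one child per right-hand-side nonterminal, and that unused time steps are padded with all-zero rows. Both the tree layout and the padding convention are assumptions made for illustration; the rule indices follow the order $S \to LS$, $S \to L$, $L \to aF\hat{a}$, $L \to a$, $F \to aF\hat{a}$, $F \to LS$.

```python
def linearize(tree):
    """Depth-first (pre-order) traversal of a parse tree into rule indices."""
    rule_id, children = tree
    out = [rule_id]
    for child in children:
        out.extend(linearize(child))
    return out


def one_hot_rules(rule_ids, num_rules, t_max):
    """Encode rule indices as a t_max x num_rules matrix, zero-padded at the end."""
    if len(rule_ids) > t_max:
        raise ValueError("parse is longer than t_max")
    X = [[0] * num_rules for _ in range(t_max)]
    for t, l in enumerate(rule_ids):
        X[t][l] = 1
    return X


# Parse of a 4-nucleotide hairpin "(..)":
# S -> L, L -> a F a_hat, F -> L S, L -> a, S -> L, L -> a
tree = (1, [(2, [(5, [(3, []), (1, [(3, [])])])])])
rule_ids = linearize(tree)
X = one_hot_rules(rule_ids, num_rules=6, t_max=8)
```

Here `X` has shape $8 \times 6$: the first six rows each contain a single 1 marking the rule applied at that step, and the last two rows are padding.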

To tailor output sequences to multiple, potentially competing design goals, an aggregate score function $F(x) = \sum_i g_i(x)$ is constructed from normalized objective-specific metrics $g_i(x) \in [0,1]$. These cover minimum free energy (computed via ViennaRNA), GC-content, sequence length, motif constraints, and secondary-structure similarity. Bayesian optimization (e.g., Gaussian process–based) searches $z$-space to maximize $F(\text{decode}_\theta(z))$ with a limited query budget (on the order of ten to fifty evaluations). The best latent point found, $z^*$, yields the decoded sequence most compatible with the specified constraints.
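To make the aggregate score concrete, here is a small sketch with three hypothetical terms $g_i$ (GC-content, length, and motif constraints). The normalizations, default targets, and function names are assumptions chosen for illustration; the MFE and structure-similarity terms are omitted because they require an external folding tool such as ViennaRNA.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    return sum(b in "GC" for b in seq) / len(seq)


def gc_score(seq, target=0.5):
    """1 at the target GC fraction, decreasing linearly to 0 at the farthest extreme."""
    worst = max(target, 1.0 - target)
    return max(0.0, 1.0 - abs(gc_content(seq) - target) / worst)


def length_score(seq, lo=100, hi=150):
    """1 if the length lies in [lo, hi], else 0."""
    return 1.0 if lo <= len(seq) <= hi else 0.0


def motif_score(seq, required=(), forbidden=()):
    """1 if all required motifs occur and no forbidden motif occurs, else 0."""
    ok = all(m in seq for m in required) and not any(m in seq for m in forbidden)
    return 1.0 if ok else 0.0


def aggregate_score(seq, required=(), forbidden=()):
    """F(x) = sum_i g_i(x), with each g_i in [0, 1]."""
    return gc_score(seq) + length_score(seq) + motif_score(seq, required, forbidden)


good = "GCAU" * 25   # length 100, GC fraction 0.5
bad = "A" * 10       # too short, no GC, forbidden motif present
print(aggregate_score(good, required=("GCAU",), forbidden=("AAAA",)))  # 3.0
print(aggregate_score(bad, required=("GCAU",), forbidden=("AAAA",)))   # 0.0
```

In a full pipeline, an optimizer over $z$ would call a function like `aggregate_score` on each decoded sequence; keeping every term in $[0,1]$ prevents any single objective from dominating the sum.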

4. Architecture and Training Regimen

  • Encoder: 1D convolutional stack (3 layers, ReLU, pooling) followed by $(\mu, \log\sigma^2)$ output heads.
  • Decoder: Single-layer LSTM (hidden size $\approx 256$) that takes $z$ as its initial input and outputs a logit vector $f_t \in \mathbb{R}^L$ at each step.
  • Maximum steps: $T_\text{max}$ is set to approximately the maximum number of grammar rules in any parse in the Rfam training set ($\approx 200$).
  • Optimization: Adam, learning rate $10^{-3}$, batch size 128.
  • Data preparation: Each tRNA sequence is run through a CYK parser to extract its unique parse, which is then encoded for SCFG estimation (Inside–Outside) and VAE training.
  • Validation: 3-fold cross-validation for both architectural decisions and early stopping.

5. Empirical Evaluation

Multiple experiments establish RGVAE's advantages:

  • Grammar ablation: Comparing a simplified grammar $G_0$ to the designed grammar $G$ for minimum free energy (MFE), $G_0$ failed to exceed the training set's best (min MFE $\approx -91$ kcal/mol), while $G$ enabled generation of sequences with MFE as low as $-261$ kcal/mol.
  • Baseline comparison: Randomized design under SCFG yielded no systematic MFE gains; RNAGEN (Ozden et al. 2023) achieved weaker minimization (min $\approx -118$ kcal/mol). RGVAE consistently produced lower MFE sequences across GC-contents ($20$–$70\%$).
  • Multi-constraint targets: For GC $=50\pm2\%$ and length $=100$–$150$, RGVAE found sequences with MFE $\approx -146$ kcal/mol, outperforming both training and RNAGEN (which produced no feasible examples).
  • Motif constraint enforcement: In scenarios with mandatory ($M$) and forbidden ($F$) sequence motifs, RGVAE's MFE distributions dominated both training and RNAGEN baselines.
  • Secondary-structure matching: By minimizing base-pair matrix alignment distances to target riboswitch folds, RGVAE achieved scores of $\approx 11.5$ vs. $\approx 50$+ for conventional datasets and RNAGEN.

6. Algorithmic Provisions and Key Equations

RGVAE's theoretical foundation rests on a small set of explicit equations:

  • Grammar rules:

$$G:\; S \rightarrow LS\;|\;L,\quad L\rightarrow aF\hat a\;|\;a,\quad F\rightarrow aF\hat a\;|\;LS$$

  • Decoder sampling (masked softmax):

$$p(x_t=l\mid \alpha,z) = \frac{m_{\alpha,l}\,\exp(f_{t,l})}{\sum_{j=1}^L m_{\alpha,j}\,\exp(f_{t,j})}$$

  • Likelihood of a rule sequence under grammar constraint:

$$p_\theta(X \mid z) = \prod_{t=1}^{T(X)} p(x_t \mid \alpha_t, z)$$

  • Variational objective (ELBO):

$$\mathcal{L}(\phi, \theta; X) = \mathbb{E}_{z\sim q_\phi(z\mid X)} \bigl[\log p_\theta(X \mid z)\bigr] - \mathrm{KL}\bigl[q_\phi(z \mid X)\,\Vert\,p(z)\bigr]$$
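For reference, the two terms of the ELBO can be expanded into forms that are computed directly in practice. The reconstruction term follows from taking the logarithm of the rule-sequence likelihood above. The KL term is the standard closed form for a Gaussian posterior with diagonal covariance against a standard normal prior; the diagonal-covariance assumption is consistent with the encoder outputting a mean and an elementwise log-variance, but it is stated here as an assumption.

```latex
\log p_\theta(X \mid z) = \sum_{t=1}^{T(X)} \log p(x_t \mid \alpha_t, z),
\qquad
\mathrm{KL}\bigl[q_\phi(z \mid X)\,\Vert\,p(z)\bigr]
  = \frac{1}{2} \sum_{i=1}^{d_z}
    \Bigl( \mu_{\phi,i}(X)^2 + \sigma_{\phi,i}(X)^2 - \log \sigma_{\phi,i}(X)^2 - 1 \Bigr)
```

The KL term vanishes exactly when $\mu_\phi(X) = 0$ and $\sigma_\phi(X) = 1$ in every coordinate, that is, when the posterior coincides with the prior.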

A core element is the stack-based masked decoder, which ensures that every emission step is restricted to the rules the grammar permits for the current nonterminal, as detailed in the pseudocode of the source paper.

7. Relevance and Distinctions in Sequence Design

RGVAE’s integration of stochastic grammar with variational inference yields an RNA design pipeline that guarantees grammatical validity, supports arbitrary biophysical and sequential constraints, and enables effective optimization via Bayesian search in a low-dimensional space. This design outperforms random sampling, a traditional VAE, and structure-aware baselines in producing sequences that are simultaneously thermodynamically stable, GC-content-compliant, motif-accurate, and optimal for target secondary folding. This approach substantially addresses combinatorial challenges in RNA inverse folding and broadens the toolkit for computational RNA bioengineering (Zarnaghinaghsh et al., 21 Jul 2025).
