Learning to Discover Regulatory Elements for Gene Expression Prediction

Published 19 Feb 2025 in q-bio.GN and cs.AI | (2502.13991v1)

Abstract: We consider the problem of predicting gene expressions from DNA sequences. A key challenge of this task is to find the regulatory elements that control gene expressions. Here, we introduce Seq2Exp, a Sequence to Expression network explicitly designed to discover and extract regulatory elements that drive target gene expression, enhancing the accuracy of the gene expression prediction. Our approach captures the causal relationship between epigenomic signals, DNA sequences and their associated regulatory elements. Specifically, we propose to decompose the epigenomic signals and the DNA sequence conditioned on the causal active regulatory elements, and apply an information bottleneck with the Beta distribution to combine their effects while filtering out non-causal components. Our experiments demonstrate that Seq2Exp outperforms existing baselines in gene expression prediction tasks and discovers influential regions compared to commonly used statistical methods for peak detection such as MACS3. The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/).

Summary

The paper introduces Seq2Exp, a deep learning framework that simultaneously predicts gene expression and learns to discover causal regulatory elements from DNA and epigenomic signals.
Seq2Exp employs separate generator and predictor modules, using an information bottleneck and causal modeling to extract relevant DNA sequences based on combined sequence and epigenomic information.
The Seq2Exp framework achieves state-of-the-art performance in gene expression prediction and extracts regulatory elements that are more predictive than those found by traditional peak-calling methods.

The paper introduces Seq2Exp (Sequence to Expression), a novel deep learning framework for predicting gene expression from DNA sequences, explicitly designed to discover and extract regulatory elements. The framework posits a causal relationship between epigenomic signals, DNA sequences, and regulatory elements, integrating these factors to enhance gene expression prediction accuracy.

Seq2Exp decomposes the learning process into two components. The method employs a generator module to learn a token-level mask based on both DNA sequences and epigenomic signals, extracting relevant DNA sub-sequences. A predictor module then uses these extracted sub-sequences to predict gene expression. By applying an information bottleneck, Seq2Exp filters out non-causal components, ensuring that only the most relevant regions are used for prediction.

Key contributions include:

A framework articulating the causal relationship between epigenomic signals, DNA sequences, target gene expression, and related regulatory elements.
A method to combine the mask probability distribution from DNA sequences and epigenomic signals, filtering out non-causal regions via an information bottleneck.
State-of-the-art (SOTA) performance in gene expression prediction compared to existing baselines, with evidence that the extracted regulatory elements are more predictive sub-sequences than the peaks called by statistical methods such as MACS3.

The method addresses the challenge of predicting gene expression levels from DNA sequences and epigenomic signals. The epigenomic signals come from experimental assays such as DNase-seq and H3K27ac ChIP-seq. The regulatory elements that influence a target gene are often sparse and can interact with it over long genomic distances.

The framework uses the structural causal model (SCM) to provide a learnable approach for extracting effective regulatory elements, considering both DNA sequences and epigenomic signals, through an information bottleneck mechanism. The regulatory elements are divided into three categories: $R_g$ (regulatory elements with the potential to interact with the target gene), $R_m$ (regulatory elements discovered from measurement), and $R_{ag}$ (regulatory elements actively interacting with the target gene).

The causal relationships between these variables are:

$X_{seq} \longleftarrow R_g$: The DNA sequence ($X_{seq}$) consists of $R_g$ and other non-causal parts.
$R_{ag} \longrightarrow Y$: The causal part $R_{ag}$ directly influences the final gene expression ($Y$).
$R_g \longleftarrow R_{ag} \longrightarrow R_m$: The key causal component $R_{ag}$ is shared by both $R_g$ and $R_m$: it can be detected through epigenomic signals, and it participates in interactions with the target gene.
$R_m \longrightarrow X_{sig}$: The measured elements $R_m$ give rise to $X_{sig}$, which usually contains strong observable signals, such as peaks in DNase-seq.

The task objective is based on the information bottleneck. The latent representation is defined as $Z = M \odot X$, where $M$ is either a binary mask selecting individual DNA bases or a soft mask indicating the importance of each base. The objective becomes:

$L \approx \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{p_{\theta}(m_i|x_i)} [-\log q_{\phi} (y_i|m_i\odot x_i)] + \beta \, \mathrm{KL}[p_{\theta}(m_i|x_i) \,\|\, r(m_i)]$

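where the first term is the task-specific loss, such as mean squared error for gene expression prediction, and the second term constrains the learned mask $m$ by pulling its distribution toward a predefined prior $r(m)$ that does not depend on any specific sequence $x$. The weight $\beta$ in this objective is the bottleneck trade-off coefficient and is distinct from the Beta-distribution parameters $\beta_1$, $\beta_2$, and $\beta_3$ introduced below.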
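
As a rough illustration of how the two terms fit together for a single sequence, the sketch below assumes a one-hot DNA encoding, a squared-error task loss, and a KL value that has already been computed. The names `apply_mask`, `ib_loss`, and `toy_predictor` are hypothetical and are not taken from the Seq2Exp code.

```python
from typing import Callable, List

OneHot = List[List[float]]  # one row of four channels (A, C, G, T) per base


def apply_mask(x: OneHot, m: List[float]) -> OneHot:
    """Compute Z = M ⊙ X by scaling each base's channels by its mask value."""
    return [[m_i * channel for channel in base] for base, m_i in zip(x, m)]


def ib_loss(x: OneHot, m: List[float], y: float,
            predictor: Callable[[OneHot], float],
            kl_value: float, beta_weight: float) -> float:
    """Single-sample objective: squared error plus the weighted KL penalty."""
    z = apply_mask(x, m)
    task_loss = (predictor(z) - y) ** 2
    return task_loss + beta_weight * kl_value


def toy_predictor(z: OneHot) -> float:
    """Hypothetical stand-in for q_phi: sums whatever survives the masking."""
    return sum(sum(base) for base in z)


# Sequence "ACGT" as one-hot rows; the mask keeps base 1 fully and base 3 by half.
x = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
m = [1.0, 0.0, 0.5, 0.0]
loss = ib_loss(x, m, y=1.0, predictor=toy_predictor, kl_value=0.2, beta_weight=0.1)
```

In a real model the predictor would be a neural network and the loss would be averaged over the $N$ training samples.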

The paper assumes that, conditioned on the selection of regulatory elements $M$, the DNA sequences and epigenomic signals are independent, i.e., $p(X_{sig},X_{seq}|M)=p(X_{sig}|M)p(X_{seq}|M)$.

Based on this assumption, the estimation of $p_{\theta}(M|X)$ can be decomposed into terms involving $X_{seq}$ and $X_{sig}$ :

$p_{\theta}(M|X) \propto p_{\theta_1}(M|X_{seq}) p_{\theta_2}(M|X_{sig})$

where $p_{\theta_1}(M|X_{seq})$ and $p_{\theta_2}(M|X_{sig})$ represent the contributions from the DNA sequence and the epigenomic signals, respectively.
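
One way to see where the product form comes from is Bayes' rule, treating $X_{seq}$ and $X_{sig}$ as fixed. The steps below are a sketch; the last simplification assumes a flat prior $p(M)$, which is our assumption and is not stated in this summary:

$p(M|X_{seq},X_{sig}) \propto p(X_{seq},X_{sig}|M)\,p(M) = p(X_{seq}|M)\,p(X_{sig}|M)\,p(M) \propto \frac{p(M|X_{seq})\,p(M|X_{sig})}{p(M)}$

The equality uses the conditional independence assumption, and the final step applies $p(X|M) \propto p(M|X)/p(M)$ to each factor. If $p(M)$ is constant, the right-hand side reduces to the product $p(M|X_{seq})\,p(M|X_{sig})$.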

Each soft mask value $m_s \in [0, 1]$ is assumed to follow a Beta distribution: $m_s \sim \text{Beta}(\alpha, \beta)$.

Given $p_{\theta_1}(m_s|X_{seq}) \sim \text{Beta}(\alpha_1, \beta_1)$ and $p_{\theta_2}(m_s|X_{sig}) \sim \text{Beta}(\alpha_2, \beta_2)$, the product of these two densities is, after normalization, again a Beta density, with parameters:

$p_{\theta_1}(m_s|X_{seq}) p_{\theta_2}(m_s|X_{sig}) \sim \text{Beta}(\alpha_1 + \alpha_2 - 1, \beta_1 + \beta_2 - 1)$
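
The parameter arithmetic follows from multiplying the unnormalized Beta densities and collecting exponents, as this short worked step shows:

$m_s^{\alpha_1-1}(1-m_s)^{\beta_1-1} \cdot m_s^{\alpha_2-1}(1-m_s)^{\beta_2-1} = m_s^{(\alpha_1+\alpha_2-1)-1}(1-m_s)^{(\beta_1+\beta_2-1)-1}$

The right-hand side is the kernel of $\text{Beta}(\alpha_1+\alpha_2-1, \beta_1+\beta_2-1)$, which is a valid distribution only when $\alpha_1+\alpha_2 > 1$ and $\beta_1+\beta_2 > 1$. As an illustrative example with made-up values, combining $\text{Beta}(2, 3)$ with $\text{Beta}(4, 1)$ gives $\text{Beta}(5, 3)$.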

To enforce sparsity, the prior distribution of the soft mask $r(m_s)$ also follows the Beta distribution, i.e., $r(m_s) \sim \text{Beta}(\alpha_3, \beta_3)$, where $\alpha_3$ and $\beta_3$ control the expected sparsity of the mask. The expectation of this prior is $\mathbb{E}[m_s] = \mu = \frac{\alpha_3}{\alpha_3 + \beta_3}$, so a small $\mu$ favors masks that select few bases.
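
Because both the learned mask distribution and the prior are Beta distributions, the KL term has a closed form. The sketch below uses only the Python standard library, so it includes a small hand-written digamma approximation; the function names and the example prior $\text{Beta}(1, 9)$ are illustrative assumptions, not values from the paper.

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def kl_beta(a1: float, b1: float, a2: float, b2: float) -> float:
    """KL[Beta(a1, b1) || Beta(a2, b2)] in closed form."""
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))


def beta_mean(a: float, b: float) -> float:
    """Expectation of Beta(a, b)."""
    return a / (a + b)


# Hypothetical sparse prior with mean 0.1, compared against a denser posterior.
prior_mean = beta_mean(1.0, 9.0)
penalty = kl_beta(5.0, 3.0, 1.0, 9.0)
```

The penalty is zero when the mask distribution matches the prior and grows as the mask becomes denser than the prior allows.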

The model consists of a generator and a predictor. The generator produces the mask distribution $p_\theta(M|X)$ from the DNA sequences and epigenomic signals $X=\{X_{seq},X_{sig}\}$. The predictor $q_\phi(Y|M\odot X)$ predicts gene expression values from the masked sequences $Z=M\odot X$.

The parameters $\alpha_1$ and $\beta_1$ are derived from the DNA sequences using a neural network $f_\theta$ :

$\alpha_1, \beta_1 = f_\theta(X_{seq})$

For the parameters related to epigenomic signals, the epigenomic signal values are used directly as the parameter $\alpha_2$, and the parameter $\beta_2$ is set to a fixed constant:

$\alpha_2 = X_{sig}, \quad \beta_2 = C_{\beta}$

After estimating the parameters, the soft mask $m_s$ is sampled from the combined Beta distribution, $p_{\theta_1}(m_s|X_{seq}) p_{\theta_2}(m_s|X_{sig})\sim \text{Beta}(\alpha_1+\alpha_2-1, \beta_1+\beta_2-1)$. For the hard mask version, a threshold $C_m$ is applied to the soft mask to generate the hard mask, $M = \mathbb{I}(m_s \geq C_m)$.
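
The generator's sampling step can be sketched as follows, assuming the per-base parameters $\alpha_1$, $\beta_1$ have already been produced by some network and that the signal values are positive. The function names and the constants used for `c_beta` and `c_m` are hypothetical.

```python
import random
from typing import List, Tuple


def combine_beta_params(alpha1: float, beta1: float,
                        alpha2: float, beta2: float) -> Tuple[float, float]:
    """Parameters of the normalized product of Beta(alpha1, beta1) and Beta(alpha2, beta2)."""
    return alpha1 + alpha2 - 1.0, beta1 + beta2 - 1.0


def sample_masks(alpha1s: List[float], beta1s: List[float], signal: List[float],
                 c_beta: float, c_m: float,
                 rng: random.Random) -> Tuple[List[float], List[int]]:
    """Draw a soft mask per base, then threshold it into a hard mask."""
    soft, hard = [], []
    for a1, b1, a2 in zip(alpha1s, beta1s, signal):
        a, b = combine_beta_params(a1, b1, a2, c_beta)  # alpha_2 = X_sig, beta_2 = C_beta
        m_s = rng.betavariate(a, b)                     # requires a > 0 and b > 0
        soft.append(m_s)
        hard.append(1 if m_s >= c_m else 0)             # M = I(m_s >= C_m)
    return soft, hard


# Toy values for three bases; the second base has a strong epigenomic signal.
rng = random.Random(0)
soft_mask, hard_mask = sample_masks(
    alpha1s=[1.5, 2.0, 1.2], beta1s=[3.0, 1.5, 4.0],
    signal=[0.5, 8.0, 0.3], c_beta=2.0, c_m=0.5, rng=rng)
```

A larger signal value raises $\alpha_2$ and therefore shifts the combined distribution toward 1, making the base more likely to be selected.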

The extracted sub-sequences are then fed into a second neural network $g_\phi$ to estimate the probability distribution of the target gene expression $Y$. The conditional distribution is expressed as $q_\phi(Y | M \odot X)$.

To optimize the loss function, every step must remain differentiable. The Beta distribution is treated as a special case of the Dirichlet distribution (the case with two categories), and the reparameterization trick is used to achieve differentiable sampling from the Dirichlet distribution. During inference, the expected value of the Beta distribution is used directly as the soft mask $m_s$ for each base pair, $\mathbb{E}[m_s] = \frac{\alpha}{\alpha + \beta}$.
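
A Beta draw can be built from two independent Gamma draws, which is the two-category case of the usual Dirichlet construction. The sketch below shows only that construction and the inference-time mean in plain Python; it does not compute gradients, and the exact reparameterization used by the authors is not described here. The function names are hypothetical.

```python
import random


def sample_beta_via_gamma(alpha: float, beta: float, rng: random.Random) -> float:
    """Draw from Beta(alpha, beta) by normalizing two Gamma(shape, 1) draws."""
    g_alpha = rng.gammavariate(alpha, 1.0)
    g_beta = rng.gammavariate(beta, 1.0)
    return g_alpha / (g_alpha + g_beta)


def inference_mask(alpha: float, beta: float) -> float:
    """Deterministic soft mask used at inference: the Beta mean."""
    return alpha / (alpha + beta)


# The empirical mean of many training-style draws approaches the inference value.
rng = random.Random(0)
draws = [sample_beta_via_gamma(2.0, 6.0, rng) for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
expected_mask = inference_mask(2.0, 6.0)
```

Using the mean at inference removes sampling noise, so repeated predictions on the same input give the same mask.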

The straight-through estimator (STE) is applied to retain differentiability when converting the soft mask $m_s$ into a hard binary mask $M$ .
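
The idea behind the STE can be shown with a hand-written forward and backward pass: the forward pass applies the threshold, while the backward pass pretends the threshold was the identity. This is a toy sketch with a hypothetical linear predictor and squared-error loss, not the authors' implementation.

```python
from typing import List, Tuple


def hard_mask_forward(soft: List[float], c_m: float) -> List[float]:
    """Forward pass: M = I(m_s >= C_m), which has zero gradient almost everywhere."""
    return [1.0 if m_s >= c_m else 0.0 for m_s in soft]


def ste_backward(grad_hard: List[float]) -> List[float]:
    """Backward pass under STE: copy the gradient w.r.t. M onto m_s unchanged."""
    return list(grad_hard)


def toy_loss_and_grad(soft: List[float], x: List[float],
                      y: float, c_m: float) -> Tuple[float, List[float]]:
    """Loss (sum(M * x) - y)^2 and its STE gradient with respect to the soft mask."""
    hard = hard_mask_forward(soft, c_m)
    pred = sum(h * x_i for h, x_i in zip(hard, x))
    loss = (pred - y) ** 2
    grad_hard = [2.0 * (pred - y) * x_i for x_i in x]  # exact dL/dM
    return loss, ste_backward(grad_hard)


loss, grad_soft = toy_loss_and_grad(
    soft=[0.9, 0.2, 0.6], x=[2.0, -1.0, 0.5], y=1.0, c_m=0.5)
```

In automatic differentiation frameworks the same effect is commonly obtained by writing the hard mask as the soft mask plus a gradient-stopped correction, so no custom backward function is needed.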

The model was evaluated by predicting CAGE values for gene expression, focusing on cell types K562 and GM12878. Input data included the HG38 human reference genome, DNase-seq data, H3K27ac ChIP-seq data, and Hi-C data. Evaluation metrics included Mean Squared Error (MSE), Mean Absolute Error (MAE), and Pearson Correlation. The baselines include Enformer, HyenaDNA, Mamba, Caduceus, and EPInformer. The models were trained using a cross-chromosome validation strategy.
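
For reference, the three evaluation metrics can be computed as in the following standard-library sketch. The example values are made up and are not results from the paper.

```python
import math
from typing import List


def mse(y_true: List[float], y_pred: List[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true: List[float], y_pred: List[float]) -> float:
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Made-up expression values for four genes.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]
scores = (mse(y_true, y_pred), mae(y_true, y_pred), pearson(y_true, y_pred))
```

Lower MSE and MAE are better, while a Pearson correlation closer to 1 indicates that predictions track the ranking and scale of the measured expression more closely.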

The results showed that Seq2Exp outperforms the baselines at predicting gene expression measured by CAGE. Furthermore, the sub-sequences selected by Seq2Exp yielded better predictive performance than the regions called by MACS3, suggesting that the model is able to discover regulatory elements.
