- The paper introduces Seq2Exp, a deep learning framework that simultaneously predicts gene expression and learns to discover causal regulatory elements from DNA and epigenomic signals.
- Seq2Exp employs separate generator and predictor modules, using an information bottleneck and causal modeling to extract relevant DNA sequences based on combined sequence and epigenomic information.
- The Seq2Exp framework achieves state-of-the-art performance in gene expression prediction and extracts regulatory elements that are more predictive than those found by traditional peak-calling methods.
The paper introduces Seq2Exp (Sequence to Expression), a novel deep learning framework for predicting gene expression from DNA sequences, explicitly designed to discover and extract regulatory elements. The framework posits a causal relationship between epigenomic signals, DNA sequences, and regulatory elements, integrating these factors to enhance gene expression prediction accuracy.
Seq2Exp decomposes the learning process into two components. The method employs a generator module to learn a token-level mask based on both DNA sequences and epigenomic signals, extracting relevant DNA sub-sequences. A predictor module then uses these extracted sub-sequences to predict gene expression. By applying an information bottleneck, Seq2Exp filters out non-causal components, ensuring that only the most relevant regions are used for prediction.
Key contributions include:
- A framework articulating the causal relationship between epigenomic signals, DNA sequences, target gene expression, and related regulatory elements.
- A method to combine the mask probability distributions derived from DNA sequences and from epigenomic signals, filtering out non-causal regions via an information bottleneck.
- State-of-the-art (SOTA) performance in gene expression prediction compared to existing baselines, together with evidence that the extracted regulatory elements are more predictive sub-sequences than those identified by statistical peak-calling methods such as MACS3.
The method addresses the challenge of predicting gene expression levels from DNA sequences and epigenomic signals. Epigenomic signals are obtained from experimental assays such as DNase-seq and H3K27ac ChIP-seq. The regulatory elements that influence target gene expression are often sparse and can act through long-range interactions.
The framework uses a structural causal model (SCM) to provide a learnable approach for extracting effective regulatory elements from both DNA sequences and epigenomic signals through an information bottleneck mechanism. The regulatory elements are divided into three categories: $R_{g}$ (regulatory elements with the potential to interact with the target gene), $R_{m}$ (regulatory elements discovered from measurement), and $R_{ag}$ (regulatory elements actively interacting with the target gene).
The causal relationships between these variables are:
- $X_{\text{seq}} \longleftarrow R_{g}$: The DNA sequence $X_{\text{seq}}$ consists of $R_{g}$ and other non-causal parts.
- $R_{ag} \longrightarrow Y$: The causal part $R_{ag}$ directly influences the final gene expression $Y$.
- $R_{g} \longleftarrow R_{ag} \longrightarrow R_{m}$: The key causal component $R_{ag}$ is shared by $R_{g}$ and $R_{m}$, meaning that it both participates in interactions with the target gene and can be detected through epigenomic signals.
- $R_{m} \longrightarrow X_{\text{sig}}$: The measured elements $R_{m}$ give rise to $X_{\text{sig}}$, which usually contains strong observable signals, such as peaks in DNase-seq.
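The edges listed above can be written down as a small directed graph to make the role of the actively interacting elements explicit. The following is only an illustrative sketch: the dictionary layout and the helper names `descendants` and `parents` are assumptions introduced here, not part of Seq2Exp.

```python
# Edges of the causal graph as listed above (parent -> children).
# Variable names are plain-string stand-ins for the symbols in the text.
SCM_EDGES = {
    "R_ag": ["R_g", "R_m", "Y"],
    "R_g": ["X_seq"],
    "R_m": ["X_sig"],
}


def descendants(node, edges=SCM_EDGES):
    """Return every variable reachable from `node` along directed edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def parents(node, edges=SCM_EDGES):
    """Return the direct causes of `node`."""
    return {parent for parent, children in edges.items() if node in children}


# R_ag is an ancestor of both observed inputs and of the target.
reachable_from_active = descendants("R_ag")
direct_causes_of_y = parents("Y")
```

In this encoding, both observed inputs (`X_seq`, `X_sig`) and the target `Y` descend from `R_ag`, while `Y` has `R_ag` as its only direct cause, which is why the method tries to recover that component from the two inputs.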
The task objective is based on the information bottleneck. The latent representation is defined as $Z = M \odot X$, where $M$ is either a binary variable controlling the selection of each DNA base or a soft mask indicating the importance of each DNA base. The objective becomes:
$$\mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{p_\theta(m_i \mid x_i)}\left[-\log q_\phi(y_i \mid m_i \odot x_i)\right] + \beta\, \mathrm{KL}\left[p_\theta(m_i \mid x_i),\, r(m_i)\right]$$
where the first term is the task-specific loss, such as the mean squared error for gene expression prediction, and the second term, weighted by $\beta$, constrains the learned mask $m$ by aligning it with a predefined distribution $r(m)$ that does not depend on any specific sequence $x$.
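To see how the first term can reduce to a mean squared error, one common modeling choice is a Gaussian likelihood with fixed variance centered on the predictor output. This choice, the symbol $\sigma^2$, and the notation $g_\phi(\cdot)$ for the predictor output are assumptions made here for illustration; the summary does not state which likelihood is used.

$$-\log q_\phi(y_i \mid m_i \odot x_i) = \frac{\left(y_i - g_\phi(m_i \odot x_i)\right)^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right)$$

With $\sigma^2$ held fixed, minimizing this quantity is equivalent to minimizing the squared error, up to a positive scale factor and an additive constant.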
The paper assumes that, conditioned on the selection of regulatory elements $M$, the DNA sequences and epigenomic signals are independent, i.e., $p(X_{\text{sig}}, X_{\text{seq}} \mid M) = p(X_{\text{sig}} \mid M)\, p(X_{\text{seq}} \mid M)$.
Based on this assumption, the estimation of $p_\theta(M \mid X)$ can be decomposed into terms involving $X_{\text{seq}}$ and $X_{\text{sig}}$:
$$p_\theta(M \mid X) \propto p_{\theta_1}(M \mid X_{\text{seq}})\, p_{\theta_2}(M \mid X_{\text{sig}})$$
where $p_{\theta_1}(M \mid X_{\text{seq}})$ and $p_{\theta_2}(M \mid X_{\text{sig}})$ represent the contributions from the DNA sequence and the epigenomic signals, respectively.
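The step from the independence assumption to this factorization can be spelled out with Bayes' rule. The derivation below is a sketch, with all proportionalities taken with respect to $M$:

$$
\begin{aligned}
p(M \mid X_{\text{seq}}, X_{\text{sig}}) &\propto p(X_{\text{seq}}, X_{\text{sig}} \mid M)\, p(M) \\
&= p(X_{\text{seq}} \mid M)\, p(X_{\text{sig}} \mid M)\, p(M) \\
&\propto \frac{p(M \mid X_{\text{seq}})\, p(M \mid X_{\text{sig}})}{p(M)}
\end{aligned}
$$

If the marginal $p(M)$ is treated as constant in $M$ (a uniform prior, which is an added assumption here rather than something the summary states), the denominator drops out and the product form is recovered.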
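The mask distribution is assumed to follow a Beta distribution: $m_s \sim \mathrm{Beta}(\alpha, \beta)$.
Given $p_{\theta_1}(m_s \mid X_{\text{seq}}) \sim \mathrm{Beta}(\alpha_1, \beta_1)$ and $p_{\theta_2}(m_s \mid X_{\text{sig}}) \sim \mathrm{Beta}(\alpha_2, \beta_2)$, the normalized product of these two densities is again a Beta distribution, with parameters:
$$p_{\theta_1}(m_s \mid X_{\text{seq}})\, p_{\theta_2}(m_s \mid X_{\text{sig}}) \sim \mathrm{Beta}(\alpha_1 + \alpha_2 - 1,\ \beta_1 + \beta_2 - 1)$$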
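The product rule for Beta densities can be checked numerically: if the pointwise product of the two densities is proportional to the combined Beta density, their ratio must be the same at every point in $(0, 1)$. The sketch below uses arbitrary example parameters, and the helper names are introduced here for illustration. It also assumes $\alpha_1 + \alpha_2 > 1$ and $\beta_1 + \beta_2 > 1$, so that the combined parameters are valid.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def product_to_combined_ratio(x, a1, b1, a2, b2):
    """Ratio of the product of two Beta densities to the combined Beta density at x."""
    product = beta_pdf(x, a1, b1) * beta_pdf(x, a2, b2)
    combined = beta_pdf(x, a1 + a2 - 1, b1 + b2 - 1)
    return product / combined


# Arbitrary example parameters; the ratio should not depend on x.
ratios = [product_to_combined_ratio(x, 2.0, 5.0, 3.0, 1.5) for x in (0.1, 0.3, 0.5, 0.9)]
ratio_is_constant = all(math.isclose(r, ratios[0], rel_tol=1e-9) for r in ratios)
```

The constant ratio is the normalizing factor that turns the raw product into a proper density.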
To enforce sparsity, the prior distribution of the soft mask $r(m_s)$ also follows a Beta distribution, i.e., $r(m_s) \sim \mathrm{Beta}(\alpha_3, \beta_3)$, where $\alpha_3$ and $\beta_3$ are related to the sparsity of the mask. The expectation of this Beta distribution is $\mathbb{E}[m_s] = \mu = \frac{\alpha_3}{\alpha_3 + \beta_3}$.
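When both the mask distribution and the prior are Beta distributions, the KL term of the objective has a standard closed form in terms of the log-Beta and digamma functions. The sketch below evaluates it with the standard library only. The `digamma` helper is a simple recurrence-plus-asymptotic-series approximation written for this example, and all function names are assumptions rather than the paper's code.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence + asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift x upward so the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def kl_beta(a_p, b_p, a_r, b_r):
    """KL[Beta(a_p, b_p) || Beta(a_r, b_r)] in closed form."""
    return (
        log_beta_fn(a_r, b_r) - log_beta_fn(a_p, b_p)
        + (a_p - a_r) * digamma(a_p)
        + (b_p - b_r) * digamma(b_p)
        + (a_r - a_p + b_r - b_p) * digamma(a_p + b_p)
    )


# Example: a mask distribution concentrated near 1 against a sparse prior.
# The parameter values are arbitrary illustrations.
penalty = kl_beta(8.0, 2.0, 1.0, 9.0)
```

The penalty is zero when the mask distribution equals the prior and grows as the predicted mask departs from the sparsity level that the prior encodes.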
The model consists of a generator and a predictor. The generator produces the mask distribution $p_\theta(M \mid X)$ from the DNA sequences and epigenomic signals $X = \{X_{\text{seq}}, X_{\text{sig}}\}$. The predictor $q_\phi(Y \mid M \odot X)$ provides gene expression values from the masked sequences $Z = M \odot X$.
The parameters $\alpha_1$ and $\beta_1$ are derived from the DNA sequences using a neural network $f_\theta$:
$$\alpha_1, \beta_1 = f_\theta(X_{\text{seq}})$$
For the parameters related to epigenomic signals, the signal values are used directly as $\alpha_2$, and $\beta_2$ is set to a fixed constant:
$$\alpha_2 = X_{\text{sig}}; \quad \beta_2 = C_\beta.$$
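After estimating the parameters, the soft mask $m_s$ is sampled from the combined Beta distribution, $p_{\theta_1}(m_s \mid X_{\text{seq}})\, p_{\theta_2}(m_s \mid X_{\text{sig}}) \sim \mathrm{Beta}(\alpha_1 + \alpha_2 - 1,\ \beta_1 + \beta_2 - 1)$. For the hard mask version, a threshold $C_m$ is applied to the soft mask to generate the hard mask, $M = \mathbb{I}(m_s \ge C_m)$.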
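The mask-generation steps can be sketched end to end with the standard library. Several details below are assumptions made for illustration and are not taken from the paper: raw network outputs are mapped to positive values with a softplus, the combined parameters are clamped to a small positive `eps` so that they stay valid, and the default values of `c_beta` and `threshold` are placeholders for $C_\beta$ and $C_m$.

```python
import math
import random


def softplus(v):
    """Numerically stable softplus, mapping any real number to a positive value."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def combined_beta_params(seq_outputs, signal, c_beta=1.0, eps=1e-3):
    """Per-base (alpha, beta) of the combined mask distribution.

    seq_outputs: list of (a_raw, b_raw) pairs standing in for the output of f_theta.
    signal: list of non-negative epigenomic signal values, used directly as alpha_2.
    """
    params = []
    for (a_raw, b_raw), sig in zip(seq_outputs, signal):
        alpha1, beta1 = softplus(a_raw), softplus(b_raw)  # assumption: positivity via softplus
        alpha = max(alpha1 + sig - 1.0, eps)              # alpha_1 + alpha_2 - 1, clamped (assumption)
        beta = max(beta1 + c_beta - 1.0, eps)             # beta_1 + beta_2 - 1, clamped (assumption)
        params.append((alpha, beta))
    return params


def sample_masks(params, threshold=0.5, rng=random):
    """Sample a soft mask per base, then threshold it into a hard 0/1 mask."""
    soft = [rng.betavariate(a, b) for a, b in params]
    hard = [1 if m >= threshold else 0 for m in soft]
    return soft, hard


# Two bases: one with no signal, one sitting on a strong signal peak.
rng = random.Random(0)
params = combined_beta_params([(0.0, 0.0), (2.0, -1.0)], [0.0, 5.0])
soft_mask, hard_mask = sample_masks(params, threshold=0.5, rng=rng)
```

Because the signal enters as $\alpha_2$, a base under a strong peak gets a larger combined $\alpha$ and therefore a mask distribution shifted toward 1.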
The extracted sub-sequences are then fed into a second neural network $g_\phi$ to estimate the probability distribution of the target gene expression $Y$. The conditional distribution is expressed as $q_\phi(Y \mid M \odot X)$.
To optimize the loss function, every step must remain differentiable. The Beta distribution is treated as a special case of the Dirichlet distribution, and the reparameterization trick is used to achieve differentiable sampling from the Dirichlet distribution. During inference, the expected value of the Beta distribution is used directly as the soft mask $m_s$ for each base pair, $\mathbb{E}[m_s] = \frac{\alpha}{\alpha + \beta}$.
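The link between the Beta and Dirichlet distributions can be illustrated by drawing a Beta sample as a normalized pair of Gamma draws, which is how a two-component Dirichlet sample is constructed. The sketch below only shows the sampling identity and the inference-time expectation. It is not differentiable, since propagating gradients through the Gamma draws requires an automatic differentiation framework, and the function names are assumptions.

```python
import random


def sample_beta_via_gammas(alpha, beta, rng):
    """Draw from Beta(alpha, beta) by normalizing two independent Gamma draws."""
    g1 = rng.gammavariate(alpha, 1.0)
    g2 = rng.gammavariate(beta, 1.0)
    return g1 / (g1 + g2)


def expected_soft_mask(alpha, beta):
    """Inference-time soft mask: the mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


# Arbitrary example parameters; the sample mean should approach alpha / (alpha + beta).
rng = random.Random(0)
alpha, beta = 2.0, 6.0
draws = [sample_beta_via_gammas(alpha, beta, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

Using the mean at inference removes sampling noise, so the same input always produces the same mask.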
The straight-through estimator (STE) is applied to retain differentiability when converting the soft mask $m_s$ into a hard binary mask $M$.
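The idea behind the straight-through estimator can be shown with hand-written forward and backward passes. Thresholding has zero derivative almost everywhere, so the backward pass instead treats it as the identity and hands the upstream gradient to the soft mask unchanged. This is a minimal sketch with hypothetical function names, a placeholder threshold for $C_m$, and made-up gradient values. In an autodiff framework the same effect is commonly written as `soft + stop_gradient(hard - soft)`.

```python
def hard_mask_forward(soft_mask, threshold=0.5):
    """Forward pass: threshold each soft mask value into 0.0 or 1.0."""
    return [1.0 if m >= threshold else 0.0 for m in soft_mask]


def hard_mask_backward_ste(grad_wrt_hard):
    """Backward pass under STE: pass the gradient through as if thresholding were the identity."""
    return list(grad_wrt_hard)


soft = [0.9, 0.2, 0.55]
hard = hard_mask_forward(soft)                 # discrete mask used by the predictor
upstream = [0.3, -1.2, 0.05]                   # hypothetical dLoss/dM from the predictor
grad_soft = hard_mask_backward_ste(upstream)   # gradient handed to the generator
```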
The model was evaluated by predicting CAGE values for gene expression, focusing on cell types K562 and GM12878. Input data included the HG38 human reference genome, DNase-seq data, H3K27ac ChIP-seq data, and Hi-C data. Evaluation metrics included Mean Squared Error (MSE), Mean Absolute Error (MAE), and Pearson Correlation. The baselines include Enformer, HyenaDNA, Mamba, Caduceus, and EPInformer. The models were trained using a cross-chromosome validation strategy.
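For reference, the three evaluation metrics can be computed as follows. This is a plain standard-library sketch using their textbook definitions; the example values are made up and are not results from the paper.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Made-up values: predictions offset from the targets by a constant 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
scores = (mse(y_true, y_pred), mae(y_true, y_pred), pearson(y_true, y_pred))
```

The constant offset in this example yields nonzero MSE and MAE but a perfect Pearson correlation, which shows why the error metrics and the correlation are reported together.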
The results showed that Seq2Exp outperforms existing methods in gene expression prediction based on CAGE values. Furthermore, the sub-sequences extracted by Seq2Exp were more predictive than the peaks called by MACS3, suggesting that the model can discover effective regulatory elements.