
Delay Embedding Theory in Neural Sequence Models

Updated 20 January 2026
  • Delay Embedding Theory is a framework for reconstructing a system’s latent state from time-delayed observations based on Takens’ theorem.
  • Neural sequence models, including transformers and LRUs, utilize delay embeddings to efficiently capture hidden dynamics for forecasting and reconstruction.
  • Empirical findings show that structured models like LRUs achieve superior embedding quality and parameter efficiency, highlighting trade-offs in noise robustness and training time.

The delay embedding theory of neural sequence models posits that the capacity of modern sequence models—such as transformers and state-space architectures—to infer unobserved or latent state from observed time series can be rigorously analyzed in terms of classical results from dynamical systems. Central to this perspective is Takens’ embedding theorem, which guarantees that, under suitable conditions, the full latent state of a dynamical system can be reconstructed from a sufficiently long history of a single observed measurement. This framework provides a quantitative, geometry-informed account for why specific neural architectures succeed or fail in reconstructing and predicting hidden variables from partial, noisy observations (Ostrow et al., 2024).

1. Theoretical Basis: Delay Embedding and Takens’ Theorem

Takens’ delay-embedding theorem states that for a compact $d$-dimensional smooth manifold $M$ and a smooth diffeomorphism $\varphi: M \to M$, the history of a scalar measurement function $h: M \to \mathbb{R}$ forms the $m$-delay vector $F_h^m(x) = (h(x), h(\varphi(x)), \ldots, h(\varphi^{m-1}(x))) \in \mathbb{R}^m$. When $m \geq 2d + 1$, $F_h^m$ is generically an embedding: its image $F_h^m(M)$ is diffeomorphic to $M$, and there exists an inverse (reconstruction map) $\Psi: F_h^m(M) \to M$ such that $\Psi \circ F_h^m = \mathrm{id}_M$. Thus, stacking $m \geq 2d + 1$ consecutive time-lagged samples of a single observed variable suffices to reconstruct the original system’s dynamical state.

The embedding dimension bound $m \geq 2d + 1$ guarantees non-overlapping (injective) reconstruction, but in practice $m$ may be increased for robustness to noise, and the quality of the reconstruction degrades for $m$ below the bound, especially under finite-sample or noisy conditions.
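As a concrete illustration, the delay map $F_h^m$ amounts to stacking lagged copies of a scalar series. The sketch below (the sinusoidal observable and the lag choice are illustrative assumptions, not from the source) embeds a measurement of a 1D latent circle, where $d = 1$ and hence $m \geq 3$ delays suffice generically:

```python
import numpy as np

def delay_embed(x, m, tau=1):
    """Stack m time-lagged copies of a scalar series x into delay vectors.

    Row t of the result is (x[t], x[t + tau], ..., x[t + (m-1) * tau]),
    i.e. the m-delay vector F_h^m evaluated along the trajectory.
    """
    n = len(x) - (m - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)

# A 1D latent circle observed through a scalar projection: d = 1, so
# m >= 2 * 1 + 1 = 3 delays suffice generically.
t = np.linspace(0, 20 * np.pi, 5000)
obs = np.sin(t)                      # scalar measurement h(x(t))
emb = delay_embed(obs, m=3, tau=10)  # 3-delay reconstruction
print(emb.shape)                     # (4980, 3)
```

The rows of `emb` trace out a closed curve diffeomorphic to the latent circle, which is the geometric content of the embedding guarantee.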

2. Connection to Neural Sequence Models

Modern sequence models process time-series data as a sequence of input tokens $u_1, \ldots, u_T$ and compress them into a hidden state $h_T$. Interpreted via delay embedding, the hidden state $h_T$ serves as a learned delay-coordinate embedding that unfolds the latent dynamical structure of the input. Specifically, the hidden state is trained to encode all information from a recent window $u_{T-m+1:T}$ needed both to predict future observations and to reconstruct latent variables.

Empirical studies on the 3D Lorenz attractor demonstrate that even when only a noisy scalar $x(t) + \eta(t)$ is available, constructing $m$-length delay vectors with $m \approx 5$ or $6$ recovers the attractor’s geometry. Neural sequence models, provided with this partial observation, can thus be tested for their ability to learn such embeddings from data (Ostrow et al., 2024).

3. Architectural Realizations and Inductive Biases

3.1 One-Layer Transformer Decoder

A one-layer transformer decoder processes input tokens $\{u_1, \ldots, u_T\}$, applies positional encodings, and uses self-attention to compute output activations. The attention mechanism performs a learned, context-dependent selection over prior positions, implicitly choosing a sparse set of delays. The transformer’s parameter count scales as $O(d_{\text{model}}^2)$, as the weight matrices are full-rank.
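A minimal numpy sketch of the core computation, single-head causal self-attention; the dimensions, random weights, and omission of layer norm and output projection are simplifications for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model = 16, 8                         # sequence length, model width

# Full-rank weight matrices: the O(d_model^2) parameter cost noted above.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

U = rng.normal(size=(T, d_model))          # token embeddings + positions
Q, K, V = U @ Wq, U @ Wk, U @ Wv

scores = Q @ K.T / np.sqrt(d_model)        # (T, T) attention logits
scores[np.triu_indices(T, k=1)] = -np.inf  # causal mask: no future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V                          # context-dependent mix of delays
print(out.shape)                           # (16, 8)
```

Each row of `weights` is a distribution over past positions, which is what "implicitly choosing a sparse set of delays" means operationally: the model must learn which lags to weight, rather than having them built in.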

3.2 Linear Recurrent Unit (LRU) State-Space Model

The LRU is a discrete-time linear state-space model governed by

$$x_{t+1} = A x_t + B u_t, \qquad o_t = C\,\mathrm{Re}(x_t) + D u_t,$$

where $A \in \mathbb{C}^{d_{\text{model}} \times d_{\text{model}}}$ is diagonal with eigenvalues inside the unit disk, and $B$, $C$, $D$ are appropriately sized real matrices. The LRU’s hidden state $x_t$ encodes past inputs through a superposition of complex rotations, preserving a nearly uniform and redundant delay embedding of the input history. The model’s parameter count grows as $O(d_{\text{model}})$.
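The recurrence can be sketched directly from the two equations above; the eigenvalue magnitudes, matrix shapes, and random initialization here are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 6, 1

# Diagonal A with eigenvalues inside the unit disk: lambda = r * exp(i*theta).
r = rng.uniform(0.9, 0.999, d_model)
theta = rng.uniform(0, 2 * np.pi, d_model)
lam = r * np.exp(1j * theta)           # the diagonal of A (complex)

B = rng.normal(size=(d_model, d_in))
C = rng.normal(size=(d_in, d_model))
D = rng.normal(size=(d_in, d_in))

def lru_scan(u):
    """Run the LRU over a sequence u of shape (T, d_in)."""
    x = np.zeros(d_model, dtype=complex)
    outs = []
    for u_t in u:
        outs.append(C @ x.real + D @ u_t)  # o_t = C Re(x_t) + D u_t
        x = lam * x + B @ u_t              # x_{t+1} = A x_t + B u_t (A diagonal)
    return np.array(outs)

u = rng.normal(size=(100, d_in))
print(lru_scan(u).shape)               # (100, 1)
```

Unrolling the loop gives $x_T = \sum_k A^k B u_{T-1-k}$: each input enters the state as a decaying complex rotation, which is the "superposition of complex rotations" acting as a soft, redundant delay embedding. Because $A$ is diagonal, the per-layer parameter count is linear in $d_{\text{model}}$.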

3.3 Inductive Bias Comparison

At initialization, the LRU’s structure inherently captures the entire past via its stable, diagonal $A$-matrix. This aligns with the requirement $m \geq 2d + 1$, as increasing $d_{\text{model}}$ directly increases the effective embedding window. Transformers, by contrast, start with randomly initialized attention weights and must learn to select the relevant delays. Empirically, the LRU’s hidden states exhibit coarse delay-embedding structure even before training, whereas transformer embeddings achieve similar quality only after substantial training.

4. Experimental Methodology

4.1 Synthetic Time-Series Benchmark

Experiments use the Lorenz attractor with parameters $\sigma = 10$, $\rho = 28$, $\beta = 8/3$. Trajectories are generated with added Gaussian noise of variance $\sigma^2_{\text{noise}} \in \{0, 0.05, 0.1\}$ on the observed $x(t)$. A dataset of 2,000 trajectories, each of length 600 (with the initial 100 steps discarded), forms the basis for next-step prediction tasks (Ostrow et al., 2024).
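A sketch of how one such trajectory might be generated; the integrator, step size, and initial-condition distribution are assumptions (the section above does not specify them), while the Lorenz parameters, noise variance, trajectory length, and discarded transient follow the setup described:

```python
import numpy as np

def lorenz_rhs(s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system at state s = (x, y, z)."""
    x, y, z = s
    return np.array([sigma * (y - x), x * (rho - z), x * y - beta * z])

def simulate(s0, n_steps=600, dt=0.01):
    """Integrate the Lorenz system with fixed-step RK4 (illustrative choice)."""
    traj = np.empty((n_steps, 3))
    s = np.asarray(s0, dtype=float)
    for i in range(n_steps):
        k1 = lorenz_rhs(s)
        k2 = lorenz_rhs(s + 0.5 * dt * k1)
        k3 = lorenz_rhs(s + 0.5 * dt * k2)
        k4 = lorenz_rhs(s + dt * k3)
        s = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        traj[i] = s
    return traj

rng = np.random.default_rng(0)
traj = simulate(rng.normal(size=3), n_steps=600)[100:]  # drop 100-step transient
# Observe only x(t), corrupted by Gaussian noise of variance 0.05.
noisy_x = traj[:, 0] + rng.normal(0.0, np.sqrt(0.05), len(traj))
print(noisy_x.shape)  # (500,)
```

Repeating this 2,000 times with fresh initial conditions yields the benchmark dataset; models see only `noisy_x` and must predict its next value.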

4.2 Training Objective

Models are trained with Adam for 1,000 epochs on the Mean Absolute Standardized Error (MASE): $$\text{MASE} = \frac{|x_t - \hat{x}_t|}{|x_t - x_{t-1}|},$$ averaged over $t$. Only model runs with final MASE $< 1$ are analyzed.
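The objective above can be computed as follows (a sketch; note that the naive previous-value forecast scores exactly $1$, so the $< 1$ cutoff keeps only runs that beat that baseline):

```python
import numpy as np

def mase(x, x_hat):
    """Per-step absolute error scaled by the naive previous-value
    forecast error, averaged over time steps t = 1..T-1."""
    num = np.abs(x[1:] - x_hat[1:])
    den = np.abs(x[1:] - x[:-1])
    return np.mean(num / den)

x = np.array([1.0, 2.0, 4.0, 7.0])
perfect = x.copy()                          # exact predictions
naive = np.concatenate([[x[0]], x[:-1]])    # predict the previous value
print(mase(x, perfect))  # 0.0
print(mase(x, naive))    # 1.0
```

This per-step-ratio form follows the formula above; other MASE variants in the literature instead normalize by the mean naive error over the whole series.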

4.3 Embedding-Quality Metrics

Evaluation includes:

  • Nonlinear and linear decoding $R^2$: decoders trained to map the hidden state $o_T$ to the unobserved variables $(y(T), z(T))$.
  • Neighbors Overlap: compares the overlap of nearest neighbors in the true latent state space and the learned hidden-state space.
  • Unfolding (Conditional Variance): assesses how well $o_t$ predicts the conditional variance of future observations.
  • Participation Ratio (PR): quantifies the effective embedding dimension from the hidden-state covariance.
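Of these metrics, the participation ratio has a standard closed form over the covariance eigenvalues, $\mathrm{PR} = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$; the sketch below assumes that definition (the paper's exact computation may differ in detail):

```python
import numpy as np

def participation_ratio(h):
    """Effective dimensionality of hidden states h (shape: samples x dims):
    PR = (sum_i lambda_i)^2 / sum_i lambda_i^2 over covariance eigenvalues.
    Ranges from 1 (all variance in one direction) to dims (isotropic)."""
    lam = np.linalg.eigvalsh(np.cov(h, rowvar=False))
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)
# Isotropic 3D cloud: variance spread evenly, PR close to 3.
iso = rng.normal(size=(5000, 3))
# Nearly 1D cloud: one direction dominates, PR close to 1.
flat = rng.normal(size=(5000, 3)) * np.array([10.0, 0.1, 0.1])
print(participation_ratio(iso))   # close to 3
print(participation_ratio(flat))  # close to 1
```

Intuitively, a PR near 1 means the hidden states occupy an almost one-dimensional subspace, which is why low-PR models are flagged as noise-sensitive in the results below.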

5. Quantitative Performance and Embedding Analysis

  • Parameter efficiency: For a fixed parameter count, the LRU architecture consistently achieves lower MASE than the transformer (e.g., at $d_{\text{model}} \approx 50$, the LRU attains MASE $\approx 0.07$, the transformer $\approx 0.12$).
  • Embedding development: LRU decoders reach high decoding $R^2$ ($\approx 0.7$) almost immediately, with Neighbors Overlap progressing from $\approx 0.4$ to $\approx 0.8$. Transformers require $\gtrsim 200$ epochs to begin matching this embedding quality.
  • Correlation between embedding and prediction: Strong negative correlations are observed: nonlinear decoding $R^2$ vs. MASE ($\approx -0.76$), Neighbors Overlap vs. MASE ($\approx -0.64$), with higher embedding quality tightly linked to better prediction accuracy.
  • Noise sensitivity: Increasing observation noise ($\sigma^2_{\text{noise}}: 0 \to 0.1$) increases the LRU’s MASE by $\approx 40\%$ and the transformer’s by $\approx 20\%$. The LRU’s lower effective embedding dimension (PR $\approx 1.3$) makes it more sensitive to noise than the transformer (PR $\approx 2.5$).
  • Scaling laws: No clear power law between MASE and $d_{\text{model}}$ is observed; performance gains diminish for $d_{\text{model}} \gtrsim 50$ (i.e., values far in excess of $2d + 1$).

6. Conclusions and Model Design Implications

  • Inductive bias of models: State-space models such as LRUs function as near-uniform delay embedders of input history, achieving high-quality geometric embedding and parameter efficiency from initialization. Transformers, while capable of learning viable embeddings, require more parameters and training time due to the need to select informative delays from scratch.
  • Embedding strength and prediction: The tight correlation between embedding quality metrics and next-step prediction accuracy confirms the utility of delay-embedding theory as an analytical framework.
  • Noise and data regime considerations: In low-data or compute-constrained scenarios, structured state-space models with $m \gg 2d + 1$ offer advantages. In highly noisy or nonstationary settings, the transformer’s selective embedding via attention may guard against overfitting to measurement noise.
  • Hybrid architectures: A plausible implication is the potential benefit of hybrid designs combining a state-space backbone for robust large-$m$ embeddings with attention-based mechanisms to refine context relevance.

By synthesizing classical delay-embedding theory with analysis of neural sequence models, this framework enables precise, geometry-aware criteria for evaluating and developing sequence architectures in partially observed dynamical systems tasks (Ostrow et al., 2024).
