SDForger Synthetic Time Series Framework

Updated 26 December 2025
  • SDForger is a framework that generates synthetic multivariate time series by transforming data into textualized embeddings for LLM-based synthesis.
  • It employs fill-in-the-middle prompts and LoRA fine-tuning to efficiently learn from limited samples while maintaining the original data's statistical properties.
  • The framework achieves high statistical fidelity, as measured by metrics like MDD, ACD, and DTW, and its synthetic data improves downstream forecasting performance.

SDForger is a framework for generating synthetic multivariate time series with LLMs, using a compact tabular embedding that enables efficient, high-fidelity data synthesis. Its central innovation is to transform both univariate and multivariate time series into textualized embeddings. This makes it possible to fine-tune autoregressive LLMs, even with limited computational resources and few real instances, and to conditionally generate new samples that preserve the target data’s statistical and temporal structure (Rousseau et al., 21 May 2025).

1. Problem Formulation and Evaluation Metrics

The problem setting assumes a collection of observed multivariate time series

$$X = \{ X_i \}_{i=1}^I, \quad X_i \in \mathbb{R}^{C \times L}$$

with $C$ channels and sequences of length $L$. The generative task is to create $\tilde{I}$ new instances

$$\tilde X = \{ \tilde X_j \}_{j=1}^{\tilde I}, \quad \tilde X_j \in \mathbb{R}^{C \times L}$$

matching the joint distribution of $X$.

Distributional similarity is quantified primarily via:

  • Marginal Distribution Difference (MDD):

$$\text{MDD} = \frac{1}{CL} \sum_{c=1}^C \sum_{t=1}^L \left| F_{X^c}(x_t) - F_{\tilde X^c}(\tilde x_t) \right|$$

where $F$ denotes empirical CDFs.

  • Autocorrelation Difference (ACD):

$$\text{ACD} = \frac{1}{C} \sum_{c=1}^C \sum_{\tau=1}^{\tau_{\max}} \left| \rho_{X^c}(\tau) - \rho_{\tilde X^c}(\tau) \right|$$

with $\rho(\tau)$ the autocorrelation at lag $\tau$.
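The two metrics above can be sketched in NumPy as follows. This is a minimal illustration rather than the reference implementation: it assumes arrays of shape `(I, C, L)`, compares the empirical CDFs of real and synthetic values at each channel and timestamp on a shared grid (one possible reading of the MDD formula), and averages autocorrelations over instances. The function names and the `n_grid` and `max_lag` parameters are hypothetical.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Fraction of samples that are <= each grid point."""
    s = np.sort(samples)
    return np.searchsorted(s, grid, side="right") / len(s)

def mdd(real, synth, n_grid=32):
    """Mean absolute gap between empirical CDFs, averaged over channels and timestamps."""
    _, C, L = real.shape
    total = 0.0
    for c in range(C):
        for t in range(L):
            r, s = real[:, c, t], synth[:, c, t]
            # Shared evaluation grid covering both samples (an assumption of this sketch).
            grid = np.linspace(min(r.min(), s.min()), max(r.max(), s.max()), n_grid)
            total += np.mean(np.abs(empirical_cdf(r, grid) - empirical_cdf(s, grid)))
    return float(total / (C * L))

def autocorr(x, max_lag):
    """Autocorrelation at lags 1..max_lag for x of shape (I, L), averaged over instances."""
    x = x - x.mean(axis=1, keepdims=True)
    denom = (x ** 2).sum(axis=1)
    denom = np.where(denom == 0, 1.0, denom)  # guard against constant series
    return np.array([((x[:, :-k] * x[:, k:]).sum(axis=1) / denom).mean()
                     for k in range(1, max_lag + 1)])

def acd(real, synth, max_lag=5):
    """Sum over lags of absolute autocorrelation gaps, averaged over channels."""
    C = real.shape[1]
    gaps = [np.abs(autocorr(real[:, c], max_lag) - autocorr(synth[:, c], max_lag)).sum()
            for c in range(C)]
    return float(sum(gaps) / C)
```

Both functions return 0 when the synthetic set equals the real set. Because the autocorrelation is computed on mean-centered series, a constant offset changes MDD but leaves ACD essentially unchanged.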

Additional distance-based metrics employed include Euclidean Distance (ED), Dynamic Time Warping (DTW), and shapelet-based reconstruction error (SHR).
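For the distance-based metrics, a short sketch of Euclidean distance and classic dynamic-programming DTW between two univariate series may help. It assumes an absolute-difference local cost and no warping window; how distances are aggregated over real and synthetic samples is not specified here and is left out.

```python
import numpy as np

def euclidean_distance(a, b):
    """Point-wise Euclidean distance between two equal-length series."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Unlike the Euclidean distance, DTW tolerates local time shifts: a series and a copy with one repeated sample have a DTW distance of zero.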

Forecasting accuracy is evaluated by training predictive models, such as Tiny Time Mixer (TTM), on generated data, real data, or combinations of the two. Performance is measured with standard error metrics (RMSE, MAPE, and MASE). This multi-faceted evaluation benchmarks both statistical similarity and utility for downstream forecasting (Rousseau et al., 21 May 2025).

2. Embedding Construction and Data Representation

Each channel is modeled as a real-valued function of time. Key steps in constructing embeddings are:

  • Functional Basis Projection: Basis functions (usually obtained via Functional PCA or FastICA) are selected per channel. Each channel is projected onto its basis, and the resulting embedding coefficients are concatenated over channels to form a single embedding vector per instance, whose dimension is the total number of basis functions across channels.

  • Normalization: Each sequence is mean-standardized at each timestamp, with the normalization statistics computed across all instances.
  • Tokenization: Each real-valued coefficient is mapped to a decimal string (e.g., “0.1234”) and tokenized using the LLM’s vocabulary, relying solely on the model's native BPE or byte-level encoding.

No further quantization is introduced beyond the parameterization defined by the basis decomposition and LLM tokenizer.
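The embedding steps above can be illustrated with a small NumPy sketch. It makes several assumptions that are not taken from the source: normalization is a z-score per timestamp, ordinary PCA computed by SVD stands in for Functional PCA, every channel uses the same number of components `k`, and coefficients are written with four decimals. All function names are hypothetical.

```python
import numpy as np

def fit_channel_basis(Xc, k):
    """Xc: (I, L) values of one channel. Returns per-timestamp mean, std, and a (k, L) basis."""
    mu = Xc.mean(axis=0)
    sigma = Xc.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # avoid division by zero
    Z = (Xc - mu) / sigma
    # Rows of Vt are the principal directions (PCA stand-in for Functional PCA).
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mu, sigma, Vt[:k]

def embed(X, k):
    """X: (I, C, L). Returns the (I, C*k) embedding table and the per-channel parameters."""
    blocks, params = [], []
    for c in range(X.shape[1]):
        mu, sigma, basis = fit_channel_basis(X[:, c], k)
        blocks.append(((X[:, c] - mu) / sigma) @ basis.T)  # projection coefficients
        params.append((mu, sigma, basis))
    return np.concatenate(blocks, axis=1), params

def to_text(row, decimals=4):
    """Render each coefficient as a fixed-precision decimal string for the LLM tokenizer."""
    return [f"{v:.{decimals}f}" for v in row]
```

Each row of the returned table is one instance's embedding vector. The per-channel parameters are kept so that generated coefficients can later be mapped back to the time domain.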

3. LLM Fine-Tuning and Prompt Engineering

  • Prompt Structure: Each embedding vector is reordered by a random permutation of its features and converted into a fill-in-the-middle text prompt built by concatenating the permuted feature–value strings. Randomizing the permutation mitigates position bias.

  • Optimization Objective: LLM parameters are optimized to minimize cross-entropy over the combined prompt+target token sequence. With LoRA fine-tuning, only the low-rank adapter matrices are tuned, with the pretrained weights frozen.
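For reference, the standard forms of the two ingredients named above, next-token cross-entropy and the LoRA reparameterization, are written out below in generic notation. The symbols ($s_t$ for tokens, $T$ for sequence length, $W_0$, $A$, $B$, and the rank $r$) are assumptions of this sketch and may differ from the notation used in the paper.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(s_t \mid s_{<t}\right),
\qquad
W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

In this formulation only $A$ and $B$ receive gradient updates while $W_0$ stays fixed, which keeps the number of trainable parameters small.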

This workflow enables SDForger to adapt general-purpose LLMs for high-quality time series synthesis using only a small number of embedding rows (Rousseau et al., 21 May 2025).
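The prompt construction described in this section can be sketched as follows. The exact template is not reproduced here, so the `Input:`/`Target:` layout, the `value_i` feature names, and the `[blank]` and `[sep]` markers are purely hypothetical. The sketch only illustrates the idea of a randomly permuted fill-in-the-middle prompt.

```python
import random

def build_prompt(values, rng, decimals=4):
    """Turn one embedding vector into a permuted fill-in-the-middle training string."""
    names = [f"value_{i}" for i in range(len(values))]
    order = list(range(len(values)))
    rng.shuffle(order)  # a fresh random feature order for every prompt
    # The input part lists the features with blanks, the target part fills them in.
    blanks = ", ".join(f"{names[i]} is [blank]" for i in order)
    filled = " [sep] ".join(f"{names[i]} is {values[i]:.{decimals}f}" for i in order)
    return f"Input: {blanks} Target: {filled}"
```

For example, `build_prompt([0.5, -1.25, 2.0], random.Random(0))` yields a string with three blanks followed by the three values in the same shuffled order. Re-sampling the order for every row exposes the model to many feature orderings.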

4. Synthetic Sequence Sampling and Inverse Mapping

  • Autoregressive Sampling: At inference, SDForger populates a template with blank value slots and prompts the LLM to sample stepwise in token space, using either top-$k$ selection (restricting to the $k$ highest-probability tokens) or nucleus selection (restricting to the smallest token set whose cumulative probability reaches $p$).
  • Inverse Decoding: The generated text is parsed to extract feature–value pairs, which are mapped back to the real domain by reconstructing the original channels using the corresponding basis expansion, that is, a linear combination of each channel's basis functions weighted by the generated coefficients.

Concatenation across all channels yields a synthetic instance $\tilde X_j \in \mathbb{R}^{C \times L}$.

  • Post-processing Filters: Outputs are filtered to remove NaNs, duplicates, and norm-based outliers before acceptance.

This process yields synthetic sequences with statistics and dynamics aligned to the original distribution.
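The two token-selection rules mentioned for autoregressive sampling can be sketched in NumPy. This is a generic illustration of top-k and nucleus (top-p) filtering over a single logit vector, not the framework's own sampling code. In practice an LLM library's generation routine would normally handle this step, and the function names here are hypothetical.

```python
import numpy as np

def filter_probs(logits, top_k=None, top_p=None):
    """Softmax the logits, zero out tokens excluded by top-k and/or top-p, renormalize."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)  # token indices, most probable first
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False
    if top_p is not None:
        cum = np.cumsum(probs[order])
        # Smallest prefix of tokens whose cumulative probability reaches top_p.
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        keep[order[cutoff:]] = False
    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()

def sample_token(logits, rng, top_k=None, top_p=None):
    """Draw one token id from the filtered distribution."""
    p = filter_probs(logits, top_k, top_p)
    return int(rng.choice(len(p), p=p))
```

With token probabilities 0.5, 0.3, 0.15, and 0.05, `top_k=2` and `top_p=0.7` both keep only the first two tokens and renormalize them to 0.625 and 0.375.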

5. Statistical Fidelity and Downstream Use

SDForger’s fidelity is established by:

  • Feature-based metrics: MDD (marginal), skewness, kurtosis, and ACD (autocorrelation differences).
  • Distance-based metrics: Euclidean distance, DTW, and SHR via shift-invariant dictionary learning.
  • Multivariate Structure: Cross-covariances between channels and time-lagged dependencies are also computed.
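One way to check the multivariate structure is to compare lagged cross-covariance matrices of real and synthetic data. The sketch below assumes arrays of shape `(I, C, L)`, centers each series over time, and averages over instances and time. The estimator and the summary score are illustrative choices, not the paper's definitions.

```python
import numpy as np

def lagged_cross_cov(X, lag=0):
    """(C, C) matrix whose entry (a, b) is the covariance of channel a at t and channel b at t+lag."""
    I, C, L = X.shape
    Xc = X - X.mean(axis=2, keepdims=True)  # center every series over time
    head = Xc[:, :, : L - lag]
    tail = Xc[:, :, lag:]
    # Sum products over instances and time, then normalize by the number of terms.
    return np.einsum("iat,ibt->ab", head, tail) / (I * (L - lag))

def cross_cov_gap(real, synth, lag=0):
    """Mean absolute difference between real and synthetic cross-covariance matrices."""
    return float(np.abs(lagged_cross_cov(real, lag) - lagged_cross_cov(synth, lag)).mean())
```

At lag 0 the matrix is symmetric, and a gap of zero means the synthetic set reproduces the real cross-channel covariances under this estimator.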

For downstream forecasting, TTM is evaluated under four regimes: (1) zero-shot, (2) trained on real data only, (3) trained on synthetic data only, and (4) trained on real and synthetic data combined. Accuracy is quantified on held-out test sets (RMSE, MAPE, MASE). Results show that synthetic data from SDForger alone often matches the utility of real data, with mixed training yielding further gains.
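The three error metrics named above have standard definitions, sketched here in NumPy. The MASE scaling uses the in-sample mean absolute error of a naive lag-`m` forecast on the training series. The seasonal period `m=1` is an assumption, and MAPE as written requires non-zero targets.

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):
    """Mean absolute percentage error, in percent (targets must be non-zero)."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return float(np.mean(np.abs((y - yhat) / y)) * 100.0)

def mase(y, yhat, y_train, m=1):
    """Mean absolute error scaled by the naive lag-m forecast error on the training series."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(y - yhat)) / scale)
```

A MASE below 1 means the forecast beats the naive baseline on average, which makes it convenient for comparing the four training regimes across datasets with different scales.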

6. Comparative Perspective and Algorithmic Features

Compared to GANs and VAEs, SDForger:

  • Leverages LLM adaptation via textual prompts, not from-scratch training.
  • Handles long sequences with a cost that is agnostic to the sequence length $L$.
  • Supports few-shot adaptation with minimal data.
  • Enables textual conditioning for semantic or channel-specific control (e.g., “Condition: data is temperature”).

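Algorithmic summaries, restated from the sections above:

  • Fine-tuning: build the embedding table from the real series (normalization and per-channel basis projection), convert each row into a randomly permuted fill-in-the-middle text prompt, and fine-tune the LLM with LoRA on these prompts.

  • Generation: prompt the fine-tuned LLM with a blank template, sample with top-$k$ or nucleus selection, parse the generated text into embedding coefficients, map them back through the basis expansion, and filter out NaNs, duplicates, and norm outliers.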
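The generation back-end described in Section 4 (parsing, filtering, and inverse mapping) can be sketched as below. Everything format-specific is an assumption: the `value_i is <number>` text pattern, the per-channel `(mu, sigma, basis)` parameters, the rounding used to detect duplicates, the Euclidean-norm threshold `max_norm`, and the choice to filter in embedding space before reconstruction.

```python
import re
import numpy as np

# Hypothetical text format: "value_3 is -0.1234"
PAIR = re.compile(r"value_(\d+) is (-?\d+(?:\.\d+)?)")

def parse_embedding(text, dim):
    """Extract feature-value pairs; return None unless every feature 0..dim-1 is present."""
    found = {int(i): float(v) for i, v in PAIR.findall(text)}
    if set(found) != set(range(dim)):
        return None
    return np.array([found[i] for i in range(dim)])

def reconstruct(e, params):
    """Map coefficients back to a (C, L) series via each channel's basis expansion."""
    channels, start = [], 0
    for mu, sigma, basis in params:  # basis has shape (k, L)
        k = basis.shape[0]
        z = e[start:start + k] @ basis  # linear combination of basis functions
        channels.append(z * sigma + mu)  # undo the per-timestamp standardization
        start += k
    return np.stack(channels)

def decode_and_filter(texts, dim, params, max_norm):
    """Keep parseable, finite, non-duplicate, non-outlier samples and reconstruct them."""
    accepted, seen = [], set()
    for text in texts:
        e = parse_embedding(text, dim)
        if e is None or not np.all(np.isfinite(e)):
            continue
        key = tuple(np.round(e, 4))
        if key in seen or np.linalg.norm(e) > max_norm:
            continue
        seen.add(key)
        accepted.append(reconstruct(e, params))
    return accepted
```

Generated strings that are incomplete, repeat an earlier sample, or have an unusually large coefficient norm are dropped, so the number of accepted instances can be smaller than the number of generations requested.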

7. Constraints, Extensions, and Outlook

Limitations include the need for a moderate embedding dimension relative to sample size to prevent overfitting and unstable synthesis. An excessive embedding dimension slows LLM convergence and degrades sample quality.

Potential extensions:

  • Encoder-only LLMs (e.g., BERT) with masking for imputation.
  • Adaptive, data-driven basis selection per channel.
  • Joint time series–text pretraining for richer multimodal generation (e.g., incorporating event annotations).
  • Feeding synthetic data back into LLM pretraining to enhance in-context forecasting ability (Rousseau et al., 21 May 2025).

SDForger establishes a workflow for high-fidelity, few-shot, multimodally conditioned synthetic time series generation, leveraging the infrastructure and text reasoning of modern LLMs.
