
InstructTime: Time Series with Instructions

Updated 28 January 2026
  • InstructTime is a framework that reconceptualizes time series tasks as instruction-conditioned, multimodal sequence generation or embedding problems.
  • It leverages vector-quantized tokenization, cross-modal alignment, and reward-driven reinforcement learning to enhance classification, editing, and reasoning.
  • The approach achieves state-of-the-art performance on benchmarks, enabling applications in clinical reasoning, timeline assembly, and temporal QA.

InstructTime refers to a family of methods, model architectures, and benchmarks uniting instruction-driven approaches for time series learning, generation, reasoning, and editing. The InstructTime paradigm repositions traditional time series tasks—classification, editing, assembly, question answering, and temporal reasoning—as instruction-conditioned, multimodal sequence generation or embedding problems. This methodology enables models to jointly process raw time-series data, contextual meta-data, and explicit textual instructions, leveraging the semantics and generalization ability of LLMs and multimodal LLMs. Core innovations include vector-quantized tokenization, cross-modal alignment, reward-driven RL, structured outputs, and the use of natural-language instructions for framing both analytic and generative time series tasks. The InstructTime framework supports applications ranging from interpretable clinical reasoning and visual timeline assembly to controlled time series editing and open-ended temporal QA, achieving state-of-the-art performance on diverse temporal benchmarks.

1. Instruction-Driven Reformulation of Time Series Tasks

Traditional time series classification (TSC) employs a discriminative mapping from $\mathbb{R}^{L \times H} \rightarrow \{c_1, \dots, c_K\}$ using cross-entropy over one-hot labels, a formulation that obscures class similarity and limits cross-domain transfer. InstructTime reformulates TSC as a conditional language generation problem: the model receives a continuous sequence $X \in \mathbb{R}^{L \times H}$, contextual features $z$, and a textual prompt $I$, and produces the class label as a string $y$ via an LLM-based generator. This approach provides several advantages:

  • Semantic alignment between classes: class relationships can be learned via textual descriptions rather than orthogonal vectors.
  • Instruction tuning: task- or domain-specific guidance is provided as natural language prompts, enabling rapid adaptation and injection of domain knowledge.
  • Multimodality: the time series (after discretization) and text are processed together, allowing for flexible joint modeling (Cheng et al., 2024, Cheng et al., 21 Jan 2026).

The paradigm generalizes beyond classification to editing (instruction-based transformation), timeline assembly (instructional sequence manipulation over visual/temporal segments), and temporal question answering, always structuring the problem as instruction $\rightarrow$ (multi-modal input) $\rightarrow$ free-form, structured, or edited output (Wang et al., 13 Mar 2025, Pardo et al., 2024, Qiu et al., 2 Aug 2025).
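A minimal sketch of this instruction-conditioned framing follows; the field names and the `<ts_*>` special-token convention are illustrative assumptions, not the format used in the cited papers:

```python
# Sketch only: serialize one classification instance into an
# instruction-conditioned prompt. The prompt fields and <ts_*> markup
# are hypothetical; the discretization itself is described in Section 2.

def build_prompt(instruction: str, context: str, ts_tokens: list) -> str:
    """Concatenate instruction I, context z, and discretized series tokens."""
    ts_part = " ".join(f"<ts_{t}>" for t in ts_tokens)
    return (f"Instruction: {instruction}\n"
            f"Context: {context}\n"
            f"Series: {ts_part}\n"
            f"Label:")

prompt = build_prompt(
    instruction="Classify the heart rhythm in this ECG segment.",
    context="Patient age 63, lead II, 500 Hz.",
    ts_tokens=[17, 4, 203, 4, 88],
)
```

The LLM then generates the label as free text after `Label:`, so class names share token-level semantics instead of being orthogonal one-hot vectors.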

2. Multimodal Representation and Alignment

A fundamental challenge for instruction-driven modeling is the modality gap between continuous time series and discrete lexical tokens. The InstructTime methodology employs vector-quantized (VQ) autoencoders to convert $X$ into discrete tokens $t = (t_1, \dots, t_P) \in \{1, \dots, K\}^P$, where $K$ is the codebook size and $P$ is the number of temporal patches. Each $t_p$ is embedded and (optionally) position-encoded, then projected via an alignment MLP to match the LLM's embedding space (Cheng et al., 2024, Cheng et al., 21 Jan 2026):

$$h_p = W_e e_p + b_e, \qquad e_p \in \mathbb{R}^d, \quad h_p \in \mathbb{R}^{d_\text{model}}$$

This fine-tuned projection ensures discrete time-series tokens and text tokens are compatible input streams for a shared Transformer backbone. Inputs are concatenated (instruction, context, time series tokens) and fed to the LM, supporting both cross-domain generalization and task adaptation.
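The projection step can be sketched in a few lines; the dimensions and the random NumPy stand-ins for learned weights are assumptions for illustration:

```python
import numpy as np

# Sketch of the alignment step h_p = W_e e_p + b_e: look up VQ codebook
# embeddings for the discrete tokens, then project them into the LM
# embedding space. All weights are random stand-ins for learned parameters.
rng = np.random.default_rng(0)

K, d, d_model, P = 256, 64, 768, 12       # codebook size, VQ dim, LM dim, patches
codebook = rng.normal(size=(K, d))        # VQ codebook (rows are candidate e_p)
W_e = 0.02 * rng.normal(size=(d_model, d))
b_e = np.zeros(d_model)

token_ids = rng.integers(0, K, size=P)    # discrete tokens t_1, ..., t_P
e = codebook[token_ids]                   # (P, d): embedded tokens e_p
h = e @ W_e.T + b_e                       # (P, d_model): LM-compatible stream
```

After this projection, `h` can be concatenated with text-token embeddings and fed to the shared Transformer backbone.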

For editing, the InstructTime model learns a shared normalized embedding space for both textual instructions and encoded time series via a contrastive (InfoNCE) loss, and then decodes edited series from interpolated embeddings, enabling fine-grained, instruction-conditioned edits and continuous control over edit strength (Qiu et al., 2 Aug 2025).
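A compact sketch of a symmetric InfoNCE objective over L2-normalized instruction and series embeddings; the batch size and temperature are assumptions, not values from the paper:

```python
import numpy as np

# Sketch of a symmetric InfoNCE loss aligning instruction embeddings and
# time-series embeddings in a shared normalized space. Matching pairs sit
# on the diagonal of the (B, B) similarity matrix.

def info_nce(text_emb, ts_emb, tau=0.07):
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = ts_emb / np.linalg.norm(ts_emb, axis=1, keepdims=True)
    logits = t @ s.T / tau                          # cosine similarities / temp
    idx = np.arange(len(t))
    # cross-entropy with the diagonal as the positive class, both directions
    lp_t2s = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_s2t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(lp_t2s[idx, idx].mean() + lp_s2t[idx, idx].mean()) / 2
```

Interpolating between the normalized embedding of a series and that of its instructed edit would then provide the continuous control over edit strength described above.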

3. Supervised and Reinforcement Training Pipelines

InstructTime frameworks rely on either fully supervised (maximum likelihood, cross-entropy) or hybrid supervised plus reinforcement learning (RL) for optimizing instruction-conditioned sequence models:

  • Supervised Pre-training: Cross-domain auto-regressive generative pre-training is used to align representations, build transferability, and encode label semantics into the model (Cheng et al., 2024, Cheng et al., 21 Jan 2026).
  • Structured Output Templates: For reasoning and classification, outputs are forced into a strict three-block XML format: a reasoning block (free-form reasoning), <class> (single label), and <extension> (optional task-specific extension), improving interpretability and downstream parsing (Zhang et al., 16 Jun 2025).

  • Token-Level RL (GRPO): A composite reward aligns format adherence, hard supervision (exact classification), and soft (LLM-judged) open-ended quality, using group sampling and normalization for stability. The Group Relative Policy Optimization (GRPO) procedure supports targeted improvements in format and extension without policy collapse (Zhang et al., 16 Jun 2025).

In assembly and editing, loss functions combine autoregressive log-likelihood for output token sequences (timeline edits) or a joint contrastive and reconstruction loss for cross-modal editing (Pardo et al., 2024, Qiu et al., 2 Aug 2025).
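The GRPO step described above centers on group-relative advantage normalization; a minimal sketch follows, with illustrative reward values that are not taken from the paper:

```python
import numpy as np

# Sketch of GRPO-style advantage computation: sample G completions per
# prompt, score each with the composite reward (format adherence + hard
# label match + soft LLM-judge quality), then normalize within the group.

def group_advantages(rewards, eps=1e-8):
    """Zero-mean, unit-scale advantages relative to the sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# illustrative composite rewards for G = 4 sampled completions
rewards = [
    0.2 + 1.0 + 0.6,  # good format, correct label, decent extension
    0.2 + 0.0 + 0.4,  # good format, wrong label
    0.0 + 0.0 + 0.1,  # broken format
    0.2 + 1.0 + 0.9,  # good format, correct label, strong extension
]
adv = group_advantages(rewards)  # completions above the group mean get adv > 0
```

Because advantages are computed relative to the sampled group rather than a learned value baseline, a single badly formatted sample is penalized without destabilizing the policy update.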

4. Implicit Feature Integration and Enrichment

While discretized representations and alignment projections enable handling arbitrary time series, LLMs lack an inherent inductive bias for explicit temporal patterns such as trend, periodicity, or event structure. InstructTime++ augments raw and context input with implicit features:

  • Statistical Feature Extraction: Summary statistics (mean, variance, skewness, entropy, trend coefficients) are extracted from $X$ and converted into natural-language sentences appended to the prompt (Cheng et al., 21 Jan 2026).
  • Vision-Language Image Captioning: Rendered signal plots (2D line graphs) are captioned via pretrained image-LLMs (CLIP, VinVL), providing an additional textual summary of shape or pattern.

This enriched prompt enables the LLM to leverage both automatically computed temporal features and high-level visual context, improving generalization and performance (e.g., +5–10 points F1 over non-enriched baselines) (Cheng et al., 21 Jan 2026).
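The statistical-enrichment step can be sketched as follows; the sentence template and the particular statistics chosen are assumptions for illustration, not the authors' exact phrasing:

```python
import numpy as np

# Sketch: compute summary statistics from a raw series and render them as
# a natural-language sentence to append to the prompt. The wording of the
# template is hypothetical.

def describe_series(x):
    x = np.asarray(x, dtype=float)
    slope = np.polyfit(np.arange(len(x)), x, 1)[0]   # linear trend coefficient
    direction = "upward" if slope > 0 else "downward"
    return (f"The series has mean {x.mean():.2f}, variance {x.var():.2f}, "
            f"and an overall {direction} trend.")

sentence = describe_series([1.0, 1.4, 1.1, 1.9, 2.3, 2.0])
```

The resulting sentence is simply concatenated with the instruction and discretized series tokens before being fed to the LLM.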

5. Evaluation Benchmarks and Empirical Performance

The InstructTime paradigm has set new baselines on several structured benchmarks:

  • TimerBed: Aggregate of six real-world tasks (RCW, TEE, ECG, EMG, HAR, CTU) with complex and probabilistic temporal structure (Zhang et al., 16 Jun 2025).
  • VIST-A and VID-A: Visual timeline assembly; models must edit image/video sequences as instructed in natural language (Pardo et al., 2024).
  • TIMEBench: Video LLMs evaluated on five distinct facets of temporal understanding (dynamic, reasoning, duration, location, order), with new debiasing and shortcut-filtering protocols (Wang et al., 13 Mar 2025).
  • EngineMT-QA: Large-scale, multi-task time series QA; performance measured on open-ended and classification QA (Wang et al., 25 Jun 2025).

Key quantitative advances:

| Model | Main Task | Best F1/Accuracy | Performance vs. Prior/SOTA | Reference |
|---|---|---|---|---|
| InstructTime-Adapt | EEG, ECG, HAR | 0.624/0.554/0.931 | +7–9% F1 over best non-instructive | (Cheng et al., 21 Jan 2026, Cheng et al., 2024) |
| InstructTimeRL | TimerBed | 75.3% (avg. acc.) | +14.6 pp over classical, +7.3 pp over GPT-4o | (Zhang et al., 16 Jun 2025) |
| Timeline Assembler | VIST-A, VID-A | 74–82% EM (13B) | +25–33 pp over GPT-4o, LLaVA | (Pardo et al., 2024) |
| ITFormer-7B | EngineMT-QA | 88.7% F1 (reasoning) | +30–40 pp over Time-LLM & GPT-4o | (Wang et al., 25 Jun 2025) |
| InstructTime++ | EEG, ECG | 0.674/0.649 (F1) | +5–10 pp over strong baselines | (Cheng et al., 21 Jan 2026) |
| InstructTime Edit | Synthetic Editing | ΔDTW, RaTS, MSE | Best editability/preservability | (Qiu et al., 2 Aug 2025) |

Ablations reveal the necessity of instruction tuning, discretization, alignment layers, enrichment with implicit features, and structured outputs for robust cross-domain and compositional generalization (Cheng et al., 21 Jan 2026, Zhang et al., 16 Jun 2025, Pardo et al., 2024).

6. Applications and Generalizability

The InstructTime framework encompasses a broad application spectrum:

  • Clinical Temporal Reasoning: In TIMER, instruction tuning with precise time anchoring enables LLMs to outperform base models by 7–9% on longitudinal EHR benchmarks. Distributional control over temporal sampling addresses recency and mid-sequence info loss (Cui et al., 6 Mar 2025).
  • Visual and Temporal Assembly: Instructed sequence generation matches or surpasses human and LLM "oracle" edit strategies without specialized instruction dispatchers or extensive hand labeling (Pardo et al., 2024).
  • Time Series Editing: The editing variant generalizes to unseen instructions or attribute levels (few-shot adaptation) and supports smooth interpolation between original and edited states, outperforming diffusion-based editors in both edit fidelity and attribute preservation (Qiu et al., 2 Aug 2025).
  • Temporal QA: Adapter-based QA frameworks (e.g., ITFormer) outperform fully fine-tuned and LLM-based baselines on both open-ended and structured temporal question answering, with <1% additional learned parameters (Wang et al., 25 Jun 2025).
  • Temporal-Reasoning Video LLMs: Multi-task, instruction-driven fine-tuning on unsupervised auxiliary temporal tasks improves dynamic, reasoning, and order understanding, without sacrificing general video QA performance (Wang et al., 13 Mar 2025).

Generalizable lessons include the centrality of structured outputs for parseability, multi-component reward optimization for interpretability and accuracy, groupwise advantage normalization for RL stability, and the efficacy of small supervised "warm-up" datasets for accelerating downstream task acquisition (Zhang et al., 16 Jun 2025, Cheng et al., 21 Jan 2026).

7. Limitations and Future Directions

Despite consistent empirical gains, current implementations face limitations:

  • Temporal Arithmetic and Symbolic Reasoning: Existing architectures lack explicit operators for interval or date calculations, which hampers multi-hop reasoning and duration estimation (Cui et al., 6 Mar 2025).
  • Contextual Side-Information: Most current systems use static prompt templates for context; dynamic prompts or context encoders for heterogeneous side data remain underexplored (Cheng et al., 2024).
  • Model Scaling and Multimodality: Initial InstructTime work relied on relatively small backbone LMs (e.g., GPT-2); further gains may accrue from scaling to LLaMA or GPT-3 class models (Cheng et al., 2024).
  • Hybridization: Combining tokenized time series with classical deep models (transformers, CNNs) may yield further improvements through hybrid representations (Cheng et al., 2024, Cheng et al., 21 Jan 2026).
  • Expansion to Other Temporal Domains: The instruction-driven paradigm has potential for legal, incident, and financial logs, and presents an open direction for regulated temporal-consistency losses or symbolic arithmetic integration (Cui et al., 6 Mar 2025).

These advances collectively establish InstructTime as a foundation for robust, interpretable, and cross-domain instruction-tuned time series modeling and reasoning in modern multimodal AI systems.
