
Demo-Driven Video In-Context Learning

Updated 11 February 2026
  • Demo-driven video in-context learning is a paradigm that uses explicit demonstration clips to enable models to learn and generalize on new video tasks without parameter updates.
  • The approach leverages transformer-based, diffusion-based, and multimodal language models to capture complex spatiotemporal dynamics and procedural understanding.
  • Key challenges include demonstration selection, context window limitations, and robust multimodal integration to ensure reliable adaptation and fine-grained procedural transfer.

Demo-driven video in-context learning is a paradigm in which models learn and generalize to new video tasks by conditioning on explicit demonstration clips or procedural examples at inference time. Inspired by early breakthroughs in language in-context learning, this approach extends the concept of learning “from context” to the complex spatiotemporal and multimodal dynamics of video. The emergence of zero- and few-shot competence in contemporary transformer-based and diffusion-based architectures, as they scale in size or adopt carefully designed training and prompting protocols, has established demo-driven video in-context learning as a central direction in modern video understanding, generation, and embodied AI.

1. Formal Definition and Core Objectives

In demo-driven video in-context learning (video ICL), the model receives $k$ demonstration examples—videos, or videos paired with annotations (text, action labels, etc.)—and a target query video. The objective is to generate a continuation, answer, or output conditioned on the demonstration set $D^v = \{(s_i^1, s_i^2, \ldots, s_i^{n_i})\}_{i=1}^{k}$ and query $x^v$, such that the model predicts $P(y^v \mid x^v, D^v)$, where $y^v$ is the desired output (continuation, caption, action label, etc.) (Zhang et al., 2024). No explicit parameter updates are performed; all learning occurs “in-context” via the demonstration information.
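Under the autoregressive view adopted by most of these models, this conditional factorizes token-by-token over the output sequence; a sketch of the standard factorization, consistent with the next-token training described below:

```latex
P\left(y^v \mid x^v, D^v\right) = \prod_{t=1}^{T} P\left(y^v_t \mid y^v_{<t},\, x^v,\, D^v\right)
```

Here $y^v_t$ denotes the $t$-th output token (e.g., a discrete video code) and $y^v_{<t}$ the tokens generated so far.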

Key sub-problems are:

  • Procedural understanding: extracting and applying new stepwise knowledge from demonstrations (e.g., generating a standard operating procedure from workflow videos (Xu et al., 2024)).
  • Action or event imitation: generating video continuations that align with provided visual semantics (e.g., seen in video diffusion models (Liu et al., 2024, Sun et al., 2024)).
  • Multimodal reasoning: leveraging text, audio, or state trajectory information alongside or instead of pure video (e.g., instructional text+video (Dong et al., 9 Feb 2026)).
  • Retrieval-augmented and confidence-aware example selection: automatically identifying and prioritizing relevant or reliable demonstrations to maximize adaptation under context and fidelity constraints (Kim et al., 2024, Fujii et al., 22 Jan 2026).

2. Model Architectures and Training Protocols

Transformer-based Autoregressive Models

Decoder-only transformers trained on discrete video tokens (e.g., VQ-GAN codes) have demonstrated zero-shot video imitation. Here, video frames are compressed spatially, tokenized, and presented sequentially with demonstrations and queries concatenated; the transformer autoregressively predicts the continuation (Zhang et al., 2024). Training is self-supervised, using next-token prediction over video streams with no explicit segmentation into “demos” and “queries,” allowing contextual semantics to emerge naturally.
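The concatenate-and-continue scheme above can be sketched as follows. This is a minimal illustration with a stub in place of the real transformer and VQ-GAN tokenizer; all names and the toy vocabulary size are hypothetical.

```python
def toy_next_token(context):
    """Stand-in for a transformer's next-token prediction
    (hypothetical stub; a real model conditions on the full context)."""
    return (context[-1] + 1) % 1024  # illustrative 1024-code vocabulary

def vid_icl_continue(demo_token_streams, query_tokens, n_new):
    # Demos and the query are simply concatenated into one token stream;
    # no explicit demo/query separators are used, so contextual semantics
    # must emerge from next-token training over raw video streams.
    context = [t for demo in demo_token_streams for t in demo] + list(query_tokens)
    generated = []
    for _ in range(n_new):
        nxt = toy_next_token(context)  # autoregressive decoding step
        context.append(nxt)
        generated.append(nxt)
    return generated

demos = [[3, 4, 5], [10, 11, 12]]   # tokenized demonstration clips
query = [100, 101]                  # tokenized query prefix
continuation = vid_icl_continue(demos, query, 3)
print(continuation)
```

The key point is structural: adaptation happens purely through what is placed in the context window, not through any weight update.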

Diffusion-based Generative Models

For video generation, diffusion models are conditioned on learned representations of demonstration clips (“demo latents”), often via spatio-temporal cross-attention mechanisms or design bottlenecks that factor out appearance and distill action dynamics (Liu et al., 2024, Sun et al., 2024). Architectures such as WALT-U-Net, VideoPrism, and latent VAE backbones are used. In “Action Prism” designs, reference dynamics from demos are extracted, aggregated, and injected into the diffusion U-Net. Training employs reconstruction losses over noisy latents to drive in-context transfer of action and style (Liu et al., 2024). Other methods concatenate scenes temporally/spatially and rely on self-attention layers and LoRA adaptation for demo-to-target transfer (Fei et al., 2024).
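The cross-attention conditioning described here can be sketched in a simplified single-head form, where the denoising features act as queries and the demo latents supply keys and values. Dimensions, the residual injection, and all variable names are illustrative assumptions, not any specific paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def demo_cross_attention(unet_feats, demo_latents, Wq, Wk, Wv):
    # unet_feats: (n_query_tokens, d) features from the noisy video latents
    # demo_latents: (n_demo_tokens, d) encoded demonstration clip
    q = unet_feats @ Wq           # queries from the generation stream
    k = demo_latents @ Wk         # keys from the demonstration
    v = demo_latents @ Wv         # values carrying demo dynamics
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return unet_feats + attn @ v  # residual injection of demo information

rng = np.random.default_rng(0)
d = 16
feats = rng.normal(size=(8, d))
demo = rng.normal(size=(20, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = demo_cross_attention(feats, demo, Wq, Wk, Wv)
print(out.shape)  # same shape as the U-Net features
```

In “Action Prism”-style designs, the demo latents would additionally pass through a bottleneck that strips appearance and retains action dynamics before entering this attention.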

Multimodal LLMs (MLLMs)

MLLMs such as Qwen2.5-VL and OryxViT integrate visual token streams and text via cross-attention, enabling procedural video reasoning with either text or video demonstrations (Dong et al., 9 Feb 2026, Xu et al., 2024). In some cases, visual backbones (e.g., CLIP-ViT) preprocess video while large-scale LLMs (Gemini, GPT-4o-mini, Qwen2.5-VL) model interleaved multimodal contexts and output procedural knowledge (Xu et al., 2024, Song et al., 6 Oct 2025).

3. In-Context Demonstration Integration, Selection, and Prompting

Example Selection Protocols

Given that video LMMs and diffusion transformers are often context-limited (due to token or memory constraints), intelligent selection and ordering of demonstrations is crucial. Several approaches are employed:

  • Similarity-based retrieval: Compute a composite similarity metric between the query and candidate demos (cosine similarity over learned text/video embeddings), balancing text and visual relevance via a parameter $\alpha$ (Kim et al., 2024).
  • Density-uncertainty-weighted sampling: In label-limited settings, select demos that are simultaneously dense (representative) and informative (uncertain) in the embedding space using GMM posteriors and model-estimated zero-shot confidence (Fujii et al., 22 Jan 2026).
  • Confidence-augmented selection: Iteratively select and prompt with batches of top-$k$ demos, re-running inference with new examples if confidence (defined as minimum output token probability) is low, until a reliability threshold is met (Kim et al., 2024).
  • Pseudo-labeling and consensus aggregation: Divide training data into batches, generate pseudo-labels (“pseudo-SOPs”) for each, then aggregate via majority vote or in-context ensemble regularization (Xu et al., 2024).
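The first and third strategies above compose naturally: rank demos by an $\alpha$-blended text/visual similarity, then prompt with successive top-$k$ batches until confidence clears a threshold. A minimal sketch, assuming embedding vectors are precomputed; `run_inference` is a hypothetical stand-in for the actual video LMM call returning an answer and its minimum token probability.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_demos(query_txt, query_vis, demos, alpha=0.5):
    # alpha balances text relevance against visual relevance
    scores = [alpha * cosine(query_txt, d["txt"]) +
              (1 - alpha) * cosine(query_vis, d["vis"]) for d in demos]
    return np.argsort(scores)[::-1]  # most similar first

def confident_answer(query, demos, run_inference, k=2, threshold=0.6):
    order = rank_demos(query["txt"], query["vis"], demos)
    answer = None
    for start in range(0, len(order), k):
        batch = [demos[i] for i in order[start:start + k]]
        answer, conf = run_inference(query, batch)  # conf = min token prob
        if conf >= threshold:       # stop once the output looks reliable
            return answer
    return answer  # fall back to the last attempt
```

Density-uncertainty-weighted sampling would replace `rank_demos` with a scorer combining GMM-density and model-confidence terms; the outer loop is unchanged.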

Prompt Construction and Formatting

  • Multimodal prompts: Structure prompts to interleave video tokens, text instructions, and pseudo-labels, marking ground-truth and noisy labels, and highlighting confidence levels as in “confidence-aware prompting” (Fujii et al., 22 Jan 2026).
  • Joint captioning for multi-scene generation: Form compound textual instructions specifying both demo and target semantics to guide transformer attention (Fei et al., 2024).
  • Procedural instruction conversion: ASR-generated transcripts and visual features are summarized, segmented, and aligned to produce in-context demonstrations, with LLMs refining procedural or action steps (Dong et al., 9 Feb 2026, Song et al., 6 Oct 2025).
  • Segmented visual trajectories: For agentic decision-making in GUI environments, short visually grounded subsequences are recovered from demonstration videos, filtered, labeled with local objectives, and dynamically selected as in-context micro-demos (Liu et al., 6 Nov 2025).
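A confidence-aware multimodal prompt of the kind described above might be assembled as follows. The tag format, field names, and `<video:...>` placeholders are illustrative assumptions, not a format from any of the cited papers.

```python
def build_prompt(demos, query_instruction):
    # Interleave per-demo video placeholders with their labels, flagging
    # whether each label is ground-truth or a pseudo-label and, for the
    # latter, exposing the model's confidence to downstream prompting.
    parts = []
    for d in demos:
        source = ("ground-truth" if d["gold"]
                  else f"pseudo-label (conf={d['conf']:.2f})")
        parts.append(f"<video:{d['clip_id']}> label: {d['label']} [{source}]")
    parts.append(f"<video:query> {query_instruction}")
    return "\n".join(parts)

demos = [
    {"clip_id": "demo_01", "label": "tighten bolt", "gold": True, "conf": 1.00},
    {"clip_id": "demo_02", "label": "attach panel", "gold": False, "conf": 0.71},
]
prompt = build_prompt(demos, "Describe the next procedural step.")
print(prompt)
```

Marking label provenance in the prompt lets the model discount noisy pseudo-labels rather than treating all demonstrations as equally reliable.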

4. Evaluation Methodologies and Empirical Results

Empirical Findings

  • Transformer video models display emergent zero-shot in-context imitation: e.g., a 1.1B parameter Vid-ICL achieves +6.8% Probing-Acc and +1.8% Video-Acc over no-demo baselines. Demos must be “in-class”; random demos degrade performance (Zhang et al., 2024).
  • Video diffusion models with Action Prism in-context modules reduce FVD by up to 36% on open-domain benchmarks and achieve user preference scores above 80% in text-video alignment (Liu et al., 2024).
  • Ensemble and confidence-filtered prompting strategies in workflow SOP settings boost recall by over +6.7% and temporal ordering by +4.2% versus baseline ICL (Xu et al., 2024).
  • Out-of-distribution generalization is substantially improved by demo-driven ICL: VideoICL yields gains of +14–54.6% on OOD benchmarks with iterative, confidence-based prompting (Kim et al., 2024).
  • Visual demonstration trajectories mined and selected from online videos at inference time raise CUA task success by 2–4% over strong text-only methods, with further ablations confirming the necessity of visual-action pairing and local in-context selection (Song et al., 6 Oct 2025, Liu et al., 6 Nov 2025).
  • In hybrid low-resource settings (e.g., industrial video domains), combining density-uncertainty sampling, pseudo-labeling, and confidence-aware ICL enables few-shot adaptation that matches or surpasses baseline accuracy while reducing annotation cost by 90% (Fujii et al., 22 Jan 2026).

5. Representative Approaches

| Framework | Target Task | Demo Integration | Selection/Prompting |
| --- | --- | --- | --- |
| Vid-ICL (Zhang et al., 2024) | Video imitation/allocation | AR Transformer (VQ-VAE) | Concatenated token stream |
| VideoICL (Kim et al., 2024) | OOD video QA/captioning | MLLM | Similarity+confidence-iterative |
| Δ-Diffusion (Sun et al., 2024) | Video generation | Latent diffusion/VideoPrism | Action bottleneck subtraction |
| ICE (Xu et al., 2024) | Workflow SOP extraction | LLM (ICL ensemble) | Batch-majority voting |
| VIOLA (Fujii et al., 22 Jan 2026) | Low-label video ICL | Any MLLM | Density-uncertainty/confidence |
| Demo-ICL (Dong et al., 9 Feb 2026) | Procedural QA/reasoning | OryxViT, Qwen2.5-VL | Two-stage: Video SFT + DPO |
| Watch & Learn (Song et al., 6 Oct 2025) | CUA workflow planning | UI state-action pairs | Web-scale trajectory mining |
| Video Demos (Liu et al., 6 Nov 2025) | GUI agent decision | Visual micro-trajectories | Real-time two-stage selection |

The columns denote each framework's principal application, approach to demonstration integration, and selection/prompting mechanism, as documented in the respective sources.

6. Limitations and Future Directions

Current demo-driven video ICL methods are primarily constrained by:

  • Context window size: Both transformers and video LMMs have fixed context length (e.g., 4K tokens or 32 frames). This limits demonstration diversity and sequence horizon (Zhang et al., 2024).
  • Tokenization granularity: Most pipelines use spatial-only VQ-GANs; designing spatio-temporal tokenizers or latent spaces is an active area for improvement (Zhang et al., 2024).
  • Demonstration reliability: Noisy or random demonstrations can degrade performance; robust selection or adversarial demo filtering is essential (Kim et al., 2024, Zhang et al., 2024, Fujii et al., 22 Jan 2026).
  • Limited modal and procedural scope: Techniques are predominantly single-mode (video or text); joint audio-video-action demonstration, long-horizon/untrimmed sequences, and 3D/physics-based scenarios remain relatively unexplored (Sun et al., 2024, Fei et al., 2024, Dong et al., 9 Feb 2026).
  • Data and compute requirements: Many frameworks require hundreds of millions of video tokens or web-scale trajectory mining; making these approaches annotation- and compute-efficient is a focus (Fujii et al., 22 Jan 2026, Song et al., 6 Oct 2025).
  • Benchmarking and evaluation: New tasks (e.g., Demo-ICL-Bench) reveal the pitfall that open-source and proprietary models still struggle in fine-grained procedural video ICL (Dong et al., 9 Feb 2026).

Active directions include spatio-temporal tokenization, robust and adversarially filtered demonstration selection, joint audio-video-action and long-horizon demonstrations, annotation- and compute-efficient pipelines, and richer procedural benchmarks.

7. Significance and Outlook

Demo-driven video in-context learning consolidates advances from generative modeling, multimodal transformer architectures, and agentic workflow learning around the goal of enabling “learning to learn” directly from video demonstrations. This bridges static video understanding and open-world deployment where models must adapt to novel tasks and environments based on context alone. Ongoing work in benchmark construction, data efficiency, robust procedural transfer, and compositional prompt engineering is critical for closing the gap between human and artificial generalization from demonstration. The field is rapidly evolving, with open-sourced models and benchmarks accelerating evaluation and innovation (Zhang et al., 2024, Kim et al., 2024, Dong et al., 9 Feb 2026).
