
Demo-ICL-Bench: Procedural Video In-Context Learning

Updated 13 February 2026
  • Demo-ICL-Bench is a benchmark that evaluates the ability of multimodal large language models to rapidly learn procedural video tasks through demonstration-driven in-context learning.
  • It integrates both text-based and video-based demonstrations across diverse domains like cooking, crafting, and home repair to ensure comprehensive procedural understanding.
  • The benchmark uses detailed metrics such as Demo Acc, Δ_ICL, and demonstration selection accuracy to set new empirical baselines and highlight open research challenges.

Demo-ICL-Bench is a benchmark specifically constructed to evaluate the capacity of Multimodal LLMs (MLLMs) to perform “Demo-driven Video In-Context Learning” (ICL) – that is, to acquire novel procedural knowledge from a handful of demonstrations (textual or video) and immediately apply, generalize, and reason about this knowledge in new, unseen video contexts. Introduced by Dong et al., Demo-ICL-Bench addresses a major limitation in existing video-language understanding benchmarks, which predominantly interrogate static pre-trained knowledge or general video question answering, and rarely probe the model’s ability to rapidly adapt by observing, imitating, and learning from concrete demonstrations in context. Demo-ICL-Bench consists of 1,200 high-quality, human-verified questions associated with instructional YouTube videos, with both text-based and video-based demonstration regimes, and includes a demonstration-selection task probing robust demo retrieval and matching. The benchmark, alongside the Demo-ICL model (a variant of Ola-Video with a tailored two-stage training recipe), establishes new empirical baselines and exposes open research challenges for procedural video understanding and interactive learning (Dong et al., 9 Feb 2026).

1. Benchmark Construction and Task Formulation

Demo-ICL-Bench is constructed from the HowTo100M corpus, comprising 1.2 million narrated instructional YouTube clips (~23,000 distinct tasks). Filtering yields candidate videos that (i) span 1–20 minutes, (ii) have English ASR-detected subtitles with word-level timestamps (using WhisperX/HTM-AA), and (iii) contain at least six discernible procedural steps. The benchmark comprises:

  • Text-demo ICL: 500 questions, each providing a textual demonstration (step-by-step instructions).
  • Video-demo ICL: 500 questions, each with K = 3 candidate demonstration videos.
  • Demonstration Selection: 200 questions requiring selection of a relevant demonstration from 3–4 candidates.

This dataset is balanced across cooking, crafting, home repair, electronics, and other domains, ensuring diverse procedural content. Each test sample includes a “target” video V_t (typically ~5 minutes), a procedural question (such as predicting the next step or reasoning about causality), and a demonstration either as a short text list (T_d; 6–12 steps, 5–15 words each) or as one of multiple video clips (V_d) semantically filtered and human-validated for procedural consistency.
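The structure of a test sample described above can be sketched as a simple schema. This is an illustrative reconstruction, not the dataset's actual format; all field names are assumptions:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DemoICLSample:
    # Hypothetical schema for one benchmark item; field names are
    # illustrative, not the dataset's published format.
    target_video: str                 # ID of the ~5-minute target video V_t
    question: str                     # procedural question about V_t
    text_demo: Optional[List[str]]    # T_d: 6-12 ordered steps (text-demo regime)
    video_demos: Optional[List[str]]  # candidate clips V_d (video-demo regime, K = 3)
    answer: str                       # reference answer used for LLM-based grading

sample = DemoICLSample(
    target_video="vid_0001",
    question="After whisking the eggs, what is the next step?",
    text_demo=["Crack eggs into a bowl", "Whisk until smooth", "Heat a pan with butter"],
    video_demos=None,
    answer="Heat a pan with butter",
)
```

In the video-demo regime, `text_demo` would be `None` and `video_demos` would hold the K = 3 candidate clip IDs instead.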

Text demonstrations are derived by summarizing ASR transcripts into temporally ordered steps using Qwen2.5-72B, filtering for actions, fusing redundant steps, and refining by cross-inspection with 64 sampled video frames via Qwen2.5-VL-72B. Video demonstrations are selected by a ranking pipeline: top-10 videos per task (as per HowTo100M metadata) are re-ranked by Qwen2.5-72B embedding similarity, matched stepwise by auto-instructional extraction, then manually filtered for equivalence.
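The embedding-similarity re-ranking step in the video-demonstration pipeline can be sketched as follows. The cosine re-rank is the general technique; the toy vectors stand in for real Qwen2.5-72B embeddings, and the function names are assumptions:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank_demos(target_emb, candidates):
    # Sort (video_id, embedding) candidates by similarity to the
    # target-task embedding, most similar first.
    return sorted(candidates, key=lambda c: cosine(target_emb, c[1]), reverse=True)

ranked = rerank_demos([1.0, 0.0],
                      [("demo_a", [0.0, 1.0]), ("demo_b", [0.9, 0.1])])
# demo_b, nearly parallel to the target embedding, ranks first
```

In the actual pipeline this re-ranking is followed by stepwise instruction matching and manual filtering, which a similarity score alone cannot replace.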

Questions are generated to align with intermediate (non-terminal) steps; answers are open-ended but graded against references by an LLM prompt. They fall principally into:

  • Procedural next-step prediction (“After Step i, what is Step i+1?”),
  • Causal reasoning (“Why perform this action?”),
  • Demonstration selection (selecting the appropriate demonstration among distractors).
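Since answers are open-ended, grading relies on an LLM judge. The paper does not reproduce its grading template, so the following prompt builder is purely hypothetical, shown only to make the protocol concrete:

```python
def build_grading_prompt(question, reference, prediction):
    # Hypothetical judge prompt; the benchmark's actual template
    # is not published in this summary.
    return (
        "You are grading an answer to a procedural video question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with CORRECT or INCORRECT."
    )
```

A judge model's binary verdicts would then feed directly into the accuracy metrics defined in Section 2.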

No development split is provided; all training and hyperparameter tuning is performed on a large pool outside the test set.

2. Evaluation Metrics and Protocol

Evaluation metrics directly probe in-context learning capability:

  • Demo Acc: Accuracy with the correct demonstration as context.
  • w/o Demo Acc: Accuracy without demonstrations.
  • Δ_ICL: Demo Acc − w/o Demo Acc, quantifying the true in-context learning benefit.
  • S.Acc: For demonstration selection, proportion selecting the correct demonstration.
  • Acc: Accuracy on the subsequent procedural question, conditioned on the selected demonstration.

All metrics are reported as percentages.
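The three core metrics can be computed from per-question correctness flags. A minimal sketch (function name is an assumption):

```python
def icl_metrics(demo_correct, no_demo_correct):
    # Demo Acc, w/o Demo Acc, and their difference Δ_ICL, all in %.
    # Inputs are parallel lists of booleans, one per benchmark question.
    assert len(demo_correct) == len(no_demo_correct)
    n = len(demo_correct)
    demo_acc = 100.0 * sum(demo_correct) / n
    wo_acc = 100.0 * sum(no_demo_correct) / n
    return {"Demo Acc": demo_acc, "w/o Demo Acc": wo_acc, "Δ_ICL": demo_acc - wo_acc}
```

For instance, Demo-ICL's reported 43.4% with demonstrations versus 29.4% without yields Δ_ICL = +14.0.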

The data splits are as follows:

| Split | Text-demo ICL | Video-demo ICL | Demo Selection | Total |
|---|---|---|---|---|
| Training | 5,000 | 2,000 | 1,000 | 8,000 |
| Benchmark/Test | 500 | 500 | 200 | 1,200 |

Baseline comparisons include human annotators (84.0% text-demo accuracy, 80.4% video-demo, 88.0% demonstration selection S.Acc), proprietary MLLMs (Gemini-2.5-Pro, GPT-4o), and open-source models (Qwen2-VL, Ola, LLaVA-Video, etc.).

3. Model Architecture and Training Paradigm

Demo-ICL is based on the Ola-Video MLLM, comprising:

  • Visual encoder: OryxViT, which processes an arbitrary number of frames at arbitrary resolution (up to 64 frames at 288×288–480×480).
  • Language backbone: Qwen2.5-7B, supporting a 16,384 token context window.

Only the training recipe is modified; the architecture is unchanged.

Training proceeds in two stages:

  1. Video-Supervised Fine-Tuning (SFT): The model is trained on millions of multimodal pairs (image–text, video–text), including instructional video datasets (COIN, CrossTask) but excluding Demo-ICL-Bench proper. The loss is cross-entropy over P_θ(y | x, D), where x = (V_t, q) and D is the set of demonstrations.
  2. Information-Assisted Direct Preference Optimization (DPO): DPO is used for preference modeling on response alternatives, augmented with “assistive information”:
    • For text-demo ICL, this includes precise step timestamps aligning demonstrations to target video;
    • For video-demo ICL, this involves paired text instructions for each demonstration video.

Pairwise preferences P = {(x, R_c, R_r)} are then used to train a reward model r_φ (using the Bradley–Terry distribution and a cross-entropy objective), and the policy is iteratively optimized over T rounds, periodically regenerating P with the latest model. This methodology reinforces both correct answer selection and sensitivity to provided demonstrations.
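The Bradley–Terry cross-entropy objective for the reward model can be sketched per preference pair. The scalar rewards below stand in for r_φ outputs, which in practice come from a neural network:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    # Per-pair loss -log sigmoid(r_phi(x, R_c) - r_phi(x, R_r)),
    # written as log1p(exp(-margin)) for numerical stability
    # when the margin is large and positive.
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))
```

The loss decreases as the chosen response's reward exceeds the rejected one's; at zero margin it equals log 2, the entropy of an uninformed preference.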

4. Empirical Results and Findings

Empirical evaluation demonstrates that Demo-ICL-Bench is substantially more challenging than prior video QA benchmarks. Human annotators’ accuracy remains high; state-of-the-art MLLMs perform at a considerably lower level.

Table summarizing main results:

| Model | Size | Frames | Text-demo ICL (Demo / w/o / Δ) | Video-demo ICL (Demo / w/o / Δ) | Demo Selection (S.Acc / Acc) | Avg |
|---|---|---|---|---|---|---|
| Human | – | – | 84.0 / – / – | 80.4 / – / – | 88.0 / 76.0 | 80.1 |
| Gemini-2.5-Pro | – | 32 | 54.4 / – / – | 36.2 / – / – | – / 26.0 | 38.9 |
| GPT-4o | – | 32 | 48.8 / – / – | 31.4 / – / – | – / 24.5 | 34.9 |
| Qwen2-VL | 7B | 32 | 29.0 / 21.8 / +7.2 | 22.4 / 24.0 / –1.6 | 38.0 / 14.5 | 22.0 |
| Demo-ICL (SFT) | 7B | 32 | 38.4 / 27.8 / +10.6 | 29.4 / 26.2 / +3.2 | 54.5 / 21.5 | 29.8 |
| Demo-ICL | 7B | 32 | 43.4 / 29.4 / +14.0 | 32.0 / 27.6 / +4.4 | 58.0 / 24.0 | 33.1 |

Demo-ICL consistently outperforms all open video MLLMs of equivalent scale; for example, Demo-ICL achieves 43.4% Text-demo ICL (Demo) accuracy, versus 29.0% for Qwen2-VL. The ICL gain (Δ_ICL) confirms that models can and do leverage explicit demonstrations in context.

Ablation studies reveal that:

  • Increasing the number of frames (32→128) delivers +1.0 percentage point.
  • Using “oracle” demonstration (demo=test) boosts accuracy by +9.2 pp.
  • Reliance on ASR captions for textual alignment adds +16.0 pp.
  • Info-Assisted DPO confers +2.4 pp over vanilla DPO.
  • Pretraining exclusively with instructional videos is necessary; omitting such data reduces performance to 26.4% average.

Qualitative error analysis indicates models are especially challenged by fine-grained visual-text temporal mismatch and subtle demonstration–target domain shifts. For selection tasks, distractors with procedural overlap still often mislead base MLLMs.

5. Design Significance and Comparison to Previous Benchmarks

Demo-ICL-Bench departs from most prior video QA datasets (which test static recall or recognition) by compelling dynamic learning in context. The intricate pairing of demonstrations (independently sourced, rigorously validated), carefully-aligned text summaries, and multiple distractors per task amplify both annotation quality and evaluation difficulty. The introduction of both text-based and video-based demonstration modalities, combined with auxiliary demonstration-selection tasks, reflects real-world multimodal ICL demands—where both symbolic and perceptual precedents might be available in a lifelong or session-based learning regime.

The two-stage Demo-ICL training recipe, which leaves the architecture entirely unchanged, demonstrates that ICL performance is highly dependent on training methodology and modality alignment, rather than brute model scale or vision-backbone complexity.

6. Open Challenges and Future Directions

Limitations acknowledged by the authors include the benchmark’s restriction to a training-driven rather than architectural synthesis of ICL capacity, its exclusive use of human-aligned procedural videos, and a lack of exploration into web-scale demonstration mining. A plausible implication is that future MLLMs will require:

  • Purpose-built architectures for temporally compositional video-text representation and alignment in ICL contexts.
  • Multi-modal ICL regimes where text, video, and potentially symbolic state/action traces co-occur.
  • Systems able to retrieve and integrate relevant demonstrations at web scale, perhaps in online or continual learning settings.
  • More sophisticated evaluation settings probing causal reasoning and long-horizon compositionality.

By surfacing these deficiencies and standardizing a difficult ICL regime, Demo-ICL-Bench provides a foundation for systematic improvement of video-grounded in-context learning (Dong et al., 9 Feb 2026).
