
Demo-ICL-Bench: Procedural ICL in Videos

Updated 11 February 2026
  • Demo-ICL-Bench is a benchmark that evaluates multimodal language models' ability to learn procedural tasks from videos using in-context demonstrations.
  • It employs a rigorous data curation process with LLM pipelines and human verification to extract and align procedural steps from the HowTo100M corpus.
  • The evaluation protocol uses metrics like Demo Accuracy and ICL gain to quantify improvements when incorporating text-based and video-based demonstrations.

Demo-ICL-Bench is a benchmark specifically constructed to evaluate the capacity of Multimodal LLMs (MLLMs) for in-context learning (ICL) in the domain of procedural knowledge acquisition from videos. The primary goal is to assess not a model’s static knowledge, but rather its ability to dynamically learn and adapt to novel procedural video contexts given few in-context demonstrations, encompassing both text (stepwise instructions) and video exemplars. Demo-ICL-Bench emerges from the broader need to move beyond traditional video understanding tasks, which typically measure retrieval or comprehension from static learned knowledge, toward a paradigm that assesses on-the-fly transfer and adaptation from in-context demonstrations (Dong et al., 9 Feb 2026).

1. Benchmark Construction and Data Curation

Demo-ICL-Bench is derived from the HowTo100M corpus, which comprises over 1.2 million narrated YouTube instructional videos spanning approximately 23,000 activities. For the benchmark, only English-language videos of moderate length (2–10 minutes) with high-quality ASR transcripts (HTM-AA/WhisperX) are retained. The data construction strategy involves three core demonstration modalities:

  • Text Demonstrations: Each selected video’s ASR transcript is summarized into a sequence of numbered procedural steps. This is accomplished using a two-stage LLM pipeline: Qwen2.5-72B first generates coarse outlines, and Qwen2.5-VL-72B jointly uses sampled frames to refine and ground each step, producing concise, human-interpretable instructional sequences.
  • Video Demonstrations: Pairs of videos depicting the same procedural task are constructed. Candidate pairs are first selected from metadata (ranking, title similarity) and further checked by comparing their LLM-generated stepwise instructions to ensure close alignment of demonstrated procedures.
  • Demonstration Selection: For increased task difficulty, the benchmark includes a subset where the model must first select the correct demonstration from a distractor pool, before answering the procedural question.

The final benchmark comprises 500 text-demo questions, 500 video-demo questions, and 200 demonstration-selection items, totaling 1,200 manually curated test samples.
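The video-pair mining step described above can be sketched as a two-stage filter: propose candidates by title similarity, then keep only pairs whose generated step lists align closely. This is an illustrative sketch only; the thresholds, field names, and the `difflib`-based similarity measure are assumptions, not the benchmark's actual pipeline (which uses metadata ranking and LLM-based comparison).

```python
from difflib import SequenceMatcher

def title_sim(a, b):
    """Crude string similarity in [0, 1] (stand-in for metadata/LLM scoring)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def step_alignment(steps_a, steps_b):
    """Fraction of steps in A that have a near-match in B."""
    hits = sum(any(title_sim(s, t) > 0.6 for t in steps_b) for s in steps_a)
    return hits / len(steps_a)

def keep_pair(video_a, video_b, t_title=0.5, t_steps=0.7):
    """Accept a candidate pair only if both titles and step lists align."""
    return (title_sim(video_a["title"], video_b["title"]) > t_title
            and step_alignment(video_a["steps"], video_b["steps"]) > t_steps)
```

In the actual benchmark, the second filter compares LLM-generated stepwise instructions rather than raw strings, but the accept/reject structure is the same.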

2. Task Definition and Evaluation Protocol

Demo-ICL-Bench formalizes three hierarchical tasks:

  1. Text-Demo ICL: Given a video V and its text demonstration T, the model must predict the next procedural step (“What comes next?”) at a chosen intermediate point in the instruction sequence.
  2. Video-Demo ICL: Provided a target video V and a reference demonstration video V_D, the model is required to generate the next step for V. Alignments between steps are rigorously mapped by LLMs, with human verification of answerability.
  3. Demonstration Selection (DS): Given V and a set of four candidate demonstrations, the model must select the most suitable demonstration and then answer the next-step question conditioned on its selection.
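The "What comes next?" query format for Text-Demo ICL can be sketched as follows: the demonstration steps up to a cut point form the context, and the step after the cut is the gold answer. The prompt wording and function name here are illustrative assumptions, not the benchmark's exact template.

```python
def build_text_demo_query(steps, cut):
    """Build a next-step question from an ordered list of procedural steps.

    steps: full ordered list of steps; cut: index of the first hidden step.
    Returns (prompt, gold_answer).
    """
    shown = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps[:cut]))
    prompt = f"Demonstration steps so far:\n{shown}\nWhat comes next?"
    return prompt, steps[cut]
```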

Evaluation metrics are constructed as follows:

  • Demo. Acc: Accuracy when using demonstrations.
  • w/o Demo: Accuracy without using demonstrations (model relies on internal knowledge only).
  • Δ_ICL: ICL gain, computed as the difference between performance with and without demonstrations.
  • S.Acc (DS): Success rate of demonstration selection.
  • Avg: Arithmetic mean of accuracies across tasks.
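These metrics reduce to simple aggregation over per-item correctness. A minimal sketch, assuming correctness is recorded as booleans per question (the data layout is an assumption; the correct counts in the toy example are illustrative):

```python
def demo_icl_metrics(with_demo, without_demo):
    """Compute Demo. Acc, w/o Demo accuracy, and the ICL gain Δ_ICL.

    with_demo / without_demo: lists of booleans indicating whether the model
    answered each question correctly with / without the demonstration.
    """
    demo_acc = 100.0 * sum(with_demo) / len(with_demo)
    no_demo_acc = 100.0 * sum(without_demo) / len(without_demo)
    return {
        "Demo. Acc": demo_acc,
        "w/o Demo": no_demo_acc,
        "Δ_ICL": demo_acc - no_demo_acc,  # gain attributable to the demos
    }

# Toy example: 500 items, 217 correct with demos, 147 correct without.
m = demo_icl_metrics([True] * 217 + [False] * 283,
                     [True] * 147 + [False] * 353)
```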

3. Demo-ICL Model: Architecture and Training Regime

The Demo-ICL model is implemented atop Ola-Video, deploying the OryxViT vision encoder for dense video frame representation (up to 64 frames at resolutions 288×288–480×480) combined with Qwen2.5 as the LLM backbone, supporting a 16,384-token context. Input demonstrations are encoded as follows:

  • Text demonstrations: Direct concatenation to the question prompt.
  • Video demonstrations: Sampled frames and (optionally) ASR captions from the demo video are passed through the same visual encoder and input into the Transformer alongside the target video.
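Reducing both demo and target videos to the fixed frame budget can be done with uniform index sampling; the sketch below assumes simple uniform spacing, which may differ from the model's actual sampler.

```python
def sample_frame_indices(num_frames, max_frames=64):
    """Pick up to max_frames uniformly spaced frame indices from a video."""
    if num_frames <= max_frames:
        return list(range(num_frames))  # short video: keep every frame
    step = num_frames / max_frames      # fractional stride, then floor
    return [int(i * step) for i in range(max_frames)]
```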

Demo-ICL training proceeds in two phases:

  1. Video-Supervised Fine-Tuning (VSF): The model undergoes large-scale supervised training on multimodal corpora aggregating millions of image–text and video–text pairs (e.g., from LLaVA-OneVision, VisualWebInstruct, LLaVA-Video, Oryx, COIN, Cross-Task), explicitly injecting demo-driven ICL samples. The cross-entropy loss is defined over the target response tokens, conditioned on both visual and demonstration context.

\mathcal{L}_{\mathrm{VSF}} = -\sum_{i=1}^{N} \sum_{j=1}^{|y^{(i)}|} \log p\left(y_j^{(i)} \mid y_{<j}^{(i)}, V^{(i)}, D^{(i)}\right)
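Numerically, this objective is plain token-level negative log-likelihood summed over responses. In the sketch below, `step_probs` is a hypothetical stand-in for the model's next-token distribution conditioned on the prefix, the video V, and the demonstration D:

```python
import math

def vsf_loss(batch, step_probs):
    """Summed NLL of target tokens given visual and demonstration context.

    batch: list of (target_tokens, video, demo) triples.
    step_probs(prefix, video, demo) -> dict mapping token -> probability.
    """
    loss = 0.0
    for target, video, demo in batch:
        for j, tok in enumerate(target):
            p = step_probs(target[:j], video, demo)[tok]  # p(y_j | y_<j, V, D)
            loss -= math.log(p)
    return loss
```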

  2. Information-Assisted Direct Preference Optimization (IADPO): This fine-tuning step leverages contrastive response pairs, augmented with auxiliary information (e.g., video timestamps or textual summaries) to supply preference signals. The reward model uses a Bradley–Terry preference formulation,

p^{*}(R_c \succ R_r \mid x, I) = \sigma\left(r^{*}(x, I; R_c) - r^{*}(x, I; R_r)\right)

The policy is updated by maximizing the preference for chosen responses using the learned reward.
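The Bradley–Terry formulation reduces to a sigmoid of the reward difference, so equal rewards give a 50/50 preference and a large margin pushes the probability toward 1. A minimal numeric sketch (the reward values are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(reward_chosen, reward_rejected):
    """P(R_c preferred over R_r | x, I) under the Bradley–Terry model,
    given scalar rewards r*(x, I; R_c) and r*(x, I; R_r)."""
    return sigmoid(reward_chosen - reward_rejected)
```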

4. Experimental Results and Comparative Analysis

Demo-ICL-Bench exposes marked difficulty even for state-of-the-art models. Human upper bounds on the benchmark stand at 84.0% (text-demo), 80.4% (video-demo), and 88% (demonstration selection), with overall average 80.1%. Proprietary models demonstrate moderate performance, e.g., Gemini-2.5-Pro achieves 54.4% (text-demo) and 36.2% (video-demo), and open-source baselines such as Qwen2.5-VL (7B) obtain 32.8% and 28.0% respectively.

The full Demo-ICL (7B) model achieves:

| Task       | Metric    | Demo-ICL (full, 7B) |
|------------|-----------|---------------------|
| Text-Demo  | Demo. Acc | 43.4                |
| Text-Demo  | w/o Demo  | 29.4                |
| Text-Demo  | Δ_ICL     | +14.0               |
| Video-Demo | Demo. Acc | 32.0                |
| Video-Demo | w/o Demo  | 27.6                |
| Video-Demo | Δ_ICL     | +4.4                |
| DS         | S. Acc    | 58.0                |
| DS         | Acc       | 24.0                |
| All        | Avg       | 33.1                |

A consistent, substantial gain (Δ_ICL) is observed when demonstrations are provided, particularly for text-based demonstrations. Performance generalizes to standard video understanding and temporal tasks (e.g., VideoMMMU, VideoMME, MVBench, LongVideoBench, MLVU), matching or exceeding other 7B models and approaching 72B-scale open-source systems.

Ablation studies on video-demo accuracy confirm that increasing the number of provided frames, supplying ASR captions, and repeating the demo video enhance accuracy, underscoring the criticality of fine-grained temporal and semantic alignment.

5. Design Choices in Data Generation and Demonstration Curation

The construction of Demo-ICL-Bench incorporates rigorous selection and alignment protocols:

  • Automatic Summarization: Multistage LLM-driven summarization of instructional transcripts ensures not only coverage of procedural steps but also semantic grounding in visual evidence via sampled frames.
  • Video Pair Mining and Alignment: Candidate video pairs are mined using metadata and LLM-based semantic similarity, then filtered by alignment of their instructional decompositions. This ensures high-quality exemplars for procedural transfer.
  • Human Verification: Key steps—especially for question answerability—are confirmed by annotators, mitigating annotation noise and enforcing robust testing of procedural reasoning.
  • Distractor Sampling: In the demonstration selection task, close “hard negative” candidate videos are sampled to challenge discriminative procedural transfer.
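The hard-negative sampling for demonstration selection can be sketched as follows: exclude any candidate depicting the same task, rank the remainder by surface similarity to the target, and draw distractors from the closest ones. Field names, the overlap measure, and the pool-size heuristic are all illustrative assumptions.

```python
import random

def sample_distractors(target, pool, k=3, seed=0):
    """Pick k 'close but wrong' demonstration videos for the target."""
    rng = random.Random(seed)
    candidates = [v for v in pool if v["task_id"] != target["task_id"]]

    def overlap(v):
        # crude title word-overlap score as a stand-in for semantic similarity
        a, b = set(target["title"].split()), set(v["title"].split())
        return len(a & b) / max(len(a | b), 1)

    candidates.sort(key=overlap, reverse=True)
    top = candidates[: max(k * 2, k)]  # keep some randomness among the closest
    rng.shuffle(top)
    return top[:k]
```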

6. Implications for Procedural In-Context Learning

Demo-ICL-Bench provides a test suite that surfaces several properties of current MLLMs:

  • Procedural Generalization: Most models exhibit only modest benefit from procedural demonstrations, with clear performance gaps between open-source and proprietary systems, and substantial headroom to human-level performance.
  • Demonstration Modality Effects: Text-based demonstrations yield consistently higher Δ_ICL, suggesting that alignment and abstraction of procedural steps in text form are more tractable for current models than direct video-to-video transfer.
  • Granularity and Temporal Reasoning: Adding temporally aligned ASR information substantially improves video-demo ICL, revealing that models benefit from explicit semantic–temporal cues.
  • Model and Training Scaling: Larger context lengths, dense frame sampling, and deep video-text pretraining are essential but not sufficient; targeted procedural ICL training and specialized direct preference optimization yield additional but incremental improvements.

A plausible implication is that progress on ICL in procedural video domains will depend upon improved cross-modal grounding, richer alignment of temporally structured demonstration content, and the development of architectures tailored for multi-step, multi-modal sequence induction.

7. Broader Context and Future Directions

Demo-ICL-Bench sets a new standard for evaluating demo-driven, few-shot learning in procedural video understanding. It complements existing benchmarks such as VL-ICL Bench, which focuses on image-text ICL for perception and reasoning (Zong et al., 2024), by directly addressing the challenge of dynamic procedural transfer in temporally extended domains. The benchmark’s methodology—joint curation of textual and video procedural demonstrations, rigorous alignment, and hard distractor-based selection—serves as a template for future multimodal ICL evaluation.

Potential future research directions suggested by this work include the integration of multimodal chain-of-thought, expanding demonstration selection granularity, scaling to longer activity chains, and refining architectures for segment-level temporal compositionality.

Demo-ICL-Bench, together with the corresponding Demo-ICL model, advances the field by enabling systematic, fine-grained measurement of procedural in-context learning abilities for the next generation of video-capable MLLMs (Dong et al., 9 Feb 2026).
