JavisInst-Omni: Multimodal Instruction Dataset

Updated 9 February 2026

JavisInst-Omni is a comprehensive dataset family enabling instruction-tuning for multimodal LLMs across vision, language, action, and audio-video streams.
It integrates unified tokenization and diverse sampling protocols from gameplay and audiovisual sources to support complex, real-world interaction tasks.
Methodologies include LLM-driven instruction synthesis, fine-grained segmentation, and rigorous human-in-the-loop verification to meet benchmark quality standards.

JavisInst-Omni is a family of instruction-tuning datasets designed to enable highly capable, multimodal LLMs (MLLMs) for instruction-following tasks involving complex interactions among vision, language, action, and, in certain variants, synchronized audio-video streams. Datasets under the JavisInst-Omni name span grounded embodied learning scenarios (e.g., open-world Minecraft agents) as well as synchronous sounding-video comprehension and generation, supporting unified benchmarking and instruction tuning for next-generation multimodal AI systems (Wang et al., 2024, Liu et al., 28 Dec 2025).

JavisInst-Omni refers to two core datasets introduced independently to support large-scale vision-language-action (VLA) models and multimodal LLMs for sounding-video (JAV) comprehension and generation, respectively.

OmniJARVIS JavisInst-Omni (Wang et al., 2024): A ∼1T-token-scale, fully annotated, multimodal corpus for training open-world instruction-following agents in Minecraft. It consists of 600,000 chunked Minecraft interactions with language, vision, and action segments, designed for unified tokenization and end-to-end training of VLA transformers.
JavisGPT JavisInst-Omni (Liu et al., 28 Dec 2025): A ≈200,000-example dataset curated for instruction tuning unified audio-video-text LLMs. It supports both comprehension (QA, captioning) and generation (text→AV, AV-conditioned) in temporally aligned, sounding-video settings, with careful dialogue curation and role labeling.

This dual use of the "JavisInst-Omni" name reflects converging efforts to construct large, high-quality, instruction-centric datasets for autonomous agents in both grounded (embodied) and audio-visual media environments.

2. Data Collection Methodologies and Construction Pipeline

The construction processes are tailored to their respective domains:

For OmniJARVIS JavisInst-Omni (Wang et al., 2024):

Source Trajectories: Draws primarily from OpenAI’s VPT Minecraft gameplay dataset and JARVIS-1 agent interactions, extracting 600,000 chunks of 128 frames each.
Sampling Protocol: Segments are sampled every 32 frames, enforcing regular re-planning intervals and granular behavior segmentation.
Task Suite:
- Atomic tasks: Four low-level skill types (e.g., chop trees, dig dirt).
- Programmatic tasks: Thirty medium-horizon crafting tasks, grouped by difficulty (wooden, food, stone, iron, diamond).
- Open-ended creative: Freeform, unconstrained goals (e.g., “build an oak boat and sail”).
- Embodied QA (EQA): Agent-centric question answering, requiring memory, reasoning, and planning.
Instruction and Reasoning Generation: Instructions are synthesized by prompting GPT-3.5 based on gameplay meta-events; agent reasoning (chain-of-thought) and memory segments are generated with LLM summarization.
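One way to read the sampling protocol above is as a sliding window: 128-frame chunks whose start positions advance by 32 frames. The sketch below illustrates that reading only. The function name `chunk_starts` and the overlapping-window interpretation are assumptions, not details confirmed by the source.

```python
def chunk_starts(num_frames: int, window: int = 128, stride: int = 32) -> list[int]:
    """Return the start index of every full `window`-frame chunk, stepping by `stride`."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Only full windows are kept; dropping a trailing partial chunk is an assumption.
    return list(range(0, num_frames - window + 1, stride))


# A 256-frame trajectory yields five overlapping 128-frame chunks.
starts = chunk_starts(256)
print(starts)  # [0, 32, 64, 96, 128]
```

Under this reading, consecutive chunks overlap by 96 frames, which would match the 32-frame re-planning interval described above.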

For JavisGPT JavisInst-Omni (Liu et al., 28 Dec 2025):

Modalities and Aggregate Sources:
- Audio-only QA (55K) from AudioSet, AudioCaps, VGGSound, and others.
- Video-only QA (60K) from LLaVA-Video-178K.
- Image-only QA (20K) from LLaVA-OneVision.
- AV Comprehension and Generation: 95K AV-QA from VideoLLaMA2, 20K AV captions from TAVGBench, 150K text→AV from TAVGBench, with 200K total dialogues.
Synthetic Generation and Curation:
- JavisInst-Und (110K): Comprehension dialogues, created via GPT-4o using prompt templates specific to subcategory and difficulty.
- JavisInst-Gen (90K): Generation-oriented dialogues, created with 3,000 prompt templates and further paraphrased for diversity using GPT-4o-mini.
- Human-in-the-loop: Manual spot-checking (≈10%) for compliance and correctness, with re-prompting as needed.
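The roughly 10% manual spot-check can be pictured as drawing a reproducible random subset of dialogue IDs for review. This is a minimal sketch under that assumption; `select_spot_check`, the ID format, and the fixed seed are hypothetical and not taken from the source, which does not say how the checked subset was chosen.

```python
import random


def select_spot_check(dialogue_ids: list[str], fraction: float = 0.10, seed: int = 0) -> list[str]:
    """Pick a reproducible random subset of dialogues for manual review."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not dialogue_ids:
        return []
    k = max(1, round(len(dialogue_ids) * fraction))
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return sorted(rng.sample(dialogue_ids, k))


ids = [f"dlg-{i:05d}" for i in range(1000)]  # hypothetical IDs
review_batch = select_spot_check(ids)
print(len(review_batch))  # 100
```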

3. Data Representation, Tokenization, and Dialogue Structuring

Both datasets emphasize unified tokenization and rich multimodal context encoding.

OmniJARVIS JavisInst-Omni (Wang et al., 2024):

Segment Structure and Format:
- $D^{(\text{inst})} \rightarrow D^{(\text{mem})} \rightarrow D^{(\text{obs})} \rightarrow D^{(\text{tht})} \rightarrow D^{(\text{bhv})} \rightarrow D^{(\text{obs})} \ldots$
- Instruction ( $D^{(\text{inst})}$ ): Natural language summary of the goal.
- Memory ( $D^{(\text{mem})}$ ): LLM-summarized subtask history.
- Observation ( $D^{(\text{obs})}$ ): 128×128 RGB frame, tokenized by LLaVA-7B’s vision encoder.
- Chain-of-thought ( $D^{(\text{tht})}$ ): LLM-produced rationale, conditioned on the current context.
- Behavior ( $D^{(\text{bhv})}$ ): 128-frame trajectory, compacted into 5 discrete behavior tokens via Finite Scalar Quantization (FSQ).
Unified Tokenizer: A combined vocabulary merging BPE language tokens (~50K), special visual prefix tokens, and 35 behavior tokens enables a single transformer to model text, vision, and action in an autoregressive sequence.
Packing Example:

[INST] "..."; [MEM] "..."; <Vis-tokens>; [THOUGHT] "..."; <BHV-12>, <BHV-07>, <BHV-03>, <BHV-29>, <BHV-15>

At inference, the agent emits chains-of-thought and behavior tokens in alternation every N=32 frames.
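The packing layout above can be reproduced with a small string-building helper. This sketch only mirrors the textual layout of the packing example; `pack_segment` and its validation rules are illustrative assumptions, and a real pipeline would emit token IDs rather than a display string.

```python
NUM_BEHAVIOR_TOKENS = 35  # behavior vocabulary size stated in the text
TOKENS_PER_SEGMENT = 5    # behavior tokens per 128-frame trajectory


def pack_segment(instruction: str, memory: str, thought: str,
                 behavior_ids: list[int], vis_placeholder: str = "<Vis-tokens>") -> str:
    """Render one instruction/memory/observation/thought/behavior segment as text."""
    if len(behavior_ids) != TOKENS_PER_SEGMENT:
        raise ValueError(f"expected {TOKENS_PER_SEGMENT} behavior tokens")
    if any(not 0 <= i < NUM_BEHAVIOR_TOKENS for i in behavior_ids):
        raise ValueError("behavior token id out of range")
    behavior = ", ".join(f"<BHV-{i:02d}>" for i in behavior_ids)
    return (f'[INST] "{instruction}"; [MEM] "{memory}"; {vis_placeholder}; '
            f'[THOUGHT] "{thought}"; {behavior}')


print(pack_segment("...", "...", "...", [12, 7, 3, 29, 15]))
```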

JavisGPT JavisInst-Omni (Liu et al., 28 Dec 2025):

Dialogue Roles and Structure:
- System preamble (optional context)
- User: queries or issues instructions.
- Assistant: responds with answers and, in generation tasks, AV placeholders ("<|av_start|>...[video+audio]...<|av_end|>").
Instruction Taxonomy:
- Comprehension: Entity, Relation, and Global-level queries (existence, alignment, grounding, counting; spatial, temporal, causal; emotion, atmosphere, theme).
- Generation: Formal/colloquial instruction→AV, conditioning by modality (V2A, A2V, I2AV, AV-Extend, AV-Edit), and multi-turn scenarios (Proactive, Rethink, Und2Gen).
Annotation Schema: Each dialogue is enriched with modality context, instruction type, difficulty, ground-truth answers, and explanation fields. Provenance and human-verification flags are included.
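To make the annotation schema concrete, a single dialogue could be held in a record like the one below. The source lists the kinds of fields but not their names or value formats, so every field name and example value here is hypothetical and shown only for illustration.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class Turn:
    role: str  # "system", "user", or "assistant"
    text: str


@dataclass
class DialogueRecord:
    modality_context: str   # which modalities accompany the dialogue
    instruction_type: str   # comprehension or generation subcategory
    difficulty: str
    turns: list[Turn]
    ground_truth: str
    explanation: str
    provenance: str         # where the dialogue came from
    human_verified: bool = False


# Illustrative values only; none of these labels are confirmed by the source.
record = DialogueRecord(
    modality_context="audio+video",
    instruction_type="comprehension/entity/alignment",
    difficulty="easy",
    turns=[
        Turn("user", "Does the lion in the video make the roaring sound in the audio?"),
        Turn("assistant", "Yes. The audio roar aligns with the visual of the lion roaring."),
    ],
    ground_truth="Yes",
    explanation="The audio roar aligns with the visual of the lion roaring.",
    provenance="synthetic",
    human_verified=True,
)
print(json.dumps(asdict(record), indent=2))
```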

4. Tokenization Algorithms and Multimodal Integration

Both datasets advance unified tokenization and effective merging of multimodal information streams.

OmniJARVIS JavisInst-Omni (Wang et al., 2024):

Behavior Tokenization via FSQ:
- Trajectory latent $z_e = \text{Encoder}(o_{1:128}) \in \mathbb{R}^d$ is quantized with FSQ: each of the 5 quantized dimensions is rounded to the nearest of a fixed set of levels, with level counts [8, 8, 8, 6, 5], giving 8 + 8 + 8 + 6 + 5 = 35 unique behavior tokens.
- Objective:
$\underset{\phi, \theta}{\arg\min}\; \mathbb{E}_{\tau \sim D} \left[-\sum_{t=1}^{128} \log \pi_\theta(a_t \mid o_{1:t}, \{s_1, \ldots, s_5\}) \right]$
where $\{s_1, \ldots, s_5\}$ are the five behavior tokens of the trajectory.
- Enables compact, semantically meaningful trajectory representations that facilitate long-range planning with efficient autoregressive modeling.
Vocabulary Fusion: Language, vision, and behavior tokens are merged in a single autoregressive stream, allowing the transformer to interleave text instructions, visual context, reasoning, and discrete actions for end-to-end instruction-following.
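The quantization step can be sketched in plain Python using the bounded-rounding form commonly used for Finite Scalar Quantization. This is an inference-time illustration only (no encoder and no gradient handling). The per-dimension `OFFSETS` that map indices into a shared 35-token range are an assumption about how the vocabulary might be laid out, not something the source specifies.

```python
import math

LEVELS = [8, 8, 8, 6, 5]      # per-dimension level counts from the text
OFFSETS = [0, 8, 16, 24, 30]  # assumed start of each dimension's token-id range


def fsq_indices(z: list[float], levels: list[int] = LEVELS, eps: float = 1e-3) -> list[int]:
    """Quantize each latent dimension to an integer index in [0, L - 1]."""
    if len(z) != len(levels):
        raise ValueError("latent and levels must have the same length")
    indices = []
    for value, num_levels in zip(z, levels):
        half_l = (num_levels - 1) * (1 - eps) / 2
        offset = 0.0 if num_levels % 2 == 1 else 0.5  # even level counts need a half-step shift
        shift = math.atanh(offset / half_l)
        bounded = math.tanh(value + shift) * half_l - offset  # squash into the level range
        indices.append(round(bounded) + num_levels // 2)      # shift to a non-negative index
    return indices


def behavior_token_ids(z: list[float]) -> list[int]:
    """Map a 5-dimensional latent to 5 token ids in [0, 34]."""
    return [off + idx for off, idx in zip(OFFSETS, fsq_indices(z))]


print(behavior_token_ids([0.0, 0.0, 0.0, 0.0, 0.0]))  # [4, 12, 20, 27, 32]
```

Because each dimension has a fixed, small number of levels, there is no learned codebook to look up: the token for a dimension follows directly from rounding.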

JavisGPT JavisInst-Omni (Liu et al., 28 Dec 2025):

Tokenization and Training Objective:
- All instruction dialogues (text, AV, multi-turn) are tokenized for transformer-based modeling.
- The loss incorporates next-token prediction, alignment between LLM queries and DiT condition embeddings, and diffusion denoising for AV generation:
$\mathcal{L} = \mathcal{L}_{\text{ntp}} + \mathcal{L}_{\text{align}} + \mathcal{L}_{\text{diff}}$
Synchrony-Aware Data: Emphasizes explicit temporal alignment of video and audio in both comprehension and generation, enforced at the data and model architecture level.

5. Dataset Statistics and Quality Assurance

Rigorous annotation and curation underpin both datasets.

| Dataset | Scale | Modalities | Dialogue Types | Key Statistics and Quality Metrics |
| --- | --- | --- | --- | --- |
| OmniJARVIS JavisInst-Omni | 600K interactions | Text, vision, action | Instruction-following, EQA | ~1T tokens, ≥0.9 atomic reward, 59% programmatic success |
| JavisGPT JavisInst-Omni | ≃200K dialogues | Audio, video, text | QA (JavisInst-Und), Gen (JavisInst-Gen) | 98% format compliance, ≥95% QA accuracy, split of 0.55 comprehension and 0.45 generation |

OmniJARVIS JavisInst-Omni:
- ~600,000 interaction segments, ~900M tokens for instruction-following, 300,000 QA dialogues, for a total of ~1T tokens.
- Benchmarks: Atomic tasks (reward >0.9), programmatic (59% agent success vs. baselines), creative (FSD 886.25 vs. 929–975), EQA (GPT-4 Elo 8.40).
JavisGPT JavisInst-Omni:
- N_total ≃ 200,000; p_und = 0.55, p_gen = 0.45.
- Spot-check accuracy: ≥98% format compliance, ≥95% QA answer correctness. Inconsistent or incorrect dialogues (<2%) are dropped or regenerated.
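The comprehension/generation split follows directly from the subset sizes given earlier (110K JavisInst-Und and 90K JavisInst-Gen dialogues). The short check below just redoes that arithmetic; the variable names are illustrative.

```python
und_dialogues = 110_000  # JavisInst-Und
gen_dialogues = 90_000   # JavisInst-Gen

total = und_dialogues + gen_dialogues
p_und = und_dialogues / total
p_gen = gen_dialogues / total
print(total, p_und, p_gen)  # 200000 0.55 0.45
```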

6. Example Instances and Dialogue Illustrations

Representative samples demonstrate the multimodal, instruction-centric character:

OmniJARVIS JavisInst-Omni Example (Wang et al., 2024):

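Packed token sequence for a crafting task: $D^{(\text{inst})} \rightarrow D^{(\text{mem})} \rightarrow D^{(\text{obs})} \rightarrow D^{(\text{tht})} \rightarrow D^{(\text{bhv})} \rightarrow D^{(\text{obs})} \ldots$. The agent alternately emits chain-of-thought and behavior tokens every 32 frames using the full context history.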

JavisGPT JavisInst-Omni Example (Liu et al., 28 Dec 2025):

Comprehension (entity-alignment):
- Context: Video shows a lion; audio is a roaring sound.
- User: “Does the lion in the video make the roaring sound in the audio?”
- Assistant: “Yes. The audio roar aligns with the visual of the lion roaring.”
Generation (multi-turn, proactive):
- User: “Create a video where a car drives by with an orange balanced on its roof.”
- Assistant: “Sure—here’s a clip of a red sedan passing by with a bright orange perched on top, you can hear the soft hum of the engine and the gentle thump as the fruit wobbles.”
- <|av_start|>[8s 240×240 video + 16 kHz stereo audio]<|av_end|>
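When consuming generation dialogues like the one above, the AV placeholder span has to be separated from the assistant's text. The regex-based helper below is one possible way to do that, assuming the `<|av_start|>` and `<|av_end|>` markers appear literally in the response string. `split_av_spans` is a hypothetical name, not part of any released tooling.

```python
import re

# Non-greedy match so that multiple AV spans in one response are kept separate.
AV_SPAN = re.compile(r"<\|av_start\|>(.*?)<\|av_end\|>", re.DOTALL)


def split_av_spans(response: str) -> tuple[str, list[str]]:
    """Separate the textual reply from the contents of AV placeholder spans."""
    spans = AV_SPAN.findall(response)
    text = AV_SPAN.sub("", response).strip()
    return text, spans


reply = "Sure, here is the clip. <|av_start|>[8s 240×240 video + 16 kHz stereo audio]<|av_end|>"
text, spans = split_av_spans(reply)
print(text)   # Sure, here is the clip.
print(spans)  # ['[8s 240×240 video + 16 kHz stereo audio]']
```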

7. Usage Scenarios, Benchmarks, and Adoption

JavisInst-Omni datasets are designed as comprehensive resources for benchmarking and developing open-ended, instruction-following multimodal agents.

OmniJARVIS JavisInst-Omni:
- Trains and evaluates VLA transformers in open-world settings; supports atomic, programmatic, and creative task spectra.
- Evaluation: Task-specific metrics (reward, success rate, FSD for creativity, EQA Elo).
JavisGPT JavisInst-Omni:
- Enables MLLMs for audio/video QA, captioning, AV generation, and multi-turn instruction-following.
- Benchmarks supported: AVQA, MU-AVQA, AVSD, JavisBench-mini (generation), and chatbot scenarios.
- Reported gains: AVQA accuracy (91.5%→93.8%), FVD (327.8→317.5; lower is better), and AV synchrony (0.153→0.157).

Taken together, these results suggest that the JavisInst-Omni datasets support the scaling and unification of multimodal LLMs across instruction-centric embodied and audio-visual media domains, providing both compositional supervision and systematic evaluation (Wang et al., 2024, Liu et al., 28 Dec 2025).


References (2)

Wang et al. (2024). OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents.

Liu et al. (2025). JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation.
