MMSynthetic-20K: Synthetic Multimodal Dataset
- MMSynthetic-20K is a synthetic multimodal dataset comprising 20,000 image–question–answer triplets designed to push the boundaries of real-world reasoning in MLLMs.
- The dataset is produced using the CADS framework, which combines adversarial data generation, judgment, and context optimization to yield high-quality, challenging samples.
- Empirical results indicate that training on MMSynthetic-20K notably improves performance on multimodal benchmarks, establishing new open-source state-of-the-art results on reasoning tasks.
MMSynthetic-20K is a synthetic multimodal dataset comprising 20,000 image–question–answer triplets constructed to facilitate the training and evaluation of Multimodal LLMs (MLLMs) on complex real-world reasoning tasks. Developed via the Collective Adversarial Data Synthesis (CADS) framework, MMSynthetic-20K is designed to provide high-quality, diverse, and challenging multimodal samples that enhance model capability beyond what is achievable with purely real or templated data (Zhang et al., 3 Feb 2026).
1. Collective Adversarial Data Synthesis Framework
The construction of MMSynthetic-20K is orchestrated by the CADS framework, which alternates between two key phases: Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). An additional Adversarial Context Optimization mechanism iteratively steers the generation process toward “boundary” cases that are difficult yet valid, thereby maximizing training signal.
CAD-Generate initiates with a seed pool consisting of real or sketched descriptions. A committee of $K$ MLLMs, formalized as $\{M_1, \dots, M_K\}$, operates as prompt engineers to generate new multimodal instances: each committee member performs (i) rationale analysis to extract knowledge domains and reasoning chains, (ii) synthesis using one of four meta-strategies (Parameter Variation, Logic Reversion, Auxiliary Extension, Isomorphic Transfer), and (iii) construction of a detailed visual prompt for the image generator, Nano Banana Pro.
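The CAD-Generate step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the committee-member interface (`analyze`, `synthesize`, `build_visual_prompt`) and the `Candidate` record are hypothetical stand-ins; only the four meta-strategy names come from the paper.

```python
import random
from dataclasses import dataclass

# The four meta-strategies named in the paper.
META_STRATEGIES = (
    "Parameter Variation",
    "Logic Reversion",
    "Auxiliary Extension",
    "Isomorphic Transfer",
)

@dataclass
class Candidate:
    rationale: str      # extracted knowledge domains / reasoning chain
    strategy: str       # meta-strategy used for synthesis
    question: str
    answer: str
    visual_prompt: str  # detailed prompt for the image generator

def cad_generate(seed: str, committee: list) -> list[Candidate]:
    """One CAD-Generate round: each committee member turns a seed into a candidate."""
    candidates = []
    for member in committee:
        rationale = member.analyze(seed)                       # (i) rationale analysis
        strategy = random.choice(META_STRATEGIES)              # (ii) pick a meta-strategy
        question, answer = member.synthesize(seed, rationale, strategy)
        visual_prompt = member.build_visual_prompt(question)   # (iii) prompt for the image generator
        candidates.append(Candidate(rationale, strategy, question, answer, visual_prompt))
    return candidates
```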
CAD-Judge involves the same committee serving as judges. Each judge $M_k$, for $k = 1, \dots, K$, attempts to solve the generated question $q$, yielding predictions $\hat{a}_1, \dots, \hat{a}_K$. The consensus score

$$s = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left[\hat{a}_k = a\right] \tag{1}$$

determines quality: instances with $s = 0$ are deemed flawed and are discarded; those with unanimous consensus ($s = 1$) are retained.
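The consensus rule can be written compactly. This sketch assumes exact string matching between predictions and the ground truth, which is a simplification; the paper's actual answer-matching procedure may be more lenient.

```python
def consensus_score(predictions: list[str], ground_truth: str) -> float:
    """Fraction of committee judges whose prediction matches the ground truth."""
    return sum(p == ground_truth for p in predictions) / len(predictions)

def triage(predictions: list[str], ground_truth: str) -> str:
    """Route a candidate by consensus: discard, retain, or mark as boundary."""
    s = consensus_score(predictions, ground_truth)
    if s == 0.0:
        return "discard"   # no judge solved it: likely flawed
    if s == 1.0:
        return "retain"    # unanimous: valid and solvable
    return "boundary"      # disagreement: adversarial, feeds context optimization
```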
Adversarial Context Optimization identifies boundary examples where $0 < s < 1$, i.e., where the judges disagree. These challenging instances are fed back to update the generator prompt/context so as to maximize the production of such adversarial yet correct samples, thereby increasing the representation of high-value, non-trivial problems.
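Putting the pieces together, one CADS iteration alternates generation, judgment, and context steering. The loop below is an illustrative sketch: `generate` and `update_context` are hypothetical callables standing in for the committee's actual prompting machinery, and candidates are modeled as plain dicts.

```python
def cads_round(seed_pool, committee, context, generate, update_context):
    """One CADS iteration: CAD-Generate, CAD-Judge, then context optimization."""
    retained, boundary = [], []
    for seed in seed_pool:
        cand = generate(seed, committee, context)               # CAD-Generate
        preds = [m.solve(cand["question"]) for m in committee]  # CAD-Judge
        s = sum(p == cand["answer"] for p in preds) / len(preds)
        if s == 1.0:
            retained.append(cand)       # unanimous consensus: keep
        elif s > 0.0:
            boundary.append(cand)       # disagreement: hard-but-valid candidate
        # s == 0.0 -> discarded as flawed
    # Steer the generator toward producing more boundary-like cases.
    new_context = update_context(context, boundary)
    return retained, boundary, new_context
```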
2. Dataset Composition and Modal Diversity
MMSynthetic-20K consists of 20,000 synthetic samples, each including a generated image $I$, a question or instruction $q$, and a ground-truth answer $a$ (textual or numeric). Images are produced by Nano Banana Pro, guided by textual prompts that specify detailed object attributes, spatial relationships, and query intentions.
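A single record can be modeled as a small dataclass. The field names below are illustrative, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class MMSample:
    image_path: str  # image rendered by Nano Banana Pro from the visual prompt
    question: str    # question or instruction grounded in the image
    answer: str      # ground-truth answer (numeric answers stored as strings)
```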
The dataset covers a broad set of domains:
- Mathematical reasoning (geometry, algebra)
- Physical and biological reasoning (electrical circuits, cell biology)
- Chart understanding (bar, line, pie charts)
- General multimodal QA
Domain proportions are exemplified qualitatively in Figure 1 of the original work, but explicit distributional statistics are not reported. Diversity is intrinsic to the generation process, leveraging multiple seeds and four meta-strategies to ensure variation in both visual and reasoning attributes. No explicit diversity metric is provided.
3. Technical Details of Sample Generation
The CAD-Generate phase employs four cutting-edge LLMs as both prompt engineers and judges: GPT-4o, Gemini-2.5-Flash, DeepSeek-R1, and Claude-4-Sonnet. Prompts generated by the synthesis process feature exhaustive specifications, such as:
Draw a right triangle ABC with the stated angle and side length (in cm); extend the altitude from C to meet AB; label the given lengths; pose the question …
All candidate samples with consensus $s = 0$ are excluded, as per Equation (1). Adversarial examples ($0 < s < 1$) are utilized for further context tuning but need not be retained in the dataset.
4. Pre-processing and Use in MLLM Training
The pre-processing steps include:
- Removal of unsolvable or invalid instances ($s = 0$)
- Canonicalization of numeric answers to string representations
- Resizing images to match the input resolution requirements of Qwen2.5-VL-7B’s visual encoder
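The first two steps can be sketched in a few lines. This is an assumed implementation of the described filtering and answer canonicalization (the record layout and `consensus` field are hypothetical); the image-resizing step is elided, since the exact target resolution of Qwen2.5-VL-7B's encoder is not stated here.

```python
def canonicalize_answer(answer) -> str:
    """Canonicalize numeric answers to string form (e.g. 4.0 -> "4")."""
    if isinstance(answer, float) and answer.is_integer():
        answer = int(answer)
    return str(answer)

def preprocess(samples: list[dict]) -> list[dict]:
    """Drop discarded instances (consensus s = 0) and normalize answers."""
    kept = []
    for s in samples:
        if s["consensus"] == 0.0:
            continue  # unsolvable/invalid: removed
        kept.append(dict(s, answer=canonicalize_answer(s["answer"])))
    return kept
```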
The R1-SyntheticVL model is trained solely on MMSynthetic-20K, using Qwen2.5-VL-7B as the base architecture and Group Relative Policy Optimization (GRPO) within the EasyR1 codebase. Training uses 8 rollout samples per question, a batch size of 128, a rollout batch size of 256, and the learning rate reported in the original work, on 8 NVIDIA H20 GPUs. Mixing 2,000 MMSynthetic-20K samples with 2,000 real samples (a 2K+2K synthetic+real mix) yields superior performance compared to using 4K real samples alone, demonstrating the complementarity of synthetic and real data.
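The reported hyperparameters can be collected into a single configuration record. The dict layout below is illustrative only and does not reflect EasyR1's actual config schema; the learning rate is omitted because its value is not stated in this summary.

```python
# Hyperparameters reported for R1-SyntheticVL GRPO training (illustrative layout).
grpo_config = {
    "base_model": "Qwen2.5-VL-7B",
    "algorithm": "GRPO",
    "codebase": "EasyR1",
    "rollouts_per_question": 8,
    "train_batch_size": 128,
    "rollout_batch_size": 256,
    "hardware": "8x NVIDIA H20",
}
```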
5. Impact on State-of-the-Art Multimodal Reasoning
Training R1-SyntheticVL on MMSynthetic-20K establishes new open-source state-of-the-art results in several multimodal reasoning benchmarks.
| Benchmark/Test | Score (R1-SyntheticVL) | Baseline | Relative Gain |
|---|---|---|---|
| MathVista | 75.6 | ThinkLite-VL-7B: 75.1, Qwen2.5-VL-7B: 68.2 | +7.4 points |
| MMMU-Pro (Reasoning) | 47.8 | Qwen2.5-VL-7B: 42.5 | +5.3 points |
| 8-Benchmark Average | 52.0 | N/A | SOTA among open-source synthetic-only models |
Ablation studies reveal the incremental contribution of each CADS stage:
- Direct image generation with Nano Banana Pro alone improves MathVista from 68.2 to 70.8.
- Adding CAD-Generate increases performance to 73.0.
- Incorporating CAD-Judge yields 74.6.
- Full CADS, including Adversarial Context Optimization, achieves the best result at 75.6.
The scaling curve (Figure 2) shows that model performance increases monotonically as the synthetic data is scaled from 0.5K to 20K samples, and suggests continued benefit beyond 20K.
Mixing real and synthetic data also brings measurable benefits: 2K real plus 2K synthetic outperforms a 4K real-only configuration (74.6 vs. 72.2–73.3 on MathVista).
6. Significance, Limitations, and Future Perspectives
MMSynthetic-20K represents a substantial advance in large-scale, high-quality multimodal data synthesis. Its collective, adversarial synthesis pipeline supplies diverse and challenging data points without the scaling limitations of human curation. The dataset's design enforces validity via multi-model consensus and actively favors high-value boundary cases, which are critical for robust model learning.
Current limitations include the absence of a formal diversity metric and the unknown long-tail domain coverage in the absence of explicit proportions. The reliance on synthetic images created by Nano Banana Pro may impose representational biases tied to that generator’s capabilities.
A plausible implication is that continued refinement of both the synthesis strategies (including more advanced prompt engineering and reasoning meta-strategies) and the mixture of real and synthetic data will drive further advances in MLLM reasoning and generalization across modalities. The scaling trend suggests that larger synthetic datasets produced by this pipeline will remain useful for future models (Zhang et al., 3 Feb 2026).