Effective evaluation of unified multimodal generation

Develop effective methodologies to evaluate unified multimodal generation systems, which jointly produce images and text in response to a single prompt, so that the evaluation captures both modalities and their interaction.

Background

Unified multimodal generation requires models to reason across modalities and produce both images and text for a single query. Existing evaluations are often limited to visual question answering or text-to-image generation and therefore do not capture this joint generation setting.

The paper notes that simple LLM-as-a-judge approaches and data-independent grading can miss sample-specific subtleties, while pairwise protocols require extensive comparisons. UEval proposes rubric-based, data-dependent evaluation as a step toward addressing this gap, but explicitly states that determining how to evaluate unified multimodal generation effectively remains open.
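To illustrate the contrast the paper draws, a data-dependent evaluator attaches sample-specific criteria to each prompt rather than applying one generic judging prompt to every sample. The sketch below is a minimal, hypothetical illustration: the `Criterion` fields, the weights, and the stub judge are assumptions for demonstration, not UEval's actual rubric format or implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    # One sample-specific check, e.g. "the image shows exactly three gears".
    description: str
    weight: float

def score_sample(criteria, judge):
    # `judge` maps a criterion description to a pass/fail verdict (0 or 1);
    # in practice this would be an LLM call, here it is any callable.
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight * judge(c.description) for c in criteria)
    return earned / total if total else 0.0

# Data-dependent rubric: criteria are written per prompt, so they can
# capture sample-specific subtleties that a single generic judge prompt
# or data-independent grading scheme would miss.
rubric = [
    Criterion("text explains each step of the recipe", 0.4),
    Criterion("image depicts the finished dish", 0.4),
    Criterion("image and text agree on the ingredients", 0.2),
]

# Stub judge for demonstration: passes every criterion except the last.
verdicts = {rubric[0].description: 1, rubric[1].description: 1,
            rubric[2].description: 0}
print(score_sample(rubric, verdicts.get))  # 0.8
```

Because each rubric is generated per sample, scores remain absolute (no pairwise comparisons between model outputs are needed), while cross-modal consistency can be checked explicitly via criteria like the third one above.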

References

"How to effectively evaluate unified multimodal generation remains an open problem."

UEval: A Benchmark for Unified Multimodal Generation (2601.22155, Li et al., 29 Jan 2026), Section: Rubric Generation and Evaluation