MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation

Published 2 Apr 2026 in cs.CV | (2604.01864v1)

Abstract: Autoregressive (AR) models have demonstrated significant success in text-to-image generation. However, they face two major challenges. First, the generated images may not always meet the quality standards expected by humans. Second, these models have difficulty with ambiguous prompts that could be interpreted in several valid ways. To address these issues, we introduce MAR-MAER, a hierarchical autoregressive framework that combines two main components. The first is a metric-aware embedding regularization method. The second is a probabilistic latent model for handling ambiguous semantics. Our method uses a lightweight projection head trained with an adaptive kernel regression loss, which aligns the model's internal representations with human-preferred quality metrics such as CLIPScore and HPSv2. As a result, the learned embedding space more accurately reflects human judgment. We also introduce a conditional variational module that adds controlled randomness to the hierarchical token generation process, allowing the model to produce a diverse array of coherent images from ambiguous or open-ended prompts. We conducted extensive experiments on COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER performs well in both metric consistency and semantic flexibility. It exceeds the baseline Hi-MAR model by +1.6 in CLIPScore and +5.3 in HPSv2, and for ambiguous inputs it produces a notably wider range of outputs. These findings are confirmed through both human evaluation and automated metrics.

Authors (2)

Summary

  • The paper introduces MAER, which regularizes the latent space using human-aligned metrics like CLIPScore and HPSv2 to enhance output fidelity.
  • It incorporates an ambiguity-aware stochastic module that models multiple interpretations of prompts to generate diverse, coherent images.
  • Empirical results demonstrate significant gains in automated and human evaluations, validating improved semantic alignment and output diversity.

MAR-MAER: Metric-Aware and Ambiguity-Adaptive Advances in Autoregressive Text-to-Image Generation

Introduction

Autoregressive (AR) image generation models have made considerable progress in aligning image outputs with text prompts, yet two limitations persist: their outputs do not always match human quality assessments, and they handle semantically ambiguous prompts poorly. The paper "MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation" (2604.01864) addresses both with two additions to the model: (1) Metric-Aware Embedding Regularization (MAER), which directly optimizes for human-aligned evaluation metrics, and (2) an Ambiguity-Aware Stochastic Module, which models the uncertainty and multiple valid interpretations of open-ended prompts. Together, these additions improve both metric alignment and output diversity, with reported gains of +1.6 CLIPScore and +5.3 HPSv2 over the Hi-MAR baseline.

Technical Contributions

Hierarchical AR Backbone and Modular Enhancements

The base framework adopts a two-stage hierarchical AR model (inspired by Hi-MAR) with independent low- and high-resolution token autoregression. Text prompts are embedded using a frozen CLIP encoder, and the image generation is decomposed into semantic layout (LR tokens) and fine-grained details (HR tokens). MAR-MAER introduces two critical modules into this pipeline:

  • Metric-Aware Embedding Regularization (MAER): Instead of solely maximizing likelihood, MAER regularizes the latent embedding space via an adaptive kernel regression loss. By projecting CLIP image embeddings through an MLP and regressing against external human-aligned metrics (CLIPScore, HPSv2), the model propagates metric-based feedback throughout the generation process. Gradients from the MAER loss shape the generator to produce embeddings and thus images that are locally consistent and better reflect human quality judgments.
  • Ambiguity-Aware Stochastic Module: Recognizing the multi-modal nature of prompt interpretation, this module introduces a Gaussian latent conditioned on the input prompt. By incorporating this distribution at both LR (via prefix tokens) and HR (via Feature-wise Linear Modulation) stages, the model generates diverse, coherent outputs for ambiguous textual queries. The KL-divergence regularization ensures effective utilization and semantic encoding of the latent space.
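The MAER bullet above describes regressing projected embeddings against metric scores with an adaptive kernel regression loss. This summary does not give the paper's exact formulation, so the following is only a minimal sketch of one plausible form: a leave-one-out Nadaraya-Watson estimator with a Gaussian kernel whose bandwidth adapts to the batch via the median heuristic. The function names, the bandwidth rule, and the toy data are all assumptions made for illustration, and plain Python lists stand in for the projection head's outputs.

```python
import math
from statistics import median


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kernel_regression_loss(embeddings, scores):
    """Leave-one-out Nadaraya-Watson regression loss (illustrative, not the paper's code).

    Each sample's metric score is predicted from the scores of the other
    samples, weighted by a Gaussian kernel in embedding space. The loss is
    low when nearby embeddings carry similar metric scores.
    """
    n = len(embeddings)
    d2 = [[sq_dist(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    # Assumed "adaptive" rule: bandwidth = median pairwise squared distance in the batch.
    off_diag = [d2[i][j] for i in range(n) for j in range(n) if i != j]
    h2 = max(median(off_diag), 1e-8)
    loss = 0.0
    for i in range(n):
        weights = [0.0 if j == i else math.exp(-d2[i][j] / (2.0 * h2)) for j in range(n)]
        total = max(sum(weights), 1e-12)  # guard against underflow for far outliers
        pred = sum(w * y for w, y in zip(weights, scores)) / total
        loss += (pred - scores[i]) ** 2
    return loss / n


# Toy batch: two tight clusters of hypothetical projected embeddings.
batch = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
aligned = [0.2, 0.2, 0.9, 0.9]  # scores agree within each cluster
mixed = [0.2, 0.9, 0.2, 0.9]    # scores disagree within each cluster
print(kernel_regression_loss(batch, aligned) < kernel_regression_loss(batch, mixed))
```

In a real training loop this quantity would be computed on differentiable tensors so that its gradient can reach the projection head and generator; the pure-Python version only shows which embedding geometries the loss rewards.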

These modules operate in synergy while keeping the underlying AR architecture unaltered, thereby demonstrating the generality and plug-and-play nature of the augmentations.
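The ambiguity module described above combines three standard pieces: sampling a Gaussian latent, a KL regularizer, and Feature-wise Linear Modulation (FiLM) of the HR features. A minimal sketch of those pieces follows. It assumes a standard normal prior and tiny hand-written linear maps for the FiLM parameters; the summary does not specify the paper's prior, network shapes, or how the prefix tokens for the LR stage are built, so every name and number here is hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), assuming a standard normal prior."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def linear(weights, bias, x):
    """Dense layer y = W x + b on plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """FiLM: scale and shift each feature channel."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


rng = random.Random(0)
# Hypothetical outputs of a prompt-conditioned encoder (2-dimensional latent).
mu, log_var = [0.5, -0.5], [-1.0, -1.0]
z = sample_latent(mu, log_var, rng)

# Hypothetical FiLM generators mapping z to per-channel scale and shift.
W_gamma, b_gamma = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]], [0.0, 0.0, 0.0]
W_beta, b_beta = [[0.2, 0.0], [0.0, 0.2], [0.0, 0.0]], [0.0, 0.0, 0.0]
gamma = [1.0 + g for g in linear(W_gamma, b_gamma, z)]  # scale centered at identity
beta = linear(W_beta, b_beta, z)

hr_features = [1.0, 2.0, 3.0]  # stand-in for one HR-stage feature vector
modulated = film(hr_features, gamma, beta)
print(modulated, kl_to_standard_normal(mu, log_var))
```

Drawing a different `z` for the same prompt changes `gamma` and `beta`, which is how one prompt can yield several distinct outputs, while the KL term keeps the latent distribution close to the prior.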

Empirical Validation

Evaluation Protocols

The empirical assessment leverages both standard and novel metrics:

  • Automated Metrics:
    • FID (image quality/divergence): MAR-MAER (6.3) is competitive with state-of-the-art AR and diffusion methods.
    • CLIPScore and HPSv2 (semantic alignment, human preference): MAR-MAER improves on the strong Hi-MAR baseline by +1.6 CLIPScore and +5.3 HPSv2.
  • Human Evaluation: On a purpose-built ambiguous prompt benchmark, multiple outputs per prompt are rated for semantic plausibility and diversity. MAR-MAER's outputs are rated as both more varied and more plausible than those of previous methods.
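For readers unfamiliar with CLIPScore, the sketch below computes it from a pair of embeddings using the standard definition, a rescaled and clipped cosine similarity with weight w = 2.5. The embeddings here are toy placeholders rather than real CLIP outputs, and the summary does not state which scale the paper uses when it reports the +1.6 gain, so the numbers are not comparable to the figures above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def clip_score(text_emb, image_emb, w=2.5):
    """CLIPScore = w * max(cos(text, image), 0)."""
    return w * max(cosine(text_emb, image_emb), 0.0)


# Toy placeholder embeddings; real ones would come from a CLIP text and image encoder.
text_emb = [0.6, 0.8, 0.0]
good_image = [0.6, 0.7, 0.1]
poor_image = [-0.5, 0.1, 0.9]
print(clip_score(text_emb, good_image) > clip_score(text_emb, poor_image))
```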

Ablation Analysis

The ablation study separates the isolated and combined effects of the MAER and ambiguity modules. Adding MAER alone raises HPSv2 from 20.3 to 22.6, confirming that embedding-level metric alignment is effective. The ambiguity module alone mainly contributes to diversity (+0.9). Combining the two modules yields the peak score (25.6), which matches the reported +5.3 HPSv2 gain over the 20.3 baseline.
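The ablation figures can be checked with a few lines of arithmetic. The snippet assumes that 25.6 is the HPSv2 score of the full model, which the text implies but does not state outright; the variable names are illustrative.

```python
baseline_hpsv2 = 20.3    # Hi-MAR baseline
maer_only_hpsv2 = 22.6   # baseline + MAER
full_model_score = 25.6  # MAER + ambiguity module (assumed to be HPSv2)

maer_gain = round(maer_only_hpsv2 - baseline_hpsv2, 1)
full_gain = round(full_model_score - baseline_hpsv2, 1)
gain_beyond_maer = round(full_gain - maer_gain, 1)

print(maer_gain, full_gain, gain_beyond_maer)  # prints: 2.3 5.3 3.0
```

Under that assumption, MAER alone accounts for 2.3 of the 5.3 points, and the remaining 3.0 points appear only when both modules are used together.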

Handling Ambiguous Prompts

Empirical examples demonstrate the system's ability to translate high-level, vague concepts ("freedom", "sound of silence") into visually coherent yet semantically diverse outputs. This property is critical for practical human-AI interaction, given the prevalence of context-sensitive or metaphorical prompts in real-world deployments.

Implications and Prospects

The MAR-MAER approach demonstrates that regularization applied on top of an unchanged architecture, combined with targeted latent variable modeling, can address metric misalignment and semantic ambiguity in AR models. These results have several implications:

  • Practical: The framework enables finer control over semantic diversity and offers outputs more aligned with direct human evaluation, which is necessary for applications in creative design, prompt-based art generation, and interactive AI systems.
  • Theoretical: The findings challenge the reliance on maximum likelihood training alone, emphasizing instead the importance of architecture-agnostic, metric-driven objective design. The successful disentanglement of diversity and alignment suggests avenues for modular generative model engineering.
  • Future Directions: Further sophistication could integrate user preference modeling, dynamic metric fusion, or cross-modal latent conditioning. There is also scope for scaling to larger model sizes and more challenging datasets. Integration of such regularizers with hybrid AR-diffusion architectures is another promising direction.

Conclusion

"MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation" (2604.01864) advances AR image generation by aligning internal representations with human-centric metrics and equipping models to handle prompt ambiguity. Without altering the standard AR architecture, these interventions yield improvements across automated and subjective metrics (+1.6 CLIPScore and +5.3 HPSv2 over Hi-MAR), supporting a more reliable, adaptable generative framework for both research and practical deployment. The methodology highlights the value of embedding-level supervision and diversity-centric conditioning as broadly applicable tools in generative modeling.
