- The paper introduces MAER, which regularizes the latent space using human-aligned metrics like CLIPScore and HPSv2 to enhance output fidelity.
- It incorporates an ambiguity-aware stochastic module that models multiple interpretations of prompts to generate diverse, coherent images.
- Empirical results demonstrate significant gains in automated and human evaluations, validating improved semantic alignment and output diversity.
MAR-MAER: Metric-Aware and Ambiguity-Adaptive Advances in Autoregressive Text-to-Image Generation
Introduction
Autoregressive (AR) image generation models have made considerable progress in aligning image outputs with text prompts, yet two limitations persist: their outputs do not closely track human quality assessments, and they handle semantically ambiguous prompts poorly. The paper "MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation" (2604.01864) addresses both with two architectural modifications: (1) Metric-Aware Embedding Regularization (MAER), which optimizes directly for human-aligned evaluation metrics, and (2) an Ambiguity-Aware Stochastic Module, which models the uncertainty and multiple valid interpretations of open-ended prompts. Together they improve both metric alignment and output diversity, with reported gains over previous methods on each.
Technical Contributions
Hierarchical AR Backbone and Modular Enhancements
The base framework adopts a two-stage hierarchical AR model (inspired by Hi-MAR) with independent low- and high-resolution token autoregression. Text prompts are embedded using a frozen CLIP encoder, and the image generation is decomposed into semantic layout (LR tokens) and fine-grained details (HR tokens). MAR-MAER introduces two critical modules into this pipeline:
- Metric-Aware Embedding Regularization (MAER): Instead of solely maximizing likelihood, MAER regularizes the latent embedding space via an adaptive kernel regression loss. By projecting CLIP image embeddings through an MLP and regressing against external human-aligned metrics (CLIPScore, HPSv2), the model propagates metric-based feedback throughout the generation process. Gradients from the MAER loss shape the generator to produce embeddings and thus images that are locally consistent and better reflect human quality judgments.
- Ambiguity-Aware Stochastic Module: Recognizing the multi-modal nature of prompt interpretation, this module introduces a Gaussian latent conditioned on the input prompt. By incorporating this distribution at both LR (via prefix tokens) and HR (via Feature-wise Linear Modulation) stages, the model generates diverse, coherent outputs for ambiguous textual queries. The KL-divergence regularization ensures effective utilization and semantic encoding of the latent space.
These modules operate in synergy while keeping the underlying AR architecture unaltered, thereby demonstrating the generality and plug-and-play nature of the augmentations.
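The MAER idea described above can be illustrated with a minimal sketch. The summary does not give the exact form of the adaptive kernel regression loss, so everything below is an assumption: a leave-one-out Nadaraya-Watson regressor with a Gaussian kernel, a bandwidth chosen by the median heuristic when none is supplied (standing in for "adaptive"), and a mean-squared error against the metric scores. The name `maer_loss` and its parameters are hypothetical, and NumPy is used in place of a trainable framework.

```python
import numpy as np


def maer_loss(embeddings, scores, bandwidth=None):
    """Leave-one-out kernel regression loss (illustrative, not the paper's exact loss).

    embeddings: (B, D) array of projected image embeddings, B >= 2.
    scores:     (B,) array of external metric values such as CLIPScore or HPSv2.
    bandwidth:  Gaussian kernel width; None selects the median heuristic (assumption).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n = embeddings.shape[0]

    # Pairwise squared Euclidean distances, shape (B, B).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    if bandwidth is None:
        # Median heuristic over off-diagonal pairs, floored to avoid division by zero.
        off_diag = sq_dist[~np.eye(n, dtype=bool)]
        bandwidth = max(float(np.sqrt(np.median(off_diag))), 1e-6)

    # Gaussian kernel in log space; mask the diagonal so no sample predicts itself.
    logits = -sq_dist / (2.0 * bandwidth ** 2)
    np.fill_diagonal(logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Predict each score from its neighbours and penalise the squared error.
    predicted = weights @ scores
    return float(np.mean((predicted - scores) ** 2))


# Two tight clusters of embeddings (toy data, not from the paper).
emb = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
consistent = maer_loss(emb, np.array([1.0, 1.0, 3.0, 3.0]), bandwidth=1.0)
inconsistent = maer_loss(emb, np.array([1.0, 3.0, 1.0, 3.0]), bandwidth=1.0)
print(consistent < inconsistent)
```

The loss is small when nearby embeddings carry similar metric scores and large when neighbours disagree, which is one way to read "locally consistent" in the description above. In training, a term like this would presumably be added to the AR likelihood objective with a weighting coefficient; that weighting is not specified in the summary.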
Empirical Validation
Evaluation Protocols
The empirical assessment leverages both standard and novel metrics:
- Automated Metrics:
  - FID (image quality, measured as the distance between generated and real image distributions; lower is better): MAR-MAER scores 6.3, which is competitive with state-of-the-art AR and diffusion methods.
  - CLIPScore and HPSv2 (semantic alignment, human preference): MAR-MAER improves on the strong Hi-MAR baseline by +1.6 CLIPScore and +5.3 HPSv2.
- Human Evaluation: On a purpose-built ambiguous prompt benchmark, multiple outputs per prompt are rated for semantic plausibility and diversity. MAR-MAER's outputs are rated as both more varied and more plausible than those of previous methods.
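For readers unfamiliar with CLIPScore, the following sketch shows how it is computed from a pair of embeddings, using the original definition by Hessel et al. (2021): a rescaled, clipped cosine similarity with weight 2.5. The embeddings here are toy vectors rather than real CLIP outputs, the function name `clip_score` is hypothetical, and the summary does not state which scaling the paper uses (some implementations multiply by 100 instead), so the reported +1.6 should not be read against this exact scale.

```python
import numpy as np


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore as w * max(cos(image, text), 0), following Hessel et al. (2021)."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    # Cosine similarity between the image and text embeddings.
    cos = float(image_emb @ text_emb) / float(
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    # Negative similarities are clipped to zero before rescaling.
    return w * max(cos, 0.0)


# Toy embeddings (assumed, not produced by a real CLIP encoder).
aligned = clip_score([3.0, 4.0], [3.0, 4.0])
orthogonal = clip_score([1.0, 0.0], [0.0, 1.0])
opposed = clip_score([1.0, 0.0], [-1.0, 0.0])
```

HPSv2 has no comparable closed form, since it is a learned preference model, so it is not sketched here.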
Ablation Analysis
The ablation study separates the isolated and combined effects of the MAER and ambiguity modules. Adding MAER alone produces a substantial gain in HPSv2 (from 20.3 to 22.6), confirming the value of aligning embeddings with metrics. The ambiguity module mainly contributes to diversity (+0.9), and applying both modules together yields the highest HPSv2 (25.6).
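The ablation figures can be cross-checked against the headline result with a few lines of arithmetic. The sketch below assumes that 20.3 is the Hi-MAR baseline HPSv2 and that 25.6 is the HPSv2 of the full model; the summary implies both but does not state them outright. The +0.9 figure is left out because its unit is not given.

```python
# HPSv2 values quoted in the ablation (row labels are assumptions, see lead-in).
hpsv2 = {"baseline": 20.3, "+ MAER": 22.6, "+ MAER + ambiguity": 25.6}


def gain_over_baseline(table, baseline="baseline"):
    """Return each configuration's gain over the baseline, rounded to one decimal."""
    base = table[baseline]
    return {
        name: round(value - base, 1)
        for name, value in table.items()
        if name != baseline
    }


gains = gain_over_baseline(hpsv2)
# Share of the total gain that MAER alone accounts for.
maer_share = gains["+ MAER"] / gains["+ MAER + ambiguity"]
```

Under these assumptions the full-model gain equals the +5.3 HPSv2 reported in the evaluation, and MAER alone accounts for a little under half of it.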
Handling Ambiguous Prompts
Empirical examples demonstrate the system's ability to translate high-level, vague concepts ("freedom", "sound of silence") into visually coherent yet semantically diverse outputs. This property is critical for practical human-AI interaction, given the prevalence of context-sensitive or metaphorical prompts in real-world deployments.
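The mechanism behind this diversity, a prompt-conditioned Gaussian latent with KL regularization and FiLM conditioning at the HR stage, can be sketched as follows. This is a toy illustration under stated assumptions: the prior is taken to be a standard normal, the projections `w_mu`, `w_log_var`, `w_gamma` and `w_beta` are random stand-ins for learned layers, and all function names are hypothetical. The LR-stage prefix tokens are not shown.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterised draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims."""
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))


def film(features, z, w_gamma, w_beta):
    """Feature-wise Linear Modulation: per-channel scale and shift predicted from z."""
    gamma = 1.0 + z @ w_gamma   # scale, centred on the identity
    beta = z @ w_beta           # shift
    return gamma * features + beta


rng = np.random.default_rng(0)
prompt_emb = rng.standard_normal(8)             # stand-in for a frozen CLIP text embedding
w_mu = 0.1 * rng.standard_normal((8, 4))        # hypothetical prompt-to-mean projection
w_log_var = 0.1 * rng.standard_normal((8, 4))   # hypothetical prompt-to-log-variance projection
mu, log_var = prompt_emb @ w_mu, prompt_emb @ w_log_var

hr_features = rng.standard_normal((6, 16))      # six toy HR token features, 16 channels
w_gamma = 0.1 * rng.standard_normal((4, 16))
w_beta = 0.1 * rng.standard_normal((4, 16))

# Three draws for the same prompt give three differently modulated feature maps.
variants = [
    film(hr_features, sample_latent(mu, log_var, rng), w_gamma, w_beta)
    for _ in range(3)
]
kl = kl_to_standard_normal(mu, log_var)
```

Each draw of the latent corresponds to one candidate interpretation of the prompt, while the KL term keeps the prompt-conditioned distribution from collapsing to a point or drifting far from the prior.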
Implications and Prospects
The MAR-MAER approach demonstrates that regularization added on top of an unchanged architecture, combined with targeted latent variable modeling, can address metric misalignment and semantic ambiguity in AR models. The results have several implications:
- Practical: The framework enables finer control over semantic diversity and offers outputs more aligned with direct human evaluation, which is necessary for applications in creative design, prompt-based art generation, and interactive AI systems.
- Theoretical: The findings challenge sole reliance on maximum likelihood training and instead emphasize architecture-agnostic, metric-driven objective design. The successful disentanglement of diversity and alignment suggests avenues for modular generative model engineering.
- Future Directions: Further sophistication could integrate user preference modeling, dynamic metric fusion, or cross-modal latent conditioning. There is also scope for scaling to larger model sizes and more challenging datasets. Integration of such regularizers with hybrid AR-diffusion architectures is another promising direction.
Conclusion
"MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation" (2604.01864) sets a benchmark for AR image generation by aligning internal representations with human-centric metrics and equipping models to represent prompt ambiguity explicitly. Without altering the standard AR architecture, these interventions yield marked improvements across automated and subjective metrics, supporting a more reliable, adaptable generative framework suitable for both research and practical deployment. The methodology highlights the value of embedding-level supervision and diversity-centric conditioning as broadly applicable tools in generative modeling.