Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Abstract: The explosion of visual content available online underscores the need for an accurate machine assessor that can robustly evaluate scores across diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) on a wide range of related fields, in this work we explore how to teach them to perform visual rating aligned with human opinions. Observing that human raters learn and judge only discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks under the original LMM structure. With this syllabus, we further unify the three tasks into one model, termed OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and pre-trained weights are released at https://github.com/Q-Future/Q-Align.
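The abstract describes training the LMM on discrete text-defined rating levels and then recovering a continuous score at inference. A common way to do this (and the spirit of what the abstract suggests) is to take the model's probabilities over the level tokens and compute a probability-weighted average of numeric weights assigned to each level. The sketch below assumes a five-level scale and illustrative weights; the level names, weights, and the `level_logits_to_score` helper are assumptions for illustration, not the paper's exact implementation.

```python
import math

# Illustrative five-level scale with assumed numeric weights (1-5).
LEVELS = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def level_logits_to_score(logits):
    """Convert LMM logits over the level tokens into a scalar score:
    softmax over the level-token logits, then a probability-weighted
    average of each level's numeric weight."""
    # Numerically stable softmax restricted to the level tokens.
    max_logit = max(logits.values())
    exps = {k: math.exp(v - max_logit) for k, v in logits.items()}
    total = sum(exps.values())
    probs = {k: v / total for k, v in exps.items()}
    # Weighted average of the level weights.
    return sum(probs[k] * LEVELS[k] for k in LEVELS)

# Example: logits strongly favoring "good" yield a score near 4.
score = level_logits_to_score(
    {"excellent": 1.0, "good": 3.0, "fair": 0.5, "poor": -1.0, "bad": -2.0}
)
```

At inference this only requires reading out the logits of the five level tokens at the rating position, so no regression head is added and the original LMM structure is preserved, consistent with the abstract's claim.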