Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Abstract: The explosion of visual content available online underscores the need for an accurate machine assessor that can robustly evaluate scores across diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) on a wide range of related fields, in this work we explore how to teach them to perform visual rating aligned with human opinions. Observing that human raters learn and judge only discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks under the original LMM structure. With this syllabus, we further unify the three tasks into one model, termed OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and pre-trained weights are released at https://github.com/Q-Future/Q-Align.
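The abstract describes training the LMM on discrete text-defined rating levels and then recovering a continuous score at inference. A common way to do this (and the spirit of what the abstract suggests) is to take the model's probabilities over the level tokens and compute a probability-weighted average of numeric weights assigned to each level. The sketch below assumes a five-level scale and illustrative weights; the level names, weights, and the `level_logits_to_score` helper are assumptions for illustration, not the paper's exact implementation.

```python
import math

# Illustrative five-level scale with assumed numeric weights (1-5).
LEVELS = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def level_logits_to_score(logits):
    """Convert LMM logits over the level tokens into a scalar score:
    softmax over the level-token logits, then a probability-weighted
    average of each level's numeric weight."""
    # Numerically stable softmax restricted to the level tokens.
    max_logit = max(logits.values())
    exps = {k: math.exp(v - max_logit) for k, v in logits.items()}
    total = sum(exps.values())
    probs = {k: v / total for k, v in exps.items()}
    # Weighted average of the level weights.
    return sum(probs[k] * LEVELS[k] for k in LEVELS)

# Example: logits strongly favoring "good" yield a score near 4.
score = level_logits_to_score(
    {"excellent": 1.0, "good": 3.0, "fair": 0.5, "poor": -1.0, "bad": -2.0}
)
```

At inference this only requires reading out the logits of the five level tokens at the rating position, so no regression head is added and the original LMM structure is preserved, consistent with the abstract's claim.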