
Multimodal Large Language Models as Image Classifiers

Published 6 Mar 2026 in cs.CV | (2603.06578v1)

Abstract: Multimodal LLM (MLLM) classification performance depends critically on evaluation protocol and ground-truth quality. Studies comparing MLLMs with supervised models and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluation on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocols rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.

Summary

  • The paper reveals that discrepancies in MLLM performance stem from evaluation protocols and label noise, rather than intrinsic model limitations.
  • It evaluates five advanced MLLMs using open-world, multiple-choice, and modified closed-world paradigms with a novel multilabel reannotation benchmark (ReGT).
  • Improvements in ground truth quality yield significant accuracy gains, underscoring MLLMs' potential for robust image classification and dataset curation.

Multimodal LLMs as Image Classifiers

Introduction

The paper "Multimodal LLMs as Image Classifiers" (2603.06578) presents a comprehensive analysis of the classification capabilities of Multimodal LLMs (MLLMs) on standardized computer vision benchmarks, notably ImageNet-1k. The study addresses a pervasive issue in recent literature: contradictory reports on MLLM performance relative to vision-only models and vision-language models (VLMs), identifying these discrepancies as artifacts of evaluation protocol design and label quality rather than intrinsic model limitations. By systematically evaluating five state-of-the-art MLLMs, including both closed- and open-source models, across multiple classification paradigms, and by introducing a novel multilabel reannotation (ReGT) of ImageNet-1k, the paper provides robust insight into MLLM behavior, task sensitivity, and the impact of label noise.

Figure 1: Challenging visual recognition cases from ImageNet-1k, highlighting annotation ambiguities that affect model scoring.

Classification Task Protocols

The paper formalizes three main evaluation paradigms for MLLM image classification:

  • Open-World (OW): The model generates a free-form label, which is mapped to the class set via embedding-space nearest neighbor search. OW closely mirrors real-world deployment but has previously suffered from poor mapping strategies.
  • Multiple-Choice (MC): The model selects from a limited set of candidate class names, with distractors strategically sampled to probe robustness. Prior studies reported minimal impact from hard distractors, while the current analysis demonstrates up to a 15% accuracy drop for even the newest MLLMs under rigorous distractor selection.
  • Closed-World (CW): The model is prompted with the exhaustive class list, mimicking conventional supervised evaluation. Previous studies avoided this setup due to input token length and frequent out-of-prompt (OOP) predictions. The authors introduce CW+, an embedding-based post-processing step to resolve OOP cases efficiently.

    Figure 2: Overview of the OW, MC, and CW(+) classification paradigms with task-specific input/output constraints.
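The embedding-based mapping shared by OW and CW+ can be sketched as follows. The paper maps free-form outputs with a strong text encoder; the character-trigram `embed` below is only a self-contained stand-in for such an encoder, and the class names are illustrative:

```python
# Sketch of the nearest-neighbor label mapping used in the OW and CW+
# protocols: a free-form model output is embedded and assigned to the
# most similar class name instead of being discarded. A real system
# would use a strong text encoder; `embed` here is a toy stand-in.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: counts of character trigrams in the lowercased text.
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_to_class(free_form: str, class_names: list[str]) -> str:
    # Nearest neighbor in embedding space resolves out-of-prompt outputs.
    query = embed(free_form)
    return max(class_names, key=lambda c: cosine(query, embed(c)))

classes = ["laptop", "desktop computer", "tabby cat"]
print(map_to_class("a shiny laptop", classes))  # maps the free-form answer to "laptop"
```

In CW+, the same mapping is applied only to outputs that fall outside the prompted class list; in OW it is applied to every generated label.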

Impact of Ground Truth Quality

Ground truth (GT) quality is shown to be the dominant factor in MLLM performance assessment. ImageNet-1k, despite its reputation, suffers from significant annotation noise: a 20% label error rate, 40% multi-object images, substantial semantic ambiguity, and distribution shifts. The authors address this by reannotating 625 classes with multilabel assignments, creating the ReGT benchmark. Models less anchored to supervised signals, such as MLLMs and VLMs, benefit most from improved GT (+6–11% accuracy, compared to <2% for supervised models), substantially narrowing the gap with vision-only baselines.

Figure 3: Accuracy change from official ImageNet (ImGT) to reannotated labels (ReGT), demonstrating model family sensitivity to annotation improvements.

Results and Observations

Performance analysis across ReGT and ImGT reveals:

  • Supervised Models: EVA-02 and EfficientNet-L2/V2 retain top accuracy on single-label images but benefit minimally from reannotation, suggesting strong anchoring to the original labels.
  • MLLMs and VLMs: GPT-4o, Qwen3-VL, InternVL3.5, LLaVA-OV, and PaliGemma 2 show the largest gains from ReGT, indicating robustness to semantic corrections and annotation noise. Notably, on subsets with ImGT–ReGT disagreement, VLMs and MLLMs outperform supervised models.
  • Evaluation Protocol Sensitivity: MC setups with randomly sampled distractors inflate scores. Protocol variants (batch size, image ordering, text encoder) induce up to 10% variation in reported accuracy, undermining prior claims made under less stringent settings.
  • OW vs. CW: With improved mapping in OW and CW+ post-processing, OW outperforms CW for several MLLMs, contradicting previous literature. OOP prediction rates correlate with mapping-induced gains.
  • Annotation Assistance: In a controlled study, annotators preferred or combined MLLM predictions with ReGT labels in ~50% of disputed cases, underscoring MLLM utility for dataset curation.

    Figure 4: Mapping OOP predictions in CW(+) via embedding-space nearest neighbor resolves misaligned outputs, improving accuracy especially in ambiguous cases.

Robustness to Fine-Grained Annotation

Annotations for fine-grained animal classes (e.g., Mustelidae, the weasel family) were performed by domain experts. The results show that even highly capable supervised models exhibit substantial drops in recall on expert-corrected labels, while MLLMs and VLMs realize strong gains. This further supports the hypothesis that MLLMs and VLMs are more resilient to annotation imperfections, particularly near ambiguous class boundaries.

Figure 5: Representative examples from the weasel-family annotation study highlighting difficulty in species-specific recognition.

Annotation Case Study

A targeted case study explored the role of MLLMs in human labeling workflows. When presented with images where GPT-4o and ReGT disagreed, annotators replaced or augmented the ground truth with MLLM predictions in approximately half of cases, demonstrating substantial residual noise even after expert curation and highlighting the practical synergy between human and model annotation.

Figure 6: Annotation pipeline visualization, showcasing scenarios where human annotators selected GPT-4o predictions over previously established labels.

Error and Correlation Structure

Cross-model correlation analysis reveals divergent error patterns among supervised, self-supervised (VLM), and MLLM families. MLLMs cluster separately in both per-class recall and image-level correctness, confirming that their error modes differ fundamentally from vision-centric models.

Figure 7: Correlation matrices comparing model family behavior over reannotated labels, revealing weak alignment between MLLMs and vision-only models.

Practical and Theoretical Implications

The study demonstrates that perceived shortcomings of MLLMs in image classification often stem from flawed protocols and label noise, rather than architectural deficiencies. Embedding-based mapping in OW and CW+ setups enables competitive or superior performance relative to vision-language baselines, especially as GT quality improves. The findings signal that MLLMs can be reliably leveraged for large-scale annotation and dataset construction, and suggest that future benchmarks should prioritize semantic clarity and multilabel protocols.

Conclusion

This paper establishes that MLLM classification accuracy is highly contingent on evaluation protocol and label quality. Strong numerical results confirm that state-of-the-art MLLMs can match or surpass vision-language and supervised models given accurate GT and principled protocol design. The research highlights the imperative for cleaner datasets, better mapping strategies, and integration of MLLMs into annotation workflows as assistants. Theoretically, this work motivates further exploration of embedding alignment, task formulation sensitivity, and robustness to annotation imperfections in multimodal models, with clear implications for both benchmarking and real-world deployment.

Explain it Like I'm 14

What is this paper about?

This paper looks at how well multimodal LLMs (MLLMs)—AI systems that can understand both pictures and text—can recognize what’s in images. The authors discover that the way we test these models, and how correct the “answer key” is, makes a huge difference. They also fix several testing problems and show that many past comparisons were unfair, which made MLLMs look worse than they really are.

What questions did the researchers ask?

In simple terms, they wanted to know:

  • Are we testing image-recognition AIs in a fair and accurate way?
  • How much do testing choices (like question style) change the results?
  • Do mistakes in the “ground truth” (the official correct labels) make models look bad when they’re actually right?
  • Can MLLMs help humans clean up and improve big image datasets?

How did they study it?

The team tested five well-known MLLMs (including GPT-4o and top open-source models) and compared them to traditional vision models. They ran the same image classification task in three different test styles, then improved the label quality and re-ran the tests.

Three ways to test image recognition

Think of testing like a quiz about what’s in a picture:

  • Open-World (OW): The model gives any answer it wants (like a short description). Then, the system matches that answer to the closest official label by comparing the meaning of words. You can think of this like matching “notebook” to “laptop” because they mean almost the same thing.
  • Multiple-Choice (MC): The model chooses from four options (one correct, three “distractors”). This is the classic school-style quiz.
  • Closed-World (CW): The model must pick from the full list of 1,000 class names. That’s like giving it the entire answer list and saying “choose exactly one from here.” A common problem is that models sometimes ignore the list and write something else. The authors fix this with CW+, which automatically maps any “off-list” answer to the closest allowed class (using the same meaning-matching trick as in OW).

Fixing the answer key (better labels)

They noticed ImageNet, a famous picture dataset, often has:

  • Wrong labels (about 1 in 5 in some studies),
  • Images with multiple objects but only one label,
  • Overlapping or confusing class names.

So the authors re-labeled a large part of ImageNet (625 out of 1,000 classes; 31,250 images total) to create a cleaner, more realistic “answer key” called ReGT. This allows fairer testing—especially on pictures that contain more than one object.

Other testing choices they examined

They also measured how much results change based on things people usually overlook, like:

  • How many images are processed at once (batch size),
  • The order of images,
  • Which text encoder they use to compare meanings (important for OW and CW+),
  • How distractor choices are picked in multiple choice.

These seemingly small choices can change accuracy a lot.
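As an illustration of the distractor-picking point above, here is a small sketch that ranks candidate wrong answers by how often a reference model confuses them with the true class. The paper derives these counts from EVA-02's confusion matrix; the class names and counts below are invented for illustration:

```python
# Sketch of hard-distractor selection for a multiple-choice quiz:
# candidate wrong answers are ranked by how often a reference model
# confuses them with the true class. The confusion counts below are
# invented; the paper uses the confusion matrix of EVA-02.
import random

def hard_distractors(true_class: str, confusion: dict, k: int = 3) -> list:
    # The k most-confused classes are the hardest distractors.
    rivals = sorted(
        (c for c in confusion[true_class] if c != true_class),
        key=lambda c: confusion[true_class][c],
        reverse=True,
    )
    return rivals[:k]

# Invented counts: the "weasel" row says how often weasel images were
# predicted as each class by the reference model.
confusion = {"weasel": {"weasel": 620, "mink": 180, "polecat": 140,
                        "otter": 40, "beaver": 20}}

options = hard_distractors("weasel", confusion) + ["weasel"]
random.shuffle(options)  # vary the position of the correct answer
print(sorted(options))
```

Random distractors (say, "weasel" vs. "school bus") make the quiz far easier than these confusion-derived ones, which is exactly how weak MC setups inflate scores.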

What did they find?

Here are the key findings, explained simply:

  • The test format matters—a lot:
    • Multiple-Choice (MC) can make models look better than they really are, especially if the wrong options (distractors) are easy or random. When the authors used harder, more realistic distractors, accuracy dropped by about 10–15%.
    • Open-World (OW) can be very strong if you use a good meaning-matching step to map free-form answers to the official class names. For about half the models they tested, OW performed as well as or better than CW.
    • Closed-World+ (CW+) fixes a big problem: when models answer with something not on the list (“off-list” or “hallucinated”), CW+ smartly maps that answer to the closest allowed class. This boosts accuracy and makes CW fairer.
  • Better labels make MLLMs look much better:
    • With the improved labels (ReGT), MLLMs gained up to +10.8% accuracy. That’s a big jump.
    • The gap between MLLMs and top traditional models (trained specifically for image tasks) shrinks a lot when labels are cleaned up. In many "disputed" images where the original label was questionable, MLLMs and vision-language models actually did better than standard supervised models.
  • Design choices change scores:
    • Things like batch size, image order, output format, and which text encoder you use can noticeably change results. This means some past studies may have unintentionally given inflated or deflated scores.
  • MLLMs can help humans fix datasets:
    • In a focused study of tough cases, human annotators agreed with or used the MLLM’s suggestion about half the time. This shows MLLMs can be powerful assistants for cleaning and improving large image datasets.

Why does this matter?

  • Fairer evaluations: The study shows that many “MLLMs are bad at classification” claims came from flawed tests or noisy labels—not necessarily weak models. With better testing and cleaner labels, MLLMs perform much closer to traditional top models.
  • Better real-world reliability: If you’re using AI to recognize things in important settings (like nature safety, medicine, or everyday tools), you want tests that reflect reality. This paper shows how to set up those tests so they’re fair and trustworthy.
  • Smarter data curation: Since MLLMs can help spot bad or missing labels, they can speed up the creation of better training data, which helps all future models.
  • Practical takeaway: If you build or evaluate AI image classifiers, don’t just pick an easy multiple-choice setup with random distractors. Use careful mapping for free-form answers, consider CW+, and make sure your “answer key” is correct. Small choices can change the score—and the conclusions—by a lot.

In short, the paper argues that MLLMs are stronger image classifiers than many people think, as long as we test them the right way and use accurate labels.

Knowledge Gaps

Below is a concise, actionable list of the knowledge gaps, limitations, and open questions that remain unresolved by the paper.

  • Limited scope of reannotation (ReGT): only 625/1000 ImageNet-1k classes were relabeled, deliberately excluding many fine-grained categories; it is unclear whether conclusions hold for the remaining 375 classes or for other fine-grained domains.
  • ReGT availability and reproducibility: ReGT is “not yet public,” preventing independent verification, replication, and benchmarking; timelines and release protocols (including class-equivalence sets E, prompts, and code) are unspecified.
  • Potential annotation bias: annotators were shown top-20 model predictions during labeling; the magnitude and direction of this bias (confirmation effects, class prevalence shifts) were not quantified via a blinded control or A/B design.
  • Annotation reliability not reported: inter-annotator agreement (e.g., Cohen’s kappa, Krippendorff’s alpha), adjudication protocols, and second-pass QC procedures beyond the case study are not provided.
  • Evaluation choice for “N” images: treating all predictions as correct when an image has no valid ImageNet label (N category) can inflate accuracy; alternative metrics (e.g., “none-of-the-above” detection, abstention-aware scoring) are not explored.
  • Completeness of semantic equivalence sets (E): the coverage, construction criteria, and validation of E (class synonyms/aliases) are not detailed; their impact on accuracy and model rankings is unquantified.
  • Closed-World mapping design (CW+): reliance on nearest-neighbor mapping in a chosen text-embedding space is under-specified; sensitivity to encoder choice, normalization, and distance metrics needs systematic analysis and standardization.
  • Alternatives to CW+ mapping: constrained decoding (e.g., grammars, function-calling), logit-biasing, or vocabulary-constrained decoding were not compared to embedding-based mapping for reducing out-of-prompt (OOP) outputs.
  • Encoder selection fairness: “best model-specific encoder” is used for mapping but selection criteria and potential cherry-picking risks are unclear; cross-encoder comparisons and a fixed, model-agnostic encoder baseline are needed.
  • Prompt sensitivity and decoding parameters: the influence of system prompts, few-shot examples, temperature/top_p, and response format constraints on CW/OW performance and OOP rates is not systematically quantified.
  • Batch/ordering effects: while noted to be significant, the mechanisms and generalizability of batch size, image order, and class-homogeneous vs. mixed batches across APIs and models remain underexplored and lack standardized controls.
  • Multiple-choice (MC) protocol standardization: only 4-way MC is studied, with distractors from random or EVA-02 confusion; scaling to larger N, model-agnostic/human-curated/adversarial distractors, and super-class or synonym-aware hard negatives is left open.
  • Distractor-source dependence: using EVA-02’s confusion matrix may bias hardness towards vision-only supervised errors; it is unknown whether MLLM-specific confusion-derived distractors would change conclusions.
  • OW vs. CW behavior: OW outperforms CW for some models with embedding mapping, but the underlying reasons (e.g., generative priors vs. selection bias) and when to prefer OW vs. CW remain unclear.
  • Multi-label evaluation: the study largely enforces single-label outputs for datasets with substantial multilabel images; explicit multi-label prediction objectives, metrics (mAP, F1), and instruction strategies are not investigated.
  • Abstention and “unknown” handling: there is no evaluation of calibrated abstention or “none-of-the-above” options, despite frequent out-of-taxonomy content; methods to elicit calibrated uncertainty from MLLMs are absent.
  • Cross-dataset generalization: results are restricted to ImageNet-1k (plus a niche biological subset in Supplementary); performance under distribution shift (ImageNet-V2, -R, -A, -Sketch, ObjectNet) is not reported.
  • Fine-grained taxonomy coverage: many fine-grained animal categories were excluded from ReGT; how MLLMs fare on expert-verified fine-grained datasets and whether CW+/OW mapping scales to dense taxonomies is unaddressed.
  • Class name quality and variants: dependence on a single hand-crafted class-name list (with known imperfections) is acknowledged, but the effect of alternative labels, paraphrases, hierarchical names, or learned templates is not quantified.
  • OOP diagnostics and mitigation: beyond rate reporting, the causes of OOP (prompt ambiguity, token limits, training distribution) and robust mitigations (structured outputs, grammar constraints, self-checking) need deeper evaluation.
  • Output calibration and confidence: no assessment of probability calibration, score-based ranking, or selective classification; top-k accuracy and confidence-accuracy alignment are not analyzed.
  • Case-study limitations: the second-pass annotation examined only GPT-4o disagreements, focuses on single-label cases, and lacks sample-size details and inter-annotator agreement; generality to other MLLMs is unknown.
  • Token-length constraints and scaling: some models cannot accommodate 1,000-class CW prompts; chunked prompting, retrieval-augmented class selection, or hierarchical narrowing strategies were not evaluated.
  • Cost, latency, and practicality: throughput, API variability, rate-limiting, and monetary cost of large-scale MLLM classification (and sensitivity of results to these factors) are not reported.
  • Risk of data leakage: while ReGT is claimed unseen, no explicit contamination checks (e.g., web-scale pretraining overlaps, near-duplicate detection) are presented for models trained on internet-scale corpora.
  • Error taxonomy and per-class analysis: the paper highlights that ImGT vs. ReGT correct sets differ but does not provide a detailed error taxonomy (e.g., hypernym/hyponym confusions, context-induced errors) to guide targeted improvements.
  • Multilinguality: all evaluations use English labels; MLLM/VLM performance with multilingual class names and cross-lingual mapping in CW+/OW remains unexplored.
  • Beyond ImageNet label-space constraints: the paper notes issues when the dominant object is out-of-vocabulary; structured open-world protocols (e.g., hierarchical or expandable taxonomies) are not developed or benchmarked.

Practical Applications

Below is an overview of practical applications that follow from the paper’s findings, methods, and innovations. Items are grouped into immediate (deployable now) and longer-term (requiring further R&D, scaling, or standardization). Each item notes likely sectors, candidate tools/workflows, and key assumptions/dependencies.

Immediate Applications

  • CW+ semantic mapping adapter for MLLMs
    • What: Wrap any MLLM’s free-form output and map it to a fixed class list via nearest-neighbor search in a robust text-embedding space (model-specific encoder), reducing out-of-prompt (OOP) errors when using MLLMs as classifiers.
    • Sectors: software, robotics, retail/e-commerce (catalog tagging), media/content moderation, manufacturing QA.
    • Tools/workflows: lightweight library/API that takes (image, class list) → MLLM text → text-embedding → nearest label; synonym/equivalence tables; fallback “unknown” option; batch evaluation mode.
    • Assumptions/dependencies: access to a strong text encoder (ideally model-specific); curated class lists and synonym groups; reliable prompting; compute budget for embedding.
  • Open-World (OW) to taxonomy bridge for user-facing apps
    • What: Use the OW setup plus embedding-based mapping to translate free-form descriptions to internal taxonomies/codes.
    • Sectors: photo apps/DAM, e-commerce tagging, newsroom/media archiving.
    • Tools/workflows: OW prompting templates; embedding-based mapper; thresholds for multi-label assignment and “no match.”
    • Assumptions/dependencies: taxonomy coverage and maintenance; tuned thresholds; handling ambiguous/long-tail items.
  • Evaluation harness that avoids inflated MC scores
    • What: Replace MC tests with weak distractors by either CW+ or OW+mapping; if MC is unavoidable, generate hard distractors (confusion-matrix or semantically close distractors) and report confidence intervals.
    • Sectors: enterprise AI procurement, benchmark providers, academic labs.
    • Tools/workflows: MC-hard distractor generator; CW+/OW scripts; CI-ready evaluation configs; reporting templates with CIs.
    • Assumptions/dependencies: access to a strong supervised model (e.g., EVA-02) to derive confusion-based distractors; standardized prompt templates.
  • Reproducible evaluation checklists and configs
    • What: Standardize batch size, image ordering, prompt design, and text-encoder choice—all shown to significantly affect accuracy.
    • Sectors: academia, ML platform teams, model governance/compliance.
    • Tools/workflows: versioned evaluation configs; MLOps pipelines with fixed seeds; automated variance reporting across runs.
    • Assumptions/dependencies: organizational buy-in to standardize; stable model APIs.
  • MLLM-in-the-loop dataset curation
    • What: Deploy MLLMs to flag potential label errors, suggest added/missing labels, and assist annotators—shown to meaningfully improve labels in ~50% of difficult cases.
    • Sectors: dataset providers, e-commerce, medical imaging research, biodiversity projects, AV/perception teams.
    • Tools/workflows: integration with CVAT/Label Studio; reviewer UIs showing top-20 predictions; multilabel support; disagreement triage queues.
    • Assumptions/dependencies: human oversight (“verify not replace”); privacy/compliance for images; domain-expert availability for fine-grained classes.
  • Multilabel-aware metrics and synonym handling
    • What: Adopt a ReaL-style metric that accepts any correct label in a multilabel set and incorporates synonym/equivalence groups to avoid penalizing naming variants.
    • Sectors: benchmark design, search/retrieval systems, QA for content platforms.
    • Tools/workflows: metrics library that supports S/M/N subsets (single/multi/no-label), synonyms E, and intersection-of-correctness analysis.
    • Assumptions/dependencies: curated synonym/equivalence lists; taxonomy updates over time.
  • Performance slicing dashboards using label categories
    • What: Partition evaluation into S+, S−, M+, M−, N categories to diagnose failure modes (e.g., OOP spikes on multilabel or out-of-scope images).
    • Sectors: MLOps, model risk teams, enterprise QA.
    • Tools/workflows: dashboards that ingest predictions and labels, compute category-wise metrics, and visualize OOP rates and mapping corrections.
    • Assumptions/dependencies: availability of multilabel ground truth and “no-label” flags; logging and governance.
  • Safer deployment guidance for sensitive use-cases
    • What: Apply conservative evaluation (CW+/OW), hard MC when required, and human verification gates before using MLLMs for safety-critical tasks (e.g., medical triage, dangerous species ID).
    • Sectors: healthcare, conservation/ecology, industrial inspection.
    • Tools/workflows: risk checklists; uncertainty thresholds; mandated human-in-the-loop for edge cases; audit trails.
    • Assumptions/dependencies: risk appetite; regulatory constraints; qualified reviewers.
  • Rapid relabeling playbook for in-house datasets
    • What: Replicate the paper’s reannotation principles—multilabel, ambiguity-aware, and “no valid label” tagging—on internal datasets to reduce label noise and shift reported performance closer to reality.
    • Sectors: enterprises with proprietary vision datasets (retail, manufacturing, logistics).
    • Tools/workflows: annotation guidelines; class-equivalence catalog; sampling policies; dual-pass verification for disputed items.
    • Assumptions/dependencies: annotation budget; acceptance of model-assisted labeling bias trade-offs; domain experts for fine-grained categories.
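The multilabel-aware scoring described in the list above (accept any label in the image's label set, after applying synonym/equivalence groups) can be sketched as follows. The equivalence groups and examples are illustrative, not the paper's actual sets E:

```python
# Sketch of ReaL-style multilabel scoring: a prediction counts as correct
# if it matches ANY label in the image's label set, after mapping labels
# through synonym/equivalence groups (the paper's sets E). The groups and
# examples below are illustrative placeholders.
EQUIV = [{"laptop", "notebook computer"}, {"weasel", "least weasel"}]

def canonical(label: str) -> str:
    # Collapse each equivalence group to one stable representative.
    for group in EQUIV:
        if label in group:
            return min(group)
    return label

def is_correct(pred: str, label_set: set) -> bool:
    return canonical(pred) in {canonical(l) for l in label_set}

def multilabel_accuracy(preds, label_sets) -> float:
    hits = sum(is_correct(p, s) for p, s in zip(preds, label_sets))
    return hits / len(preds)

preds = ["notebook computer", "otter", "tabby cat"]
labels = [{"laptop", "desk"}, {"weasel"}, {"tabby cat"}]
print(round(multilabel_accuracy(preds, labels), 3))  # 2 of 3 accepted -> 0.667
```

Note that "notebook computer" is accepted for an image labeled {"laptop", "desk"} only because of the equivalence group; without synonym handling the same prediction would be scored as wrong.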

Long-Term Applications

  • Public, multilabel reannotations for major benchmarks
    • What: Scale ReGT-style efforts across datasets and domains to standardize multilabel, ambiguity-aware evaluation and reduce label noise.
    • Sectors: academia, standards bodies, benchmark consortia.
    • Tools/workflows: community curation pipelines; consensus-building on synonym sets; dataset versioning standards.
    • Assumptions/dependencies: funding and coordination; licensing; broad community adoption.
  • OOP-aware decoding and alignment for classification
    • What: Train/fine-tune MLLMs to minimize OOP (e.g., constrained decoding against class lists, better instruction tuning) while retaining general utility.
    • Sectors: model developers, embedded/edge AI, robotics.
    • Tools/workflows: constrained decoders; alignment objectives penalizing OOP; evaluation suites validating both generalization and compliance.
    • Assumptions/dependencies: access to model weights or adapter methods; token-length constraints; careful trade-offs between flexibility and faithfulness.
  • Dynamic open-world classification with novelty detection
    • What: Production systems that map free-form outputs to known classes, detect “unknowns,” and update taxonomies incrementally with human approval.
    • Sectors: retail (new SKUs), media libraries, defense/security monitoring, scientific curation.
    • Tools/workflows: thresholded embedding similarity; retrieval-augmented labeling; taxonomy lifecycle management; continuous evaluation.
    • Assumptions/dependencies: robust calibration; human review queue; governance for adding classes.
  • Automated dataset quality auditors
    • What: MLLM-powered tools that scan datasets to surface likely mislabels, duplicates, overlap, or OOD images; produce relabel suggestions and confidence levels.
    • Sectors: MLOps vendors, large data-owning enterprises.
    • Tools/workflows: auditor dashboards; integration with data version control; active-learning hooks to prioritize human review.
    • Assumptions/dependencies: scalable inference; access to data (privacy); acceptance of suggested-change workflows.
  • Sector-specific synonym and taxonomy services
    • What: Domain-tailored equivalence/thesaurus services (e.g., medical ontologies, retail catalogs, wildlife taxonomies) feeding evaluation and mapping pipelines.
    • Sectors: healthcare, e-commerce, ecology, geospatial.
    • Tools/workflows: APIs for equivalence sets and hierarchies; periodic updates with expert input; governance for changes.
    • Assumptions/dependencies: expert curation; alignment with existing standards (e.g., SNOMED, GBIF).
  • Benchmarking policy and certification standards
    • What: Procurement and regulatory frameworks that discourage inflated MC protocols and require CW+/OW evaluations, hard distractors, and multilabel-aware metrics.
    • Sectors: regulators, public sector procurement, large enterprises.
    • Tools/workflows: certification checklists; compliance audits; disclosure of evaluation configs and seeds.
    • Assumptions/dependencies: cross-institution consensus; incentives for vendors to comply.
  • Training curricula and practitioner guides
    • What: Incorporate findings on protocol sensitivity, label noise, and mapping methods into university courses and enterprise training for responsible deployment.
    • Sectors: education, corporate L&D, professional organizations.
    • Tools/workflows: modular teaching materials; lab assignments with CW+/OW and MC-hard; case studies of label noise impacts.
    • Assumptions/dependencies: open-source teaching resources; dataset access.
  • Video and streaming pipelines with multilabel semantics
    • What: Extend CW+/OW mapping and multilabel metrics to streaming/video classification for surveillance, manufacturing lines, and sports/media tagging.
    • Sectors: security, industrial automation, media analytics.
    • Tools/workflows: batched temporal inference; segment-level mapping; per-frame uncertainty aggregation.
    • Assumptions/dependencies: real-time constraints; temporal label definitions; edge deployment capabilities.

Notes on feasibility and transferability:

  • The strongest immediate gains come from CW+/OW mapping, better distractor design, and multilabel-aware evaluation—these require no model retraining.
  • Benefits depend on the availability and quality of text encoders and class/synonym catalogs; using model-specific encoders typically performs best.
  • Label improvements materially change conclusions about model capability; investing in better labels can rival gains from model upgrades for MLLMs.
  • For safety-critical domains, evaluation rigor and human oversight are non-negotiable; MC with random distractors should not be used to justify deployment.
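
The CW+/OW output mapping mentioned above can be sketched as a nearest-neighbor lookup in a text-embedding space: embed the model's free-form prediction, embed each class name, and pick the closest class by cosine similarity. The `embed` function below is a toy bag-of-characters stand-in (a real pipeline would use a proper text encoder, ideally the model-specific one the notes recommend), and the class list is invented for illustration:

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-characters embedding, L2-normalized.
    Stand-in for a real text encoder; illustration only."""
    counts = Counter(text.lower())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {ch: v / norm for ch, v in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse unit vectors."""
    return sum(a[ch] * b.get(ch, 0.0) for ch in a)


def map_to_class(free_form_output, class_names):
    """Resolve an out-of-prompt / open-world prediction to the
    closest class in the catalog via nearest-neighbor search."""
    query = embed(free_form_output)
    return max(class_names, key=lambda c: cosine(query, embed(c)))


classes = ["tabby cat", "golden retriever", "sports car"]
print(map_to_class("a photo of a tabby kitten", classes))  # prints: tabby cat
```

Because the mapping is pure post-processing, it requires no model retraining and no constrained decoding; swapping in a stronger text encoder only changes the `embed` step.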

Glossary

  • Annotation noise: Incorrect or inconsistent labels in a dataset that can mislead training or evaluation. "greatly reducing annotation noise, ambiguities and other imperfections mentioned."
  • Batch size: The number of samples processed together in one forward/backward pass, which can affect model behavior and reported accuracy. "batch size, image ordering, and text encoder selection"
  • Closed-World (CW): A classification setting where the model must choose from an exhaustive, predefined list of classes. "The Closed-World (CW) task is designed to mimic classification with supervised models and VLMs."
  • Confidence intervals (CI): Statistical ranges that quantify uncertainty around an estimate, such as accuracy. "with 95% confidence intervals (CI)."
  • Confusion matrix: A matrix summarizing how often each true class is predicted as each other class, used here to sample hard distractors. "based on the confusion matrix of the supervised EVA-02 model, denoted as conf_EVA(c)"
  • Constrained decoding: Forcing a generative model to produce outputs that conform to a specified set of tokens or structures. "without costly constrained decoding"
  • Distractors: Incorrect answer options included alongside the correct one in multiple-choice evaluation to test model discrimination. "Distractors are sampled per-image, ranging from random to more challenging alternatives semantically close to the ground-truth label."
  • Distribution shift: A change between training and test data distributions that can degrade model performance. "distribution shifts between training and validation sets"
  • Embedding-based post-processing: Mapping free-form model outputs to target classes via vector embeddings and nearest neighbors. "a lightweight embedding-based post-processing that resolves out-of-prompt predictions"
  • Equivalence set: A predefined grouping of semantically interchangeable labels treated as the same during evaluation. "We treat each pair in a predefined equivalence set E as interchangeable"
  • Free-form outputs: Unconstrained natural language responses produced by a model, not limited to a fixed class list. "MLLMs produce free-form outputs"
  • Ground truth (GT): The reference labels used to evaluate predictions. "classification performance depends critically on evaluation protocol and ground truth quality."
  • Hallucinations: Model outputs that are plausible but not allowed or not supported by the prompt/data, such as generating labels outside the provided list. "often referred to as hallucinations"
  • ImGT: The original ImageNet-1k ground-truth label for an image. "ImGT: original single ImageNet label"
  • Instruction following: The capability of models to adhere to user prompts and directions during multimodal tasks. "primarily assessing high-level multimodal reasoning and instruction following through multiple-choice question answering"
  • Instruction tuning: Post-training on instruction–response pairs to improve following directions and task performance. "Progressive multimodal scaling + instruction tuning"
  • k-NN classifier: A nonparametric method that classifies based on the labels of the k nearest neighbors in feature space. "as a k-NN classifier using ImageNet-1k training set (ViT-7B, k=11)"
  • Label noise: Systematic or random inaccuracies in labels that can bias learning and evaluation. "Label Noise Sensitivity Across Learning Paradigms."
  • Mixture-of-Experts (MoE): A neural architecture that routes inputs to specialized expert sub-networks for efficiency and capacity. "Qwen3 MoE"
  • Multiple-Choice (MC): An evaluation setup where the model selects one answer from a small set of candidates including distractors. "Multiple-choice (MC) question answering is commonly used to benchmark MLLMs"
  • Multimodal LLM (MLLM): An LLM augmented with visual (and possibly other) modalities to process and reason over multiple input types. "MLLMs performance has been extensively benchmarked"
  • Nearest-neighbor search: Assigning labels by finding the closest vector (e.g., text embedding) in a predefined set. "or nearest-neighbor search in a text-embedding space"
  • Open-World (OW): A setting where the model generates a free-form description that is later mapped to the closest class. "Open-World (OW) is closest to everyday MLLM use"
  • Out-of-distribution (OOD): Data that do not come from the same distribution as the training or assumed evaluation set. "our non-OOD reannotated labels do not contain the ground-truth label"
  • Out-of-prompt (OOP) predictions: Outputs that do not belong to the set of allowed classes specified in the prompt. "out-of-prompt (OOP) predictions become more frequent"
  • ReaL accuracy: A multilabel-aware ImageNet metric where a prediction is correct if it matches any valid label for an image. "we adapt the top-1 ReaL accuracy"
  • ReGT: The paper’s reannotated ground-truth labels for a subset of ImageNet-1k, used to reduce noise and ambiguity. "ReGT: our reannotations."
  • Reinforcement learning from human feedback (RLHF): Post-training technique aligning model behavior using preferences or feedback from humans. "post-training alignment (RLHF)"
  • Self-supervised: Training without human-provided labels, using pretext objectives to learn representations. "self-supervised VLMs and MLLMs outperform them on images where ReGT disagrees with ImGT."
  • Text encoder: A model that converts text into vector embeddings for comparison or mapping. "text encoder selection"
  • Vision-LLM (VLM): A model trained jointly on images and text to align visual and textual representations. "both supervised and VLMs"
  • Vision Transformer (ViT): A transformer-based architecture applied to image patches for vision tasks. "SigLIP-so400M denotes the shape-optimized Vision Transformer."
  • Zero-shot: Performing a task without task-specific training on the target labels, typically via prompt-based or embedding methods. "this approach would no longer be zero-shot"
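
The ReaL accuracy and equivalence-set entries above can be combined into one small metric: a prediction counts as correct if it matches any valid label for the image, with listed label pairs treated as interchangeable. This is a sketch under those definitions, not the paper's actual evaluation code; the function name and sample labels are illustrative:

```python
def multilabel_top1_accuracy(predictions, label_sets, equivalence=None):
    """ReaL-style top-1 accuracy: a prediction is correct if it matches
    any valid label for the image. `equivalence` is an optional set of
    label pairs treated as interchangeable (sketch, names illustrative)."""
    equivalence = equivalence or set()

    def same(a, b):
        # Exact match, or the pair appears in the equivalence set.
        return a == b or (a, b) in equivalence or (b, a) in equivalence

    correct = sum(
        any(same(pred, lbl) for lbl in labels)
        for pred, labels in zip(predictions, label_sets)
    )
    return correct / len(predictions)


preds = ["laptop", "tusker", "sunglasses"]
labels = [{"notebook", "laptop"}, {"indian elephant"}, {"sunglass"}]
eq = {("sunglass", "sunglasses")}
print(multilabel_top1_accuracy(preds, labels, eq))  # 2 of 3 correct
```

Under a single-label ground truth the same function reduces to ordinary top-1 accuracy, which is why multilabel-aware evaluation can only raise, never lower, a model's measured score.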

Open Problems

We found no open problems mentioned in this paper.
