M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
Explain it Like I'm 14
What is this paper about?
This paper introduces M4-RAG, a big test to check how well AI systems can answer questions about pictures when they can also look things up. It focuses on making this work across many languages and cultures, not just English. The goal is to see when "retrieval" (pulling extra info from trusted sources like Wikipedia) actually helps the AI give better, more accurate answers.
What questions are the researchers trying to answer?
The researchers wanted to understand three simple things:
- Does adding extra info from text and images (multimodal retrieval) help more than using text alone?
- Do bigger AI models still benefit from retrieval, or do they rely mostly on what they already know?
- Does changing the language of the question or the retrieved info (for example, using Spanish or Hindi instead of English) change how well the AI performs?
How did they do it?
They designed a huge, realistic test called M4-RAG:
- It includes over 80,000 question–image pairs across 42 languages and 56 dialects. These cover culturally diverse topics like food, sports, people, plants, and traditions.
- They pulled millions of carefully chosen documents (like Wikipedia pages) into a “controlled library” so every experiment has fair, repeatable access to the same information.
- They tested different ways of getting extra info:
- No RAG: The AI sees only the question and picture.
- Ground-truth context: The AI gets perfectly relevant info (the “best case”).
- Text-only RAG: The AI turns the image into text and retrieves matching pages.
- Multimodal RAG: The AI uses both the question and the picture together to retrieve info.
- They tried multiple popular vision-language models (VLMs) of different sizes (small to very large).
- They measured accuracy on two benchmarks:
- WorldCuisines: questions about global foods in many languages.
- CVQA: questions about culturally diverse topics from different countries.
- To judge answers, they used a mix of automatic checks and careful review (including an “AI judge” and human validation), so results are reliable.
Think of it like this: You show an AI a picture of a dish and ask, “What is this?” Without retrieval, it may guess based only on the picture. With retrieval, it can look up matching images and descriptions—like a student using a trusted encyclopedia to double-check their answer.
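The four test settings above can be sketched as a single dispatch function. This is a minimal illustration, not the paper's actual code: `caption_fn`, `retrieve_text`, and `retrieve_mm` are hypothetical stand-ins for a captioner, a text retriever, and a multimodal retriever.

```python
def build_context(mode, question, image, gold_context=None,
                  caption_fn=None, retrieve_text=None, retrieve_mm=None):
    """Return the extra context handed to the VLM under each evaluation setting.

    All retriever/captioner arguments are illustrative placeholders.
    """
    if mode == "no_rag":
        return None                                  # question + image only
    if mode == "ground_truth":
        return gold_context                          # best-case upper bound
    if mode == "text_rag":
        caption = caption_fn(image)                  # turn the image into text first
        return retrieve_text(question + " " + caption)
    if mode == "multimodal_rag":
        return retrieve_mm(question, image)          # query with both modalities
    raise ValueError(f"unknown mode: {mode}")
```

The key contrast the paper draws is between the last two branches: text-only RAG loses visual detail at the captioning step, while multimodal RAG queries with the image itself.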
What did they find?
Here are the main results:
- Multimodal retrieval often helps small models: Adding extra info from both text and images boosts performance for smaller AIs. They can use that outside knowledge to get more accurate answers.
- Text-only retrieval can hurt: Just turning pictures into text and searching often adds noise and can make the AI do worse.
- Bigger isn’t always better for RAG: Large models rely heavily on what they already learned during training. They are less likely to change their minds even when given helpful outside info. That makes them harder to mislead with bad info, but it also means they benefit less from good info.
- Retrieval quality matters a lot: If the retrieved info is relevant and matches the picture and question well, the AI keeps correct answers and fixes many wrong ones. If the info is off-topic, it can push the AI toward wrong answers.
- English bias is real: When prompts and context are in non-English languages, performance often drops, especially for low-resource languages (languages with fewer training examples online). This shows today’s systems are not yet equally strong across all languages.
- Multimodal beats text-only: Using both the picture and the question to retrieve information consistently outperforms using text alone.
Why is this important? It shows that the best way to improve accuracy is not just making models bigger—but getting better at finding and using the right information, especially across languages and cultures.
What does this mean for the future?
This research suggests several clear steps forward:
- Build fairer systems: We need better support for non-English users, so asking questions and getting helpful context in your own language works as well as in English.
- Improve retrieval, not just models: Focus on smarter ways to find high-quality, relevant info that matches both the question and the image.
- Teach models to trust good evidence: Train AIs to integrate outside info more effectively—correct wrong guesses when evidence is strong, and ignore misleading context.
- Use M4-RAG as a shared testbed: Because the benchmark is open-source, other researchers can use it to improve multilingual, multi-cultural, and multimodal AI.
In short, M4-RAG shows that adding the right outside knowledge can make AI smarter and more culturally aware—but only if we retrieve the right information and teach models to use it well, in any language.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper. Each point is formulated so that future researchers can act on it.
- Generalizability beyond Wikipedia: The controlled retrieval environment is built primarily from Wikipedia (April 2025 snapshots). It is unclear how findings transfer to open‑web sources, non‑encyclopedic content, local news, social media, or community knowledge bases with culturally specific coverage.
- Domain breadth vs evaluation scope: Although the benchmark claims coverage across many cultural domains, experiments focus on WorldCuisines and CVQA. The paper does not test whether results (e.g., RAG’s diminishing returns at scale) hold across other domains in the corpus.
- Retrieval pipeline clarity (multimodal content): The multimodal RAG configuration is under‑specified—do retrieved documents include images, how is visual evidence encoded/indexed, and how are image contexts presented to VLMs? A clear ablation of text‑only vs text+image retrieval and consumption is missing.
- Fixed top‑k without sensitivity analysis: All RAG runs use a fixed top‑k of retrieved passages. There is no study of sensitivity to k, chunk size, passage length, or retrieval depth, nor of adaptive selection or reranking strategies to mitigate noise.
- Indexing and chunking details: Passage segmentation, multilingual tokenization, script handling, and chunking heuristics are not analyzed for their impact on cross‑lingual/multimodal retrieval quality.
- Limited retriever diversity: Only mmE5 and VLM2Vec‑B3 are evaluated. There is no comparison against hybrid lexical‑semantic retrievers, bi‑/multilingual BM25, cross‑encoder rerankers, or multi‑stage pipelines (retrieve‑then‑rerank), especially for low‑resource languages.
- No end‑to‑end training/integration: The generator consumes retrieved context via concatenation; there is no exploration of training-time integration (e.g., cross‑attention fusion, retrieval‑aware finetuning, gating modules, or confidence calibration) to help large models leverage external evidence.
- Large‑model “context susceptibility” remains unaddressed: The paper characterizes diminishing RAG benefits for larger VLMs but does not test mechanisms (trust calibration, self‑verification, selective reading, retrieval‑consistency scoring) to overcome inertial priors.
- Retrieval quality measurement relies on VLM‑as‑judge: Relevance scoring uses a VLM judge with limited description of human validation scope, inter‑annotator agreement, multilingual judge reliability, or standardized IR metrics (e.g., nDCG, MRR). Robustness of the judge across scripts/dialects is not assessed.
- Oracle setups are unrealistic: “Oracle‑Query RAG” (using ground‑truth as query) and “Oracle context” (including model‑generated captions for CVQA) risk overestimating upper bounds. The paper does not quantify biases introduced by model‑generated or translated oracle contexts vs human‑curated gold evidence.
- Caption quality not studied: Text‑based RAG depends on image captions by Qwen2.5‑VL‑72B. The impact of captioner choice, caption fidelity across languages, and errors on downstream retrieval/generation is not analyzed.
- Multilingual prompt/context translation quality: Multilingual prompts and oracle contexts are produced with Gemini and “validated,” but translation quality, cultural fidelity, register/dialect correctness, and their impact on performance are not quantified.
- Query–document language pairing: Beyond aggregate results, there is no systematic study varying the query language and the document language, e.g., English doc with non‑English prompt, translated doc in the user language, or mixed‑language contexts, nor policies for when to translate.
- Script and dialect effects: Performance is not broken down by script (Latin, Cyrillic, Arabic, Devanagari, etc.), orthographic variants, transliteration, or dialect/register features. How these factors affect embedding alignment and retrieval is unknown.
- Low‑resource coverage gaps: The retrieval corpora may have sparse content for certain languages/dialects. There is no analysis of per‑language corpus density, recall, or strategies to supplement with local sources to reduce cultural/linguistic coverage bias.
- Free‑form generation missing: Evaluation uses multiple‑choice accuracy; there is no assessment of open‑ended answers, calibration, citation/attribution correctness, faithfulness to retrieved passages, or evidence‑grounded reasoning quality.
- Evidence utilization metrics: Beyond accuracy, the paper does not measure whether models actually rely on retrieved context (e.g., rationale overlap, quote alignment, source attribution), nor penalize unsupported answers.
- Adversarial and misleading retrieval stress tests: Although WorldCuisines includes adversarial prompts, the paper lacks targeted experiments on retrieval attacks (irrelevant yet plausible contexts, subtle contradictions) and defenses (consistency checks, veracity scoring, provenance filters).
- Multi‑hop cross‑modal retrieval: The benchmark does not test multi‑hop reasoning across multiple documents/modalities (e.g., image→text→image chains), leaving open how multimodal RAG can support compositional, evidence‑chained answers.
- Resource and latency profiling: There is no analysis of retrieval/generation latency, memory/compute footprint, or cost at scale (e.g., mmE5 11B embeddings, multilingual indices), nor trade‑offs between accuracy and efficiency.
- Real‑time/temporal dynamics: The controlled corpus snapshot (April 2025) does not explore time‑dependent cultural knowledge changes, continuous updates, or temporal grounding in retrieval and evaluation.
- Data contamination controls: The paper does not check whether evaluated VLMs have seen CVQA/WorldCuisines or their Wikipedia passages during training, which could inflate baseline performance or confound RAG gains.
- Cultural bias and ethics: The benchmark does not include audits for stereotype amplification, cultural mislabeling, or representational harms, nor guidance on ethical retrieval sources across cultures.
- Per‑language retrieval diagnostics: There is no report of recall@k/precision@k or relevance distributions per language/dialect; without these, it is hard to target retriever improvements in specific linguistic settings.
- Context formatting and length: Effects of prompt templates, context ordering, chunk concatenation, and context window length (and truncation) across languages are not ablated.
- Translation policy for retrieved contexts: The paper shows multilingual contexts can hurt performance but does not test translating retrieved contexts to the instruction language vs preserving original language, or mixed bilingual presentations.
- Generator–retriever alignment: It remains open how to align retriever embeddings with generator internal representations (e.g., shared encoders, adapter layers, retrieval‑aware PEFT) to reduce mismatch across languages/modalities.
- Benchmarks for evidence faithfulness across languages: There is no multilingual, multimodal metric or dataset for evaluating whether the model’s reasoning steps and answers are grounded in the provided evidence in the target language.
- Public release details of judge and annotations: The extent of human validation, judge prompts, and versioning is not fully specified; reproducibility and cross‑model judge consistency are open issues.
Practical Applications
Overview
Based on the paper’s benchmark, analyses, and design recommendations for multilingual, multicultural, multimodal RAG in VQA, the following applications translate the findings into concrete workflows and products across industry, academia, policy, and daily life. Each item notes sectors, potential tools/products, and key dependencies.
Immediate Applications
These can be deployed now using the released dataset/code, existing VLMs, and off‑the‑shelf retrievers.
- Cost‑efficient assistant design: small VLM + multimodal RAG
- Sectors: Software, E‑commerce, Customer Support
- What: Replace large VLMs with smaller multilingual VLMs coupled with strong multimodal retrieval (e.g., mmE5) for image‑grounded Q&A in apps and chatbots.
- Tools/Workflows:
- Multimodal retriever (mmE5) over curated corpora
- RAG gating: only inject context when relevance is high
- Retention/Correction metrics as health KPIs
- Assumptions/Dependencies:
- Access to a relevant, up‑to‑date multilingual corpus
- Engineering support for multimodal embedding pipelines
- Acceptance that large VLMs may not benefit from RAG without tuning
- Pre‑launch evaluation and regression testing of multimodal/multilingual features
- Sectors: Software (AI product QA), AI Vendors, Academia
- What: Use the benchmark and code to A/B test RAG variants, prompts, and retrievers across 42 languages/56 dialects before deployment.
- Tools/Workflows:
- CI/CD jobs running the M4-RAG benchmark
- Dashboards for Correctness Retention and Correction Rate by language/modality
- Assumptions/Dependencies:
- Integration of the released dataset/code
- GPU time for periodic evaluations
- Multilingual visual customer support with image attachments
- Sectors: Consumer Electronics, Automotive, Telecom
- What: Handle tickets with photos (e.g., device errors, dashboard lights) and questions in local languages; retrieve manuals/FAQs across languages.
- Tools/Workflows:
- Image + question → multimodal retrieval → evidence‑grounded response
- Language routing: keep internal system prompts in English (per findings), translate user I/O
- Assumptions/Dependencies:
- Domain‑specific corpora (manuals, troubleshooting guides)
- Translation pipeline and prompt templates
- Cross‑border e‑commerce product and food recognition
- Sectors: E‑commerce, Food Tech, Travel
- What: Identify products/dishes from images and return culturally specific names/descriptions localized per market.
- Tools/Workflows:
- Catalog + Wikipedia‑like corpora indexed for multimodal retrieval
- Localization workflow: English internal prompts, localized outputs
- Assumptions/Dependencies:
- High‑quality product/food imagery and metadata
- Cultural taxonomy alignment and name normalization
- Knowledge management and internal search with images
- Sectors: Manufacturing, Enterprise IT
- What: Visual Q&A on parts, diagrams, or engineering photos with cross‑lingual retrieval of SOPs/specs.
- Tools/Workflows:
- Controlled knowledge base snapshots and indexing
- Relevance scoring thresholds to prevent misleading context
- Assumptions/Dependencies:
- Document digitization and multilingual metadata
- Data access controls and PII/IP policies
- Cultural content tagging and moderation
- Sectors: Media Platforms, Social Networks
- What: Identify culturally specific items in user images and add accurate, localized tags or moderation context.
- Tools/Workflows:
- Multimodal retrieval to ground labels and reduce hallucinations
- Language‑aware pipelines; keep instruction prompts in English
- Assumptions/Dependencies:
- Policy‑driven taxonomies; human‑in‑the‑loop review for sensitive content
- Fairness and compliance auditing for multilingual/multimodal systems
- Sectors: Policy/Regulatory, AI Governance, AI Vendors
- What: Quantify performance gaps across languages/dialects and modalities; publish fairness reports.
- Tools/Workflows:
- Use M4-RAG to audit by vitality-bucketed language groups
- Track gaps in accuracy, retention, correction per language
- Assumptions/Dependencies:
- Organizational commitment to report and remediate disparities
- Academic baselines and courseware
- Sectors: Academia, Research Labs
- What: Reproducible baselines for multilingual multimodal RAG, assignments on retrieval quality and cross‑lingual reasoning.
- Tools/Workflows:
- Released dataset + controlled retrieval environment
- Assumptions/Dependencies:
- Compute for evaluation; adherence to dataset licenses (CC‑BY‑SA 4.0)
- Prompting and language policy updates
- Sectors: Software, Contact Centers
- What: Adopt English system prompts internally while localizing user‑facing inputs/outputs to reduce performance drop for non‑English use.
- Tools/Workflows:
- Prompt libraries per language; automatic translation in/out
- Assumptions/Dependencies:
- Acceptable UX trade‑offs; clear disclosure if content is machine‑translated
- Retrieval quality monitoring and fail‑safes
- Sectors: MLOps, Platform Teams
- What: Set relevance thresholds; fallback to “No‑RAG” when retrieval is weak to avoid degrading correct answers.
- Tools/Workflows:
- Lightweight relevance scorers (BM25 + dense reranker)
- Online canaries tracking retention/correction
- Assumptions/Dependencies:
- Real‑time retrieval scoring at low latency
- Observability pipelines
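The "RAG gating" and No-RAG fallback pattern recurring in the items above can be sketched as follows. This is a sketch under stated assumptions: `retrieve`, `relevance_score`, `generate`, and the `0.5` threshold are illustrative placeholders, not components from the paper.

```python
def answer_with_gated_rag(question, image, retrieve, relevance_score,
                          generate, threshold=0.5):
    """Inject retrieved context only when its relevance clears a threshold;
    otherwise fall back to No-RAG so weak evidence cannot drag down answers
    the model would have gotten right on its own."""
    passages = retrieve(question, image)
    kept = [p for p in passages if relevance_score(question, p) >= threshold]
    if not kept:
        return generate(question, image, context=None)   # No-RAG fallback
    return generate(question, image, context="\n".join(kept))
```

The threshold would be tuned per deployment against the retention/correction KPIs mentioned above: raise it if retrieval starts flipping correct answers, lower it if too few wrong answers get corrected.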
Long‑Term Applications
These require new architectures, training, domain validation, or broader dataset coverage beyond the current benchmarks.
- Large‑model RAG architectures that truly leverage evidence
- Sectors: Software, AI Vendors, Academia
- What: Develop training and fusion methods to overcome “parametric inertia” in large VLMs, improving correction without harming correct baselines.
- Tools/Workflows:
- Evidence‑aware attention, contrastive grounding, retrieval‑conditioned finetuning
- Assumptions/Dependencies:
- Access to large‑scale multilingual multimodal supervision
- Evaluation standards extending M4-RAG
- Cross‑lingual, cross‑modal search engines and cultural heritage explorers
- Sectors: Museums, Media, Libraries, Tourism
- What: Query with a photo and dialectal text to retrieve culturally accurate artifacts and narratives across languages.
- Tools/Workflows:
- Cross‑lingual entity linking; multimodal rerankers; provenance tracking
- Assumptions/Dependencies:
- Rights‑cleared corpora; high‑quality metadata and digitization
- Global catalog and taxonomy alignment for marketplaces
- Sectors: E‑commerce, Supply Chain
- What: Automatically align images, descriptions, and cultural variants of products across markets and languages.
- Tools/Workflows:
- Cross‑lingual category mapping; reference ontology; multimodal RAG
- Assumptions/Dependencies:
- Standardized product schemas; robust de‑duplication
- Multilingual visual tutors and curricular tools
- Sectors: Education, EdTech
- What: Visual Q&A tutors that explain objects, customs, and contexts with region‑appropriate evidence and language.
- Tools/Workflows:
- Curriculum‑aligned corpora; pedagogy‑aware prompting and evaluation
- Assumptions/Dependencies:
- Content alignment with local curricula; accessibility and bias safeguards
- Healthcare information assistants with image + multilingual grounding
- Sectors: Healthcare, Public Health
- What: Patient‑facing educational assistants (e.g., wound care photos with localized instructions), grounded in region‑specific guidelines.
- Tools/Workflows:
- Medical knowledge graphs; safety‑first RAG; clinician‑in‑the‑loop
- Assumptions/Dependencies:
- Regulatory approval; rigorous clinical validation; privacy protections
- Crisis response and humanitarian information systems
- Sectors: Public Safety, NGOs
- What: Multilingual visual Q&A for field images (e.g., signs, hazards) with retrieval of localized emergency guidance.
- Tools/Workflows:
- Offline/edge retrieval caches; trust calibration; provenance scoring
- Assumptions/Dependencies:
- Domain adaptation and robustness under noisy conditions; safety testing
- Robotics and HRI in multicultural environments
- Sectors: Robotics, Smart Devices
- What: Embodied agents that recognize objects visually and retrieve culturally appropriate instructions or warnings in users’ languages.
- Tools/Workflows:
- On‑device multimodal retrieval; fast evidence fusion; dialogue policies
- Assumptions/Dependencies:
- Real‑time constraints; safe failure modes; continual learning
- Multilingual fairness and compliance standards for RAG
- Sectors: Policy/Regulatory, Standards Bodies
- What: Benchmarks and procurement criteria that require measurable performance across low‑resource languages and dialects in multimodal tasks.
- Tools/Workflows:
- Standardized audits (e.g., retention/correction by language/resource level)
- Assumptions/Dependencies:
- Industry adoption; extensions of M4-RAG to more domains
- Misinformation and image fact‑checking across languages
- Sectors: Newsrooms, Platforms, Civil Society
- What: Cross‑lingual retrieval to verify image‑based claims and surface contextual evidence with source attribution.
- Tools/Workflows:
- Provenance chains; multi‑source corroboration; explainable RAG
- Assumptions/Dependencies:
- High‑quality multilingual evidence pools; legal frameworks for content handling
- Domain‑specific multimodal RAG corpora and sandboxes
- Sectors: Regulated Industries (Finance, Healthcare), Enterprise
- What: Controlled retrieval environments for safe experimentation and reproducible evaluations, mirroring the paper’s setup but with domain documents.
- Tools/Workflows:
- Periodic corpus snapshots; policy filters; evaluation harnesses
- Assumptions/Dependencies:
- Data governance and anonymization; security controls
Cross‑cutting Dependencies and Assumptions
- Retrieval quality is decisive: poor context harms correct answers; strong context improves corrections for small/medium models but not necessarily for large models.
- Multimodal retrieval outperforms text‑only pipelines that “caption then retrieve”; avoid naive image‑to‑text conversions for retrieval.
- Language handling matters: current VLMs often perform better with English system prompts and English context internally, even for non‑English user queries; production systems may need translation pipelines and careful UX design.
- Benchmark scope: findings are measured on VQA (WorldCuisines, CVQA) and Wikipedia‑based corpora; generalization to other domains requires adaptation and re‑evaluation.
- Rights and licensing: the benchmark is CC‑BY‑SA 4.0; ensure compliance when mixing with proprietary corpora.
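The "retention" and "correction" metrics referenced throughout these dependencies can be computed from paired per-question results. A minimal sketch, assuming each question is scored once without RAG and once with RAG:

```python
def retention_and_correction(no_rag_correct, rag_correct):
    """Compute the two RAG health metrics from paired per-question booleans.

    Correctness Retention: % of answers correct without RAG that stay correct.
    Correction Rate: % of answers wrong without RAG that RAG fixes.
    """
    pairs = list(zip(no_rag_correct, rag_correct))
    was_right = [after for before, after in pairs if before]
    was_wrong = [after for before, after in pairs if not before]
    retention = 100.0 * sum(was_right) / len(was_right) if was_right else 0.0
    correction = 100.0 * sum(was_wrong) / len(was_wrong) if was_wrong else 0.0
    return retention, correction
```

Tracking both separates the two failure modes: low retention means retrieval is breaking answers the model already knew, while a low correction rate means retrieval is not adding usable knowledge.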
Glossary
- Adversarial prompts: Inputs intentionally designed to mislead or stress-test models by providing misleading context. "intentionally challenging scenarios, such as adversarial prompts where the provided context is misleading"
- Code-switch: Switching between languages within a single response or discourse. "they often code-switch to English in their response"
- Conditioning signal: Additional input that steers a model’s generation toward desired information or context. "the retrieved context is treated as an additional conditioning signal that steers the model toward culturally relevant knowledge"
- Context susceptibility: The degree to which a model’s predictions change when external context is provided. "reduced context integration (or lower context susceptibility)"
- Correction Rate: A metric quantifying how often incorrect baseline answers are fixed after adding retrieved context. "The ``Correction Rate'' measures the percentage of responses that were incorrect without RAG but were successfully corrected by RAG."
- Correctness Retention: A metric quantifying how often correct baseline answers remain correct after adding retrieved context. "The ``Correctness Retention'' rate measures the percentage of responses that were correct without RAG and remained correct with RAG."
- Cross-lingual retrieval: Retrieving relevant information when queries and documents are in different languages. "alignment between cross-lingual retrieval and multimodal representations"
- Dialects and registers: Variants of a language tied to region (dialects) and social context or formality (registers). "covering 42 languages and 56 regional dialects and registers"
- Ground-truth context: Verified, authoritative information provided to the model to establish an upper performance bound. "Ground-Truth Context: The VLM is provided with the ground-truth context, representing an upper bound on performance."
- Indexed document collection: A corpus organized for efficient retrieval via precomputed indices. "compares it against an indexed document collection"
- Inertial priors: Strong internal model beliefs that resist updates from external context. "model scale increases inertial priors"
- LLM-as-a-judge: Using an LLM to evaluate outputs according to predefined rubrics. "we employ an LLM-as-a-judge approach complemented by human validation"
- Macro-averaged accuracy: Accuracy computed by averaging per-class accuracies, treating classes equally. "we use macro-averaged accuracy for all datasets"
- Multilingual parallelism: Equivalent content available across multiple languages to enable controlled cross-lingual analysis. "its extensive multilingual parallelism that enables controlled analysis of cross-lingual retrieval behavior"
- Multimodal encoder: A model component that jointly encodes inputs from multiple modalities (e.g., text and image). "a multimodal encoder ($E_{\text{mm}}$) encodes the query (image + question)"
- Multimodal embedding models: Models producing vector representations that integrate multiple modalities for retrieval or similarity. "We test two multimodal embedding models: mmE5 (11B)~\citep{chen2025mme5} and B3 (7B) from VLM2Vec~\citep{jiang2024vlm2vec}."
- Multimodal RAG: Retrieval-Augmented Generation that uses multiple modalities (e.g., text and images) in both retrieval and generation. "With multimodal RAG, the system retrieves culturally specific evidence"
- Multimodal retrieval: Retrieving information using signals from multiple modalities (e.g., visual and textual). "How does multimodal retrieval compare to text-only retrieval in supporting downstream generation?"
- No‑RAG baseline: A configuration where no external retrieved context is provided to the model. "a No‑RAG baseline, where the VLM (M) directly takes the question and image as input"
- Oracle setup: An evaluation setting where perfect supporting information is assumed to be available. "and can even approach or match an ``Oracle'' setup that has perfect supporting information."
- Oracle‑Query RAG: A text-based retrieval variant using ground-truth context as the retrieval query. "Oracle-Query RAG: The VLM uses the ground-truth context as the query to retrieve passages."
- Parametric knowledge: Knowledge stored within model parameters rather than provided via external context. "stronger reliance on parametric knowledge"
- Retriever: The component that selects relevant passages from a corpus for downstream generation. "First, a retriever selects the top-k most relevant passages from the corpus:"
- Retrieval relevance score: A measure of how pertinent retrieved content is to the query and task. "the average retrieval relevance score"
- Retrieval-Augmented Generation (RAG): A paradigm that enriches model outputs with information retrieved from external sources. "Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information;"
- Temporal alignment: Synchronizing corpus snapshots with dataset timelines to preserve contextual fidelity. "to ensure broad thematic coverage and temporal alignment"
- Top‑k: Selecting the k most relevant items in retrieval or ranking. "we retrieve the top-k passages"
- Vision–language models (VLMs): Models that jointly process and reason over visual and textual inputs. "Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA)"
- Visual question answering (VQA): Tasks where models answer questions about images. "visual question answering (VQA)"
- Zero-shot: Evaluating models without task-specific training or fine-tuning. "We evaluate each model under both zero-shot and retrieval-augmented settings"
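The top-k retrieval step defined in the glossary can be illustrated with a toy dense-retrieval sketch. The vectors here are plain Python lists standing in for embeddings; a real system would use a multimodal encoder such as mmE5 and an approximate-nearest-neighbor index.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, passage_vecs, k=2):
    """Return indices of the k passages most similar to the query embedding."""
    scores = [(cosine(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```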