
Cross-Domain Evaluation of Multimodal Chain-of-Thought Reasoning: Integrating Different Datasets into the Amazon CoT Framework

Published 24 Nov 2025 in cs.AI and cs.LG | (2511.20701v1)

Abstract: While recent work has extended CoT to multimodal settings, achieving state-of-the-art results on science question answering benchmarks like ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OK-VQA, and ChartQA datasets, which require broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. This work provides practical insights for researchers implementing multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.

Summary

  • The paper demonstrates a significant performance drop in MM-CoT on open-domain datasets, revealing challenges in numeric, commonsense, and world-knowledge reasoning.
  • The methodology adapts the Amazon CoT pipeline with custom preprocessing, prompt engineering, and gated vision-language fusion to standardize diverse dataset formats.
  • Experimental results show accuracies between 14.30% and 32.00% on non-science tasks, underscoring the need for domain-specific MM-CoT enhancements.

Cross-Domain Evaluation of Multimodal Chain-of-Thought Reasoning in the Amazon CoT Framework

Introduction

This work investigates the generalizability of Multimodal Chain-of-Thought (MM-CoT) reasoning frameworks, originally designed for scientific question answering (ScienceQA), to diverse datasets necessitating numeric, commonsense, and world-knowledge reasoning. Specifically, the Amazon CoT architecture, characterized by a two-stage rationale generation and answer inference paradigm leveraging T5-based language models and gated fusion of vision features, is systematically evaluated on three external datasets: ChartQA, OK-VQA, and A-OKVQA. This evaluation under forced domain shift is motivated by the paucity of comprehensive empirical validation of MM-CoT frameworks beyond structured scientific benchmarks.

Datasets and Their Cross-Domain Challenges

ChartQA, OK-VQA, and A-OKVQA each stress particular facets of multimodal reasoning:

  • ChartQA: Consists of synthetically generated charts (bar, pie, line) requiring structured visual interpretation and numeric reasoning over tabular data. The evaluation probes the MM-CoT model's ability to adapt to geometric, symbolically dense visual information versus natural scenes (Figures 1 and 2).

    Figure 1: ChartQA features structured, synthetic chart images designed to elicit numeric and trend-based reasoning.


    Figure 2: ChartQA queries require stepwise language-based rationales for numerically grounded inferences.

  • OK-VQA: Presents open-ended questions on natural images demanding external world knowledge and commonsense. The task evaluates whether rationales generated via MM-CoT can transcend the immediate visual context and capture factual knowledge (Figure 3).

    Figure 3: Example of OK-VQA’s open-ended, knowledge-driven question and rationale prompt format.

  • A-OKVQA: Encompasses ambiguous, multi-annotator samples enriched with human-written rationales, requiring both visual grounding and integration of world knowledge. The presence of annotated rationales allows assessment of MM-CoT's rationale fidelity and answer selection in open-domain contexts (Figures 4 and 5).

    Figure 4: A-OKVQA images include natural scenes necessitating complex, multi-hop reasoning.


    Figure 5: A-OKVQA textual data provide multiple rationales per question, enabling consensus evaluation.

Methodological Adaptations

Transitioning the Amazon CoT pipeline to open-domain, multimodal question answering required several adaptations to both data processing and model architecture:

  • Data Preprocessing: Non-ScienceQA datasets (ChartQA, OK-VQA, A-OKVQA) lack standardized JSON format and rationale/lecture fields. Custom harmonization scripts transform these sources into unified representations, inject placeholder fields where needed, and normalize answer formats to enable robust batch processing.
  • Prompt Engineering: Multiple-choice logic is suppressed for open-ended datasets. For ChartQA and OK-VQA, prompts elicit unrestricted rationales with numeric or categorical answer extraction. Regular expressions and normalization pipelines ensure reliable retrieval of final predictions.
  • Vision-Language Fusion: For each dataset, ViT-based image encoders generate feature vectors that are fused with textual tokens using a gated linear projection and concatenation, maintaining multimodal alignment.
  • Evaluation Metrics Expansion: Besides standard Exact Match (EM), numerical tolerance, consensus voting (for A-OKVQA/OK-VQA), and semantic similarity scores (using transformer embeddings) are employed to better reflect answer and rationale quality on diverse data formats.
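
The gated vision-language fusion step above can be sketched in a few lines. The following NumPy sketch illustrates only the gating idea, a learned sigmoid gate λ blending text and image features; the per-dimension gate, the weight shapes, and the convex-combination form are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def gated_fusion(h_text, h_img, w_gate, b_gate):
    """Fuse text and image feature vectors with a learned gate.

    lam   = sigmoid([h_text; h_img] @ w_gate + b_gate)   # per-dimension gate (d,)
    fused = (1 - lam) * h_text + lam * h_img             # convex blend of modalities
    """
    concat = np.concatenate([h_text, h_img])             # (2d,)
    lam = 1.0 / (1.0 + np.exp(-(concat @ w_gate + b_gate)))
    return (1.0 - lam) * h_text + lam * h_img

# With zero weights the gate is exactly 0.5, so the output is the
# elementwise mean of the two inputs.
d = 4
fused = gated_fusion(np.ones(d), np.zeros(d), np.zeros((2 * d, d)), np.zeros(d))
# fused -> [0.5, 0.5, 0.5, 0.5]
```

In the actual pipeline the gate is trained end-to-end, letting the model suppress visual features when they would induce hallucinated rationale content.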

Experimental Results

Performance degrades substantially when MM-CoT is transferred beyond ScienceQA. Empirical findings include:

Dataset      Accuracy (%)
ScienceQA    90.45
A-OKVQA      32.00
OK-VQA       21.31
ChartQA      14.30

  • ChartQA (14.30% EM): The lowest accuracy observed, attributed to the symbolic, geometric nature of charts, the reliance on numeric inference (for which T5 is suboptimal), and the lack of multiple-choice prompts. ViT encoders struggle to adequately process fine-grained axes, fonts, and legends, and rationale quality is limited.
  • A-OKVQA (32.00% EM): Highest among the external datasets due to its similarity in rationale format to ScienceQA and presence of annotated rationales. Yet, vision encoder mismatches and diluted rationale grounding persist.
  • OK-VQA (21.31% EM): Sits between the two extremes, challenged by the need for external world knowledge integration absent from both image and text, confounded by ambiguous, open-ended question formats (Figure 6).

    Figure 6: Comparative accuracy across ScienceQA, ChartQA, OK-VQA, and A-OKVQA in the Amazon MM-CoT framework.
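
The expanded evaluation metrics used to produce these numbers (string-normalized Exact Match, tolerance-based numeric comparison, and fractional consensus voting) can be sketched as follows. This is an illustrative reimplementation, not the paper's evaluation code: the normalization rules, the 1% relative tolerance, and the min(n/3, 1) consensus formula are assumptions, the last following the standard VQA accuracy convention.

```python
import re

def normalize(ans):
    """Lowercase, drop articles and most punctuation, collapse whitespace."""
    ans = ans.lower().strip()
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    ans = re.sub(r"[^\w\s.%-]", "", ans)
    return re.sub(r"\s+", " ", ans).strip()

def exact_match(pred, gold):
    """String-normalized strict match."""
    return normalize(pred) == normalize(gold)

def numeric_match(pred, gold, tol=0.01):
    """Tolerance-based numeric comparison (assumed 1% relative tolerance)."""
    try:
        p = float(normalize(pred).rstrip("%"))
        g = float(normalize(gold).rstrip("%"))
    except ValueError:
        return False
    return abs(p - g) <= tol * max(1.0, abs(g))

def consensus_score(pred, annotator_answers):
    """VQA-style fractional credit: min(#agreeing annotators / 3, 1)."""
    matches = sum(normalize(pred) == normalize(a) for a in annotator_answers)
    return min(matches / 3.0, 1.0)
```

For example, `numeric_match("14.3%", "14.30")` is True, and a prediction matching only one of several annotators earns partial credit of 1/3 under the consensus score.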

Discussion and Implications

The dramatic drop in performance outside ScienceQA highlights the poor domain transferability of current MM-CoT architectures. Key deficiencies include:

  • Vision Encoder Generalization: ViT, trained primarily on natural scenes, is ill-suited to synthetic chart parsing and fails to capture fine-grained, geometry-dependent features.
  • Rationale Quality and Hallucination: Vision integration via gated fusion reduces hallucination in rationale generation, but quality varies markedly with task structure; verbose rationales frequently misalign with correct inferential paths in open-domain tasks.
  • Numeric and Commonsense Reasoning: T5-based architectures, even when multimodal, struggle with arithmetic and external knowledge synthesis absent from their training corpus. Lack of explicit retrieval or multihop reasoning exacerbates this failure mode.
  • Training Resource Constraints: Heavy computational requirements impede practical deployment and extensive fine-tuning on niche datasets, limiting generalization.

Theoretical and Practical Implications

The study identifies critical obstacles to effective cross-domain generalization in MM-CoT frameworks. The inability to adapt stepwise rationales and vision-language fusion mechanisms to open-ended, knowledge-driven, or numerically grounded domains reduces the framework's utility in broader, real-world reasoning contexts. There is limited evidence of transfer from structured science QA to commonsense, chart-based, or consensus-reasoning settings.

Future research should prioritize:

  • Incorporation of domain-adaptive vision encoders (e.g., chart-specific or knowledge-grounded models).
  • Integration with retrieval-augmented generation for open-domain reasoning.
  • Numeric and symbolic rationale pretraining or bootstrapping strategies for chart and table-based inference.
  • Efficient, resource-conserving fine-tuning protocols applicable to low-resource deployment scenarios.

Conclusion

Empirical analysis demonstrates that Multimodal Chain-of-Thought frameworks exhibit constrained cross-domain generalization. The Amazon CoT paradigm, when evaluated on ChartQA, OK-VQA, and A-OKVQA, falls markedly short of its ScienceQA performance, with limitations traceable to vision-encoder mismatch, gaps in prompt and rationale engineering, and compute constraints. These findings underscore the need for domain-specialized multimodal architectures and prompt engineering methodologies to bridge the gap between structured scientific reasoning and open-domain intelligence.


Glossary

  • AdamW: An optimization algorithm that decouples weight decay from the gradient update to improve generalization in transformer training. "Optimization is performed using AdamW with a linear learning-rate warmup schedule."
  • A-OKVQA: A benchmark for open-ended VQA that requires world knowledge and includes richer annotations and rationales. "A-OKVQA~\cite{schwenk2022okvqa} extends OK-VQA by providing richer annotations and multiple valid rationales."
  • ablation studies: Systematic removal or modification of model components to analyze their individual contributions. "Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices."
  • attention mask: A binary or weighted mask that controls which tokens the model’s attention mechanism can focus on. "A corresponding extension of the attention mask ensures that the decoder attends to both visual and textual tokens."
  • beam search: A heuristic decoding algorithm that explores multiple candidate sequences to select high-probability outputs. "captions are generated using beam search, and stored in a mapping question_id → caption."
  • Bilinear Attention Networks (BAN): A VQA model architecture that uses bilinear pooling to capture interactions between visual and textual features. "Methods like Bottom-Up Top-Down attention~\cite{anderson2018bottom}, Bilinear Attention Networks (BAN)~\cite{kim2018bilinear}, and Multi-modal Compact Bilinear pooling (MCB)~\cite{fukui2016multimodal} demonstrated the importance of fine-grained vision-language interaction."
  • BLIP: A vision-language pretraining and captioning model used to generate image captions for multimodal tasks. "we generate them using the BLIP image captioning model:"
  • Bottom-Up Top-Down attention: An attention mechanism combining region-level visual features (bottom-up) with task-driven signals (top-down). "Methods like Bottom-Up Top-Down attention~\cite{anderson2018bottom}, Bilinear Attention Networks (BAN)~\cite{kim2018bilinear}, and Multi-modal Compact Bilinear pooling (MCB)~\cite{fukui2016multimodal} demonstrated the importance of fine-grained vision-language interaction."
  • Chain-of-Thought (CoT) prompting: A prompting strategy that elicits intermediate reasoning steps to improve problem solving. "Chain-of-Thought (CoT) prompting emerged as a breakthrough technique for eliciting complex reasoning from LLMs."
  • ChartQA: A dataset for question answering over charts requiring numerical and visual reasoning over plots. "ChartQA~\cite{chartqa} consists of bar charts, pie charts, and line plots paired with natural-language questions requiring numerical comparisons, trend understanding, and reasoning over visualized data."
  • CLIP: A vision-language model trained with contrastive objectives to align images and text. "Models like CLIP~\cite{radford2021learning}, ALIGN~\cite{jia2021scaling}, and BLIP~\cite{li2022blip} learn aligned vision-language representations through contrastive learning or image-text matching objectives."
  • compositional reasoning: Solving complex problems by decomposing them into simpler subproblems and composing solutions. "This hierarchical approach proves particularly effective for compositional reasoning tasks."
  • Consensus scoring: An evaluation method that awards partial credit based on agreement with multiple human annotations. "Consensus scoring: Uses OK-VQA's fractional VQA-style voting."
  • contrastive learning: A training paradigm that pulls aligned pairs (image-text) together and pushes non-matching pairs apart. "Models like CLIP~\cite{radford2021learning}, ALIGN~\cite{jia2021scaling}, and BLIP~\cite{li2022blip} learn aligned vision-language representations through contrastive learning or image-text matching objectives."
  • cosine decay: A learning rate schedule that decreases the rate following a cosine curve after a warmup period. "Cosine decay after warmup."
  • Exact Match (EM): A metric that checks if the normalized predicted answer exactly matches the normalized ground truth. "Exact Match (EM): String-normalized strict match."
  • F1-score: A harmonic mean of precision and recall computed over token-level overlaps between prediction and ground truth. "F1-score: Token-level overlap via precision/recall."
  • FLAN-T5-Base: A variant of the T5 model instruction-tuned to follow prompts, used as the backbone for reasoning. "We use the FLAN-T5-Base model as the backbone, consistent with the original paper."
  • gated attention mechanism: A fusion technique that modulates the contribution of image and text features via a learned gate during attention. "Image embeddings are fused with textual representations through the gated attention mechanism:"
  • gated fusion mechanism: A model component that integrates visual features with LLM representations using gates to control information flow. "integrates vision features through a gated fusion mechanism with T5-based LLMs."
  • gradient accumulation: A training technique that simulates larger batch sizes by summing gradients over multiple steps before updating weights. "Due to GPU memory constraints, we use a small batch size with gradient accumulation."
  • GPT-4V: A multimodal extension of GPT-4 that accepts visual inputs for integrated reasoning. "GPT-4V extends GPT-4's capabilities to visual inputs, while Gemini~\cite{reid2024gemini} provides native multimodal understanding."
  • Image-text matching objectives: Training losses that encourage correct pairing between images and their corresponding textual descriptions. "Models like CLIP~\cite{radford2021learning}, ALIGN~\cite{jia2021scaling}, and BLIP~\cite{li2022blip} learn aligned vision-language representations through contrastive learning or image-text matching objectives."
  • InstructBLIP: An instruction-tuned vision-language model enabling controlled multimodal reasoning. "Instruction-Tuned Multimodal Models: LLaVA~\cite{liu2023visual}, InstructBLIP~\cite{dai2023instructblip}, and LLaMA-Adapter~\cite{zhang2023llama} fine-tune large vision-LLMs on instruction-following data, enabling more flexible and controllable multimodal reasoning."
  • instruction-tuned multimodal models: Models fine-tuned with instruction-following data to improve controllability and task alignment across modalities. "Instruction-Tuned Multimodal Models: LLaVA~\cite{liu2023visual}, InstructBLIP~\cite{dai2023instructblip}, and LLaMA-Adapter~\cite{zhang2023llama} fine-tune large vision-LLMs on instruction-following data, enabling more flexible and controllable multimodal reasoning."
  • learning-rate warmup schedule: A strategy that starts training with a low learning rate and gradually increases it to stabilize optimization. "Optimization is performed using AdamW with a linear learning-rate warmup schedule."
  • Least-to-Most Prompting: A prompting method that decomposes a problem into ordered subproblems solved sequentially. "Least-to-Most Prompting: Zhou et al.~\cite{zhou2022least} proposed decomposing complex problems into simpler sub-problems solved sequentially, with each solution building on previous results."
  • majority voting: A technique that selects the most frequent answer from multiple annotations or sampled predictions. "Algorithm: Majority Voting"
  • Multimodal Chain-of-Thought (Multimodal-CoT): A framework that generates explicit reasoning steps across both text and images. "Zhang et al.~\cite{zhang2023multimodal} introduced Multimodal-CoT, a framework that incorporates both language and vision modalities into a two-stage reasoning process."
  • Numeric Accuracy: A tolerance-based metric that checks if a predicted number is within a small margin of the ground truth. "Numeric Accuracy: A tolerance-based numeric comparison:"
  • OK-VQA: A VQA dataset requiring external commonsense and world knowledge beyond what is visible in the image. "OK-VQA~\cite{marino2019ok} requires answering open-ended questions about natural images where the answer is not present in the image alone."
  • ScienceQA: A structured, multiple-choice multimodal QA benchmark focused on scientific reasoning. "Their approach achieved state-of-the-art performance on the ScienceQA benchmark~\cite{lu2022learn}"
  • self-consistency decoding: A decoding strategy that samples multiple reasoning paths and aggregates answers to improve reliability. "Wang et al.~\cite{wang2022self} introduced self-consistency decoding, which samples multiple reasoning paths and selects the most consistent answer through majority voting."
  • Semantic Similarity: An evaluation measure that compares meaning similarity between prediction and ground truth, often using embeddings. "Semantic Similarity: Using SentenceTransformer embeddings:"
  • SentenceTransformer embeddings: Vector representations produced by SentenceTransformer models for computing semantic similarity. "Using SentenceTransformer embeddings:"
  • ViT-L/32: A Vision Transformer variant that processes images by splitting them into 32×32 patches and using a large model configuration. "Following the MM-CoT architecture, vision features extracted from a ViT-L/32 encoder are first projected into the T5 embedding dimension (768) using a trainable linear projection layer."
  • vision-language pretraining: Joint pretraining of models on paired image-text data to learn aligned multimodal representations. "Vision-Language Pretraining: Models like CLIP~\cite{radford2021learning}, ALIGN~\cite{jia2021scaling}, and BLIP~\cite{li2022blip} learn aligned vision-language representations through contrastive learning or image-text matching objectives."
  • visual entailment: A task that treats image-question answering as determining whether visual evidence supports or contradicts textual statements. "Visual Entailment~\cite{xie2019visual} frames VQA as a reasoning task."
  • Visual Question Answering (VQA): A task where models answer questions about images combining computer vision and natural language understanding. "Visual Question Answering (VQA) requires models to answer natural language questions about images."
  • Vision–Language Fusion: Architectural mechanisms for integrating visual features with textual representations for joint reasoning. "Vision–Language Fusion"
  • zero-shot CoT reasoning: Eliciting chain-of-thought explanations without task-specific examples by using simple prompting cues. "Zero-Shot CoT: Kojima et al.~\cite{kojima2022large} showed that simply appending phrases like \"Let's think step by step\" to questions enables zero-shot CoT reasoning"
