MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Published 20 Jun 2025 in cs.CV, cs.AI, and cs.CL | (2506.17113v1)

Abstract: Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.


Summary

The paper introduces MEXA, a novel training-free framework that dynamically selects and aggregates pre-trained experts for scalable multimodal reasoning.
It leverages a multimodal LLM router to choose modality-specific experts and uses a long-context reasoning model to integrate outputs for precise answers.
Experiments report accuracy gains over state-of-the-art models on video, audio, 3D, and medical imaging benchmarks.


The paper presents MEXA (Multimodal Expert Aggregator), a training-free framework for scalable multimodal reasoning that aggregates multiple pre-trained expert models. It selects and combines experts dynamically, based on the input modality and the task-specific reasoning demands, and targets diverse domains such as video, audio, 3D, and medical imaging. The design is modular and interpretable, and it requires no additional training.

Framework Design

Dynamic Multi-Expert Aggregation

MEXA operates by dynamically coordinating a suite of specialized experts, each designed to handle distinct modality-task pairs. The approach involves:

Expert Selection: MEXA employs a Multimodal LLM (MLLM) as a router to select relevant expert models based on the input modality and the reasoning skills the task requires. Only the experts judged appropriate for a given input are activated, so the aggregator receives outputs that are relevant to the question.
Aggregation through a Large Reasoning Model (LRM): The outputs of the selected experts are aggregated by an LRM chosen for its long-context understanding and complex chain-of-thought (CoT) reasoning. This step integrates complementary outputs from different experts to produce the final answer.
Figure 1: Overview of the MEXA Architecture. Given the input task context and question, MEXA first employs an MLLM router to select the appropriate experts based on input modality and required reasoning skills. The aggregator then reasons over the outputs from the selected experts to generate the final answer.
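
The select-then-aggregate flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class `MexaSketch`, the callables `toy_router` and `toy_aggregator`, and the expert names are all hypothetical stand-ins. In MEXA the router is an MLLM and the aggregator is an LRM; here both are replaced by trivial rule-based functions so the control flow is runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: an expert maps (context, question) to a textual output.
Expert = Callable[[str, str], str]


@dataclass
class MexaSketch:
    """Toy select-then-aggregate pipeline (all names are assumptions)."""
    experts: Dict[str, Expert]
    # Stand-in for the MLLM router: returns names of experts to activate.
    router: Callable[[str, str, List[str]], List[str]]
    # Stand-in for the LRM aggregator: reasons over textual expert outputs.
    aggregator: Callable[[str, Dict[str, str]], str]

    def answer(self, context: str, question: str) -> str:
        names = self.router(context, question, sorted(self.experts))
        # Ignore any name the router returns that is not in the pool.
        selected = [n for n in names if n in self.experts]
        # Run only the selected experts; each one returns plain text.
        outputs = {n: self.experts[n](context, question) for n in selected}
        return self.aggregator(question, outputs)


def toy_router(context: str, question: str, available: List[str]) -> List[str]:
    """Rule-based placeholder for the router (file suffix and digits only)."""
    wanted = []
    if context.endswith(".wav"):
        wanted.append("audio_caption")
    if context.endswith(".mp4"):
        wanted.append("video_caption")
    if any(ch.isdigit() for ch in question):
        wanted.append("math")
    return [n for n in wanted if n in available]


def toy_aggregator(question: str, outputs: Dict[str, str]) -> str:
    """Placeholder for the aggregator: concatenates the tagged evidence."""
    evidence = "\n".join(f"[{name}] {text}" for name, text in outputs.items())
    return f"Question: {question}\n{evidence}"


EXPERTS: Dict[str, Expert] = {
    "audio_caption": lambda ctx, q: f"audio description of {ctx}",
    "video_caption": lambda ctx, q: f"video description of {ctx}",
    "math": lambda ctx, q: "arithmetic notes",
}

pipeline = MexaSketch(EXPERTS, toy_router, toy_aggregator)
result = pipeline.answer("clip.mp4", "How many objects appear after 3 seconds?")
print(result)
```

Keeping the router and aggregator as plain callables mirrors the plug-and-play idea: a model-backed component could replace either stub without changing `answer`.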

Expert Module Design Principles

MEXA organizes a diverse pool of expert models with two key design principles:

Task-Aware and Modality-Sensitive: The experts are categorized into perceptual, textual, structured, and mathematical types, with specific designs ensuring they can address mainstream multimodal challenges effectively. Expertise is further subdivided into fine-grained skills, enabling MEXA to support dynamic, task-specific adaptation.
Unified Textual Representation: All expert outputs are transformed into a shared textual format, facilitating seamless integration and reasoning by the LRM. This design choice enhances interpretability and the consistent handling of diverse modalities.
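
As a rough illustration of the two principles above, the sketch below tags each expert output with a category and a fine-grained skill, then flattens everything into one text prompt for the aggregator. The four category names come from the summary. The `ExpertOutput` record, the skill labels (`table_parsing`, `arithmetic`), and the prompt layout in `to_prompt` are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Category(Enum):
    # The four expert categories named in the summary.
    PERCEPTUAL = "perceptual"
    TEXTUAL = "textual"
    STRUCTURED = "structured"
    MATHEMATICAL = "mathematical"


@dataclass(frozen=True)
class ExpertOutput:
    category: Category
    skill: str  # hypothetical fine-grained skill label
    text: str   # the expert's output, already rendered as text


def to_prompt(question: str, outputs: List[ExpertOutput]) -> str:
    """Flatten heterogeneous expert outputs into one shared textual format."""
    lines = [f"Question: {question}", "Expert evidence:"]
    for i, out in enumerate(outputs, start=1):
        lines.append(f"{i}. ({out.category.value}/{out.skill}) {out.text.strip()}")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = to_prompt(
    "What is the total?",
    [
        ExpertOutput(Category.STRUCTURED, "table_parsing", "Row A = 2, Row B = 3 "),
        ExpertOutput(Category.MATHEMATICAL, "arithmetic", "2 + 3 = 5"),
    ],
)
print(prompt)
```

Because every expert emits text, the aggregator needs no modality-specific interface, and the intermediate evidence stays readable for inspection.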

Experimental Evaluation

Benchmarks and Metrics

MEXA's efficacy is evaluated across multiple multimodal benchmarks: Video Reasoning (Video-MMMU), Audio QA (MMAU), 3D Understanding (SQA3D), and Medical QA (M3D). The framework consistently demonstrates superior performance over strong baselines, including state-of-the-art multimodal models.

Figure 2: Expert distributions selected by MEXA across different benchmarks, covering video (Video-MMMU), audio (MMAU), 3D (SQA3D), and medical imaging (M3D).

Quantitative Results

Experimental results show notable performance improvements:

Video-MMMU: MEXA achieved accuracy gains of up to 6% over GPT-4o and performed on par with human experts in complex video reasoning tasks.
MMAU: MEXA significantly outperformed audio LLMs, with an average accuracy improvement of 15.6%.
SQA3D: MEXA achieved robust situated reasoning, with clear advantages over unified 3D models.
M3D: MEXA demonstrated superior 3D medical image analysis, with an improvement of 8.7% over GPT-4o.

Taken together, these benchmark results support the flexibility and robustness of the expert-driven aggregation approach.

Conclusion and Implications

In conclusion, MEXA provides a flexible and interpretable framework for complex cross-modal reasoning without the need for expensive training. Because expert modules are plug-and-play, the pool can be extended to emerging modalities, which suggests avenues for future research on general multimodal reasoning frameworks.

The adaptability of MEXA hints at its possible integration into real-world applications, providing a robust solution for tasks requiring complex interaction between varied data types. The use of pre-trained experts and state-of-the-art LLMs positions MEXA as a promising blueprint for future architectures seeking to generalize across increasingly complex multimodal landscapes.
