Papers
Topics
Authors
Recent
Search
2000 character limit reached

SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation

Published 31 Mar 2025 in cs.MM | (2503.24164v1)

Abstract: Large vision and LLMs show strong performance in tasks like image captioning, visual question answering, and retrieval. However, challenges remain in integrating speech, text, and vision into a unified model, especially for spoken tasks. Speech generation methods vary (some produce speech directly), others through text (but their impact on quality is unclear). Evaluation often relies on automatic speech recognition, which may introduce bias. We propose SVLA, a unified speech vision LLM based on a transformer architecture that handles multimodal inputs and outputs. We train it on 38.2 million speech text image examples, including 64.1 hours of synthetic speech. We also introduce Speech VQA Accuracy, a new metric for evaluating spoken responses. SVLA improves multimodal understanding and generation by better combining speech, vision, and language.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.