
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Published 2 Jun 2025 in cs.CV (arXiv:2506.01853v1)

Abstract: Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal LLMs. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D LLM capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-VL-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni

Summary

  • The paper introduces a native multimodal LLM that employs a 3D vector-quantized VAE to efficiently represent and reconstruct 3D objects.
  • It shows that ShapeLLM-Omni outperforms existing models in text-to-3D and image-to-3D tasks while maintaining strong linguistic abilities.
  • The model paves the way for practical applications in robotics, digital twins, and interactive 3D content creation, setting a foundation for future research.

Overview of ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

The paper presents ShapeLLM-Omni, a native multimodal LLM designed to understand and generate 3D assets alongside textual content, thereby bridging an existing gap in multimodal LLM capabilities. Unlike previous models such as GPT-4o, which are restricted to text and image modalities, ShapeLLM-Omni extends these capabilities to include a 3D modality, fostering advancements in areas such as robotics, digital twins, and virtual environments.

Methodology

ShapeLLM-Omni's architecture incorporates a 3D vector-quantized variational autoencoder (VQVAE) to map 3D shapes into a discrete latent space. This allows efficient representation and reconstruction of 3D objects, analogous to language modeling. The authors constructed a comprehensive training dataset, 3D-Alpaca, which includes tasks such as 3D generation, understanding, and editing. By integrating 3D-aware discrete tokens, ShapeLLM-Omni can effectively leverage a next-token prediction paradigm for tasks across different modalities.
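The key idea is that the VQVAE turns a continuous 3D latent into discrete token IDs the LLM can predict autoregressively. The quantization step itself can be sketched in a few lines; the codebook size, latent dimension, and values below are toy illustrations, not the paper's actual configuration:

```python
# Minimal sketch of VQVAE-style vector quantization (toy codebook;
# the paper's actual encoder, codebook size, and dimensions differ).

def quantize(latent, codebook):
    """Return the index of the codebook entry nearest to `latent`
    (squared Euclidean distance). That index is the discrete token
    the LLM later predicts with next-token prediction."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))

# Toy codebook: 4 learned entries of dimension 3.
codebook = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
token = quantize([0.9, 0.1, 0.0], codebook)  # nearest entry is index 1
```

Reconstruction runs the other way: the decoder maps the selected codebook vectors back to a 3D shape, which is what makes the discrete tokens a lossless-enough interface between the shape domain and the language model.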

Data and Training

The model's backbone is Qwen-2.5-VL-7B-Instruct, a pre-trained multimodal LLM with image-understanding abilities whose visual encoder remains fixed during training. The training corpus for ShapeLLM-Omni comprises 3.46 billion tokens, covering tasks such as text-to-3D and image-to-3D generation. In addition to structured 3D data, the UltraChat text-only dataset is incorporated to preserve the model's general conversational capabilities.
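Because every modality is reduced to tokens, a training example is just one interleaved sequence. A hypothetical sketch of that interleaving follows; the sentinel names (`<3d_start>`, `<3d_end>`) and the shape-token format are illustrative assumptions, not the paper's exact serialization:

```python
# Hypothetical interleaving of text and quantized 3D tokens into one
# stream for standard next-token-prediction training. Sentinel token
# names are illustrative, not taken from the paper.

def build_sequence(prompt_tokens, shape_token_ids):
    """Concatenate a text prompt and a VQVAE-tokenized 3D asset into a
    single token sequence, delimiting the 3D span with sentinels."""
    shape_span = [f"<shape_{i}>" for i in shape_token_ids]
    return prompt_tokens + ["<3d_start>"] + shape_span + ["<3d_end>"]

seq = build_sequence(["Generate", "a", "chair", ":"], [17, 3, 42])
```

With this layout, text-to-3D, image-to-3D, 3D understanding, and editing all reduce to the same objective: predict the next token, whatever modality it belongs to.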

Experimental Findings

Quantitative evaluations demonstrate ShapeLLM-Omni's effectiveness in extending LLM capabilities to 3D content while preserving linguistic skills. On tasks of 3D generation, ShapeLLM-Omni outperformed SAR3D, CRM, and 3DTopia-XL, closely matching Trellis in image-to-3D generation despite architectural differences. In text-to-3D generation, it showed superior semantic alignment with reference images generated from input text prompts. The model maintained strong performance on linguistic metrics such as SIQA, PIQA, and MMLU—comparable to other leading multimodal LLMs.

Limitations and Future Work

The 3D-editing subset of the dataset is relatively sparse, highlighting the need for more comprehensive editing data to strengthen this aspect of ShapeLLM-Omni's capabilities. Additionally, the current 7-billion-parameter instantiation is small compared with the models needed to approach GPT-4o-level multimodal performance. Future iterations could scale up the parameter count and integrate more diverse datasets for enhanced training.

Implications

The introduction of ShapeLLM-Omni is a significant stride toward a unified multimodal LLM capable of handling complex 3D data. Potential applications span practical domains including interactive 3D content creation, user-guided asset design, and enhanced spatial reasoning for robotics. Furthermore, this work lays a foundation for subsequent research on refining 3D-native capabilities within AI models.
