Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding

Published 14 Dec 2025 in cs.CV and cs.AI | (2512.12822v1)

Abstract: Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.

Summary

  • The paper introduces a unified model that eliminates modality-specific encoders by fusing 3D patches and language tokens into a single transformer.
  • It employs a progressive three-stage curriculum—from object recognition to scene-level reasoning—achieving state-of-the-art spatial QA improvements.
  • The unified approach enhances parameter efficiency and computational speed, making it ideal for practical 3D spatial applications and robotics.

Lemon: A Unified and Scalable Architecture for 3D Multimodal Spatial Intelligence

Introduction

The Lemon model presents a unified approach for scalable multimodal spatial understanding using 3D point cloud data and natural language. Conventional 3D large multimodal models (LMMs) rely on fragmented pipelines with separate modality-specific encoders followed by cross-modal alignment. These approaches are limited by both representational bottlenecks and restricted generalization due to the scale and heterogeneity of 3D data representations. Lemon distinguishes itself as the first architecture that eliminates modality-specific encoders by embedding both 3D point cloud patches and language tokens into a shared representation processed by a single transformer. This paradigm shift enables seamless token-level spatial-linguistic fusion, significant parameter efficiency, and superior model scaling characteristics.

Unified Model Architecture and Patchification Strategy

Lemon’s architecture processes 3D spatial data and textual data within a single transformer model. Point clouds are hierarchically partitioned into patches using a Z→Y→X recursive strategy, creating balanced and standardized spatial segments within the data. These 3D patches are then mapped to the model’s embedding space via a learnable linear projector and interleaved with specialized modality and separator tokens (e.g., <pointcloud>, <layer_sep>, <row_sep>, <point_patch>) to explicitly reflect spatial hierarchy. This tokenization ensures structural locality and efficient 3D-to-language alignment, and the spatial token order matches natural scene hierarchies frequently encountered in indoor settings, enhancing compatibility with LLMs’ modeling biases.
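The recursive Z→Y→X partitioning can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the split counts (`layers`, `rows`, `cols`) are placeholder values, and real patches would subsequently be interleaved with the separator tokens described above.

```python
import random

def patchify(points, layers=2, rows=2, cols=2):
    """Recursively partition a point cloud along Z, then Y, then X.

    Returns a nested [layer][row] list of patches (lists of points),
    mirroring the <layer_sep>/<row_sep> token hierarchy. Split counts
    here are illustrative assumptions, not values from the paper.
    """
    def split(pts, axis, k):
        pts = sorted(pts, key=lambda p: p[axis])
        n = len(pts)
        return [pts[i * n // k:(i + 1) * n // k] for i in range(k)]

    return [[split(row, 0, cols)                 # X within each row
             for row in split(layer, 1, rows)]   # Y within each layer
            for layer in split(points, 2, layers)]  # Z first

# Toy example: 16 random points -> 2 layers x 2 rows x 2 patches each
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(16)]
patches = patchify(cloud)
```

Sorting before each split keeps every patch spatially contiguous along the axis being divided, which is what lets the flattened token order track the layer/row hierarchy.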

Unlike existing models that use external 3D encoders (for example, PointNet++ or ReCon++) with frozen or separately trained parameters, Lemon's design directly projects raw point cloud data to token embeddings, enabling fully integrated, end-to-end optimization. Ablation studies demonstrate that incorporating external encoders not only fails to improve performance but may also introduce training instability and representational bottlenecks, validating the architectural decision to forgo separate 3D backbones.
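The encoder-free path amounts to a single learnable linear map from a flattened patch of raw coordinates to a token embedding. A minimal sketch, with an assumed fixed patch size and a toy embedding width (real models would use the LLM's hidden size and a trained weight matrix):

```python
import random

random.seed(1)
EMBED_DIM = 8          # illustrative; a real model uses the LLM's hidden size
POINTS_PER_PATCH = 4   # assumed fixed patch size after sampling

# Weights of the linear projector (random init stands in for learned values).
in_dim = 3 * POINTS_PER_PATCH
W = [[random.gauss(0, 0.02) for _ in range(in_dim)] for _ in range(EMBED_DIM)]
b = [0.0] * EMBED_DIM

def project_patch(patch):
    """Map one patch of (x, y, z) points directly to a token embedding,
    with no pretrained 3D encoder in between."""
    x = [c for pt in patch for c in pt]            # flatten to length 3k
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

patch = [(random.random(), random.random(), random.random())
         for _ in range(POINTS_PER_PATCH)]
token = project_patch(patch)
```

Because the projector is just another trainable layer of the transformer, its parameters are optimized end to end with the language modeling objective rather than frozen like an external 3D backbone.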

Progressive Training Curriculum and Optimization

Lemon optimizes 3D spatial intelligence using a three-stage curriculum, systematically building capacity from primitive geometric understanding to complex scene reasoning:

  1. Object-Level Recognition: Large-scale pretraining on diverse 3D object datasets (e.g., Objaverse) establishes foundational spatial and categorical knowledge, aligning specialized 3D tokens with semantic labels.
  2. Object Captioning and Grounding: The model advances to natural language articulation of 3D objects using curated datasets with rich textual annotations and grounding cues (e.g., Cap3D, GAPartNet).
  3. Scene-Level Spatial Reasoning: Instruction-tuning on complex spatial QA datasets (e.g., 3D-GRAND) enables scene-centric reasoning, including spatial relation estimation, embodied interaction (e.g., navigability analysis), and collision assessment.
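The staged schedule above can be expressed as a simple sequential driver. The dataset names follow the paper, but the configuration keys and the `train_stage` callable are hypothetical scaffolding for illustration:

```python
# Illustrative three-stage schedule; dataset names are from the paper,
# the config structure itself is an assumption.
CURRICULUM = [
    {"stage": 1, "task": "object_recognition",    "datasets": ["Objaverse"]},
    {"stage": 2, "task": "captioning_grounding",  "datasets": ["Cap3D", "GAPartNet"]},
    {"stage": 3, "task": "scene_spatial_qa",      "datasets": ["3D-GRAND"]},
]

def run_curriculum(train_stage, curriculum=CURRICULUM):
    """Run the stages strictly in order (the paper's ablations favor this
    over mixed-task training). `train_stage` is any callable that trains
    on one stage's data and returns a checkpoint tag to resume from."""
    ckpt = None
    for stage in curriculum:
        ckpt = train_stage(stage, resume_from=ckpt)
    return ckpt

# Dry run that just records the stage order instead of training
order = []
final = run_curriculum(lambda s, resume_from: order.append(s["task"]) or s["stage"])
```

The key property the ablations isolate is the strict ordering: each stage resumes from the previous stage's checkpoint instead of sampling all tasks jointly.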

Empirical ablations on curriculum ordering and mixing reveal that a progressive, phase-disentangled object-then-scene curriculum consistently outperforms joint or mixed-task training regimens, with object-level pretraining proving critical for downstream scene reasoning performance.

Performance and Scaling Behavior

Lemon establishes new state-of-the-art results across a comprehensive 3D multimodal evaluation suite:

  • Object Recognition and Captioning: Lemon outperforms all prior point-cloud-based LMMs and achieves parity with closed-source 2D state-of-the-art models such as GPT-4V on semantic correspondence and open-ended captioning (evaluated using LLM-as-judge and modern embedding-based similarity metrics), while exhibiting stronger robustness on challenging, noisy, or sparse 3D data.
  • Scene-Level Reasoning: On spatial QA tasks from 3D-GRAND, Lemon exhibits up to 8.9% improvement in binary accuracy and 7.7% gain in generative QA over competing baselines. Its fine-grained spatial comprehension reduces hallucinations typical in 2D VLMs operating on multi-view projections, highlighting the benefit of direct 3D access.
  • Visual Grounding: Without access to large-scale grounding data during pretraining, Lemon nevertheless matches specialized visual grounding models on ScanRefer and maintains competitive localization accuracy relative to domain-specific baselines.
  • Scalability: Analysis of scaling laws with increased model and data size reveals predictable, power-law improvements in task performance, substantiating the suitability of Lemon’s unified architectural paradigm for scaling to larger datasets and models as 3D data availability grows.
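A power-law scaling trend of the kind described above is typically fit by linear regression in log-log space. The sketch below uses synthetic data for illustration; the exponent and coefficient are made up and carry no relation to Lemon's actual measurements:

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss ~ a * size^(-alpha) by ordinary least squares in log-log
    space; returns (a, alpha)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope

# Synthetic points that follow loss = 10 * N^-0.3 exactly
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10 * n ** -0.3 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
```

Because the data here lie exactly on a power law, the fit recovers the generating exponent; with real measurements, the quality of the straight-line fit in log-log space is what substantiates a "predictable" scaling claim.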

Architectural and Practical Implications

Lemon’s encoder-free design streamlines the pipeline, reducing both parameter redundancy and inference latency. Its computational efficiency results from minimizing visual preprocessing—dynamic hierarchical patchification and local Farthest Point Sampling are the primary overhead, with most compute concentrated in the transformer backbone. Compared to competing approaches that require loading large external 3D encoders alongside the LLM, Lemon’s unified parameterization delivers superior performance per FLOP and parameter, which is critical for practical deployment in resource-constrained settings (e.g., robotic agents).
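Farthest Point Sampling, one of the two preprocessing steps named above, is a standard greedy algorithm: repeatedly pick the point farthest from everything chosen so far. A minimal sketch (the starting index and squared-Euclidean metric are conventional choices, not specified by the paper):

```python
def farthest_point_sampling(points, k):
    """Greedy FPS: iteratively select the point with the largest distance
    to the already-selected set, keeping coverage roughly uniform."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    chosen = [0]                      # conventionally seed with point 0
    dist = [d2(points[0], p) for p in points]
    for _ in range(k - 1):
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        # Each point keeps its distance to the *nearest* chosen point
        dist = [min(dist[i], d2(points[nxt], p)) for i, p in enumerate(points)]
    return [points[i] for i in chosen]

# Four corners of a unit square: FPS with k=2 picks opposite corners
square = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
sampled = farthest_point_sampling(square, 2)
```

The naive version above is O(n·k); this per-patch cost is the kind of lightweight preprocessing the paper contrasts with running a full external 3D encoder.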

The model’s direct 3D tokenization strategy also ensures that spatial structure is preserved without the discretization or geometric information loss often observed in projection-based or naively patchified systems. Its robustness under density variation and sensor-imposed noise positions Lemon as a generalist 3D LMM suitable for real-world spatial AI and robotics applications.

Limitations and Future Directions

The principal limitations remain the inherent scarcity and diversity gap of large-scale paired 3D-language datasets compared to 2D, and the computational expense of large-scale 3D model training and inference. Patchification may introduce discretization artifacts that affect fine-grained tasks. Future research directions suggested by the authors include the development of more scalable 3D dataset collection mechanisms, further architectural refinement for ultra-fine grounding capabilities, cross-modal alignment enhancement, and integration for closed-loop embodied robotics.

Conclusion

Lemon represents a major technical step towards unified, scalable 3D multimodal learning. By tightly integrating structured 3D spatial data with language at the token level in a single transformer and adopting a principled progressive curriculum, Lemon achieves significant advances over fragmented, architecturally heterogeneous baselines. Its strong empirical performance, parameter efficiency, and robust scaling properties make it a solid foundation for further advances in 3D spatial intelligence, opening new opportunities for embodied AI, scene-centric reasoning, robotics, and practical multimodal systems.


Citation: "Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding" (2512.12822)
