BrepLLM Framework Overview
- BrepLLM is a framework that enables native parsing of complex Boundary Representation data by integrating geometric, topological, and linguistic modeling.
- It employs adaptive UV sampling and hierarchical encoding with dual-tower contrastive pre-training to significantly improve 3D classification and captioning benchmarks.
- Multi-stage LLM fine-tuning, featuring a Geometry-to-Vision bridge and Mixture-of-Query Experts, facilitates robust instruction tuning on the novel Brep2Text dataset.
Boundary Representation (Brep) models provide precise encoding of 3D geometry and topology in engineering and CAD, but the complexity and structure of Breps have made them challenging to integrate natively with LLMs. BrepLLM is the first framework to enable LLMs to parse and reason directly over raw Brep data, bridging the modality gap between structured 3D geometry and natural language via joint geometric, topological, and linguistic modeling. Leveraging a two-stage pipeline—cross-modal alignment pre-training and multi-stage LLM fine-tuning—BrepLLM achieves state-of-the-art performance on industrial 3D classification and captioning benchmarks, and establishes the first large-scale Brep instruction-tuning dataset (Deng et al., 18 Dec 2025).
1. Native Brep Graph Construction and Feature Representation
BrepLLM begins with an adaptive UV-sampling scheme that converts Boundary Representation data into a graph structure incorporating both geometry and topology. For a Brep with face set $F$ and edge set $E$, the process is as follows:
- Graph Nodes and Edges: Each face $f \in F$ becomes a node; adjacency is established by connecting nodes whose faces share a boundary edge in $E$.
- Adaptive UV Sampling: For each parametric face $f$, the UV sampling density is set adaptively as a function of the face area $A_f$, so larger faces receive proportionally more sample points. Each sampled point is mapped to a 10D feature vector comprising 3D position, surface normal, mean curvature, a visibility mask, a face-type label, and the normalized face area (3 + 3 + 1 + 1 + 1 + 1 dimensions).
- Edge Sampling: Each edge $e \in E$ is sampled with a density governed by its metric size, yielding 8D point features: 3D position, unit tangent, an edge-type label, and the normalized edge length (3 + 3 + 1 + 1 dimensions).
This sampling ensures that both fine and coarse geometric structures are represented proportionally to their metric importance within the Brep.
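The area-proportional sampling above can be sketched as follows. The exact density formula is not reproduced here, so the square-root scaling, the constants, and all function names are our illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def adaptive_uv_sample(face_area, base=16, scale=64.0, max_res=1024):
    """Illustrative area-adaptive UV grid: larger faces get a finer grid.
    The sqrt scaling and constants are assumptions, not the paper's formula."""
    n = int(np.clip(base + np.sqrt(face_area) * scale, base, max_res))
    u = np.linspace(0.0, 1.0, n)
    v = np.linspace(0.0, 1.0, n)
    uu, vv = np.meshgrid(u, v)
    return np.stack([uu.ravel(), vv.ravel()], axis=-1)  # (n*n, 2) UV coordinates

def face_point_features(xyz, normal, curvature, visible, face_type, norm_area):
    """Pack one sampled surface point into the 10D layout described above:
    position (3) + normal (3) + curvature + visibility + face type + area."""
    return np.concatenate([xyz, normal, [curvature, visible, face_type, norm_area]])

uv = adaptive_uv_sample(face_area=0.04)   # a small face yields a coarse grid
feat = face_point_features(np.zeros(3), np.array([0.0, 0.0, 1.0]), 0.1, 1.0, 2.0, 0.3)
```

A larger `face_area` produces strictly more UV samples, which is the property the closing sentence of this section relies on.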
2. Hierarchical BrepEncoder Architecture
The sampled Brep data are processed by a hierarchical BrepEncoder, comprising three parallel feature extraction branches for each face:
- Fine-Grained Face Features: PointTransformerV3 is applied to the per-face sampled point attributes to yield fine-grained geometric features for each face.
- Edge-Conditioned Face Features: An Edge Encoder processes sampled edge attributes, propagating them onto incident faces via an NNConv mechanism to yield edge-conditioned face features.
- Global Topology Features: 2D/1D CNNs embed faces/edges, and two Edge-conditioned Graph Attention (EGATConv) layers then produce globally topology-aware face features.
All three branch outputs are concatenated to yield a per-face node token, and a global token is produced by global-attention pooling over the node tokens. The result is a variable-length node-token sequence and a single global token for downstream processing.
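The fusion step above, concatenation of the three branch outputs followed by global-attention pooling, can be sketched as follows. The branch tensors are random stand-ins and all dimensions are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
F, d = 5, 8  # number of faces and per-branch feature dim (illustrative sizes)

# Stand-ins for the three branch outputs: fine-grained, edge-conditioned, topology
h_fine = rng.normal(size=(F, d))
h_edge = rng.normal(size=(F, d))
h_topo = rng.normal(size=(F, d))

# Per-face node tokens: concatenation of the three branches
node_tokens = np.concatenate([h_fine, h_edge, h_topo], axis=-1)  # (F, 3d)

# Global-attention pooling: a learned score per node, softmax-weighted sum
w = rng.normal(size=(3 * d,))            # stand-in for a learned scoring vector
scores = node_tokens @ w                 # (F,) one scalar score per face
attn = np.exp(scores - scores.max())
attn /= attn.sum()                       # attention weights sum to 1
global_token = attn @ node_tokens        # (3d,) pooled global token
```

This yields exactly the two artifacts the section names: a variable-length `node_tokens` sequence and one `global_token`.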
3. Cross-Modal Pre-training with Contrastive Alignment
To align the structured Brep modality with natural language, BrepLLM employs dual-tower contrastive pre-training, analogous to CLIP:
- Geometry Tower: The BrepEncoder's global token is projected to a fixed-dimensional embedding shared with the text tower.
- Text Tower: A frozen CLIP text encoder (ViT-L/14) produces the corresponding text embeddings.
- InfoNCE Loss: Cosine similarities between geometry and text embeddings across a batch of $N$ pairs are temperature-scaled and softmax-normalized in both directions, producing the symmetric loss
$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ji}/\tau)}\right],$$
where $s_{ij}$ is the cosine similarity between the $i$-th geometry embedding and the $j$-th text embedding, and $\tau$ is the temperature.
This symmetric loss encourages matched Brep-text pairs to be close in embedding space, while separating mismatched pairs. The BrepEncoder is thereby trained to produce representations semantically aligned with natural language descriptions.
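A minimal NumPy sketch of this CLIP-style symmetric InfoNCE loss follows; the demo batch and the fixed temperature value are our assumptions:

```python
import numpy as np

def log_softmax(s):
    s = s - s.max(axis=1, keepdims=True)          # numerical stability
    return s - np.log(np.exp(s).sum(axis=1, keepdims=True))

def symmetric_info_nce(geo, txt, tau=0.07):
    """CLIP-style symmetric InfoNCE over a batch of paired embeddings."""
    geo = geo / np.linalg.norm(geo, axis=1, keepdims=True)   # L2-normalise
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = geo @ txt.T / tau                       # (N, N) scaled cosine similarities
    diag = np.arange(sim.shape[0])
    loss_g2t = -log_softmax(sim)[diag, diag].mean()    # geometry -> text direction
    loss_t2g = -log_softmax(sim.T)[diag, diag].mean()  # text -> geometry direction
    return 0.5 * (loss_g2t + loss_t2g)

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 16))
loss_matched = symmetric_info_nce(emb, emb)                      # aligned pairs
loss_random = symmetric_info_nce(emb, rng.normal(size=(4, 16)))  # mismatched pairs
```

As the text states, matched pairs pull the loss toward zero while mismatched pairs keep it high, so `loss_matched` comes out well below `loss_random`.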
4. Multi-Stage LLM Fine-Tuning
Following cross-modal alignment, BrepLLM undergoes a three-tiered fine-tuning regimen to integrate Brep encoding with a text-generative LLM:
Stage I: Geometry-to-Vision Bridging
- The frozen BrepEncoder generates node token sequences, which are projected via a two-layer MLP to match the Q-Former embedding dimension.
- A BLIP-2-style Q-Former with 32 learnable queries aggregates node embeddings, linearly projecting the output into the LLM's input space.
- Only the projection MLP is trained; all other modules remain frozen.
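Stage I's bridge, an MLP projection of node tokens followed by cross-attention from a fixed set of learnable queries, can be sketched as below. All dimensions except the 32 queries are illustrative, and the single-head attention step is a simplification of the BLIP-2 Q-Former:

```python
import numpy as np

rng = np.random.default_rng(0)
F, d_node, d_q, n_query = 6, 24, 16, 32  # face count and dims (illustrative, except 32 queries)

node_tokens = rng.normal(size=(F, d_node))      # frozen BrepEncoder output (stand-in)

# Two-layer MLP projecting node tokens into the Q-Former embedding space
W1 = rng.normal(size=(d_node, 32))
W2 = rng.normal(size=(32, d_q))
proj = np.maximum(node_tokens @ W1, 0.0) @ W2   # (F, d_q), ReLU in between

# One cross-attention step: 32 learnable queries attend over the projected nodes
queries = rng.normal(size=(n_query, d_q))
attn = queries @ proj.T / np.sqrt(d_q)          # (32, F) scaled dot-product scores
attn = np.exp(attn - attn.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
q_out = attn @ proj                             # (32, d_q); a linear layer would then map to the LLM dim
```

Only `W1`/`W2` (the projection MLP) would be trainable in this stage; everything else stays frozen.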
Stage II: 3D–Language Alignment Fine-Tuning
- LoRA adapters are applied to the projection MLP, selected Q-Former sublayers, and a subset of LLM layers; BrepEncoder remains frozen.
- The standard autoregressive language-modeling objective is used, further aligning the Brep representations with text output.
Stage III: Mixture-of-Query Experts (MQE)
- Introduces lightweight residual query experts and a sparse router that selects the top-$k$ experts based on the aggregated node tokens.
- The final Q-Former query set adds the sparsely routed residual expert queries to the base queries; only the residual experts and the router are updated during training.
- MQE is ablated to confirm optimal placement and effectiveness in Stage III.
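A minimal sketch of the MQE mechanism, top-$k$ routing over residual query experts added to the base queries, is below. The routing function, gating scheme, and all sizes except the 32 queries are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_query, d_q, n_expert, k = 32, 16, 4, 2   # sizes are illustrative, except 32 queries

base_queries = rng.normal(size=(n_query, d_q))
# Lightweight residual experts: each contributes a small additive query offset
experts = rng.normal(size=(n_expert, n_query, d_q)) * 0.1

def route(summary, router_w, k):
    """Sparse top-k routing from an aggregated node-token summary (our sketch)."""
    logits = router_w @ summary                 # (n_expert,) one score per expert
    top = np.argsort(logits)[-k:]               # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # softmax gates over selected experts
    return top, gates

summary = rng.normal(size=(d_q,))               # stand-in for aggregated node tokens
router_w = rng.normal(size=(n_expert, d_q))
top, gates = route(summary, router_w, k)

# Final query set: base queries plus the gated residual expert queries
final_queries = base_queries + np.tensordot(gates, experts[top], axes=1)
```

Only `experts` and `router_w` would receive gradients in Stage III; the base queries and the rest of the pipeline stay as trained in earlier stages.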
This curriculum transfers vision–language priors and incorporates geometric diversity for robust Brep understanding and text generation.
5. Brep2Text Dataset for Instruction Tuning
BrepLLM is trained and evaluated on Brep2Text, the first large-scale Brep–language question–answer (QA) dataset:
- Construction: Based on 134,722 industrial Brep models from Text2CAD, with two question tiers automatically generated per model using Qwen-Max: high-level semantic questions and procedural modeling questions.
- Scale: Yields 269,444 QA pairs in total, with 200 Breps and 400 QA pairs held out as a test set.
- Quality Control: Automatic filtering for coherence and spot-checks for correctness.
Brep2Text enables direct instruction-tuning and evaluation of models on native Brep input rather than point-cloud or mesh surrogates.
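The two-tier structure described above might be organized per model roughly as follows. The field names, identifiers, and both sample QA texts are purely illustrative inventions, not the paper's actual schema or data:

```python
# Hypothetical Brep2Text-style record (schema and contents are our illustration only)
record = {
    "brep_id": "model_000001",          # made-up identifier
    "qa": [
        {"tier": "semantic",            # tier 1: high-level semantic question
         "question": "What kind of part does this model represent?",
         "answer": "A flanged cylindrical bracket with four bolt holes."},
        {"tier": "procedural",          # tier 2: procedural modeling question
         "question": "How could this part be modeled step by step?",
         "answer": "Sketch a circle, extrude it, add a flange, then pattern four holes."},
    ],
}
```

Two QA pairs per model matches the stated scale: 134,722 models × 2 questions = 269,444 QA pairs.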
6. Experimental Evaluation and Ablation
BrepLLM achieves state-of-the-art results on both object captioning and generative classification:
| Metric | BrepLLM (Brep, 2.9B) | MiniGPT-3D (point cloud, 7–13B) | Improvement |
|---|---|---|---|
| Qwen-Max (captioning) | 58.89 | 56.58 | +2.31 |
| SBERT similarity | 73.05 | 71.64 | +1.41 |
| SimCSE similarity | 74.46 | 73.13 | +1.33 |
| Human precision (caption) | 81.85% | — | — |
| Classification avg (%) | 57.05 | 54.90 | +2.15 |
Ablations reveal:
- Adaptive UV sampling yields +2.05% lift in Stage I, +0.64% end-to-end.
- Hierarchical features add +2.42% to +2.87% accuracy.
- Full three-stage fine-tuning curriculum is optimal (57.05% classification accuracy).
- MQE's effectiveness is maximized when introduced only at Stage III.
7. Distinction from Related Work and Significance
BrepLLM's native processing of Brep data contrasts with prior point-cloud or mesh-based pipelines, enabling fine-grained geometric and topological reasoning unavailable to indirect surrogates. The introduction of hierarchical geometric–topological encoding, modality-aligned contrastive learning, multi-stage LLM integration, and the Brep2Text dataset marks the first instance of end-to-end instruction tuning for native Brep understanding, setting a new performance baseline in 3D CAD reasoning and captioning (Deng et al., 18 Dec 2025).