BrepLLM Framework Overview
- BrepLLM is a framework that enables native parsing of complex Boundary Representation data by integrating geometric, topological, and linguistic modeling.
- It employs adaptive UV sampling and hierarchical encoding with dual-tower contrastive pre-training to significantly improve 3D classification and captioning benchmarks.
- Multi-stage LLM fine-tuning, featuring a Geometry-to-Vision bridge and Mixture-of-Query Experts, facilitates robust instruction tuning on the novel Brep2Text dataset.
Boundary Representation (Brep) models provide precise encoding of 3D geometry and topology in engineering and CAD, but the complexity and structure of Breps have made them challenging to integrate natively with LLMs. BrepLLM is the first framework to enable LLMs to parse and reason directly over raw Brep data, bridging the modality gap between structured 3D geometry and natural language via joint geometric, topological, and linguistic modeling. Leveraging a two-stage pipeline—cross-modal alignment pre-training and multi-stage LLM fine-tuning—BrepLLM achieves state-of-the-art performance on industrial 3D classification and captioning benchmarks, and establishes the first large-scale Brep instruction-tuning dataset (Deng et al., 18 Dec 2025).
1. Native Brep Graph Construction and Feature Representation
BrepLLM begins with an adaptive UV-sampling scheme that converts Boundary Representation data into a graph structure incorporating both geometry and topology. For a Brep with face set $F$ and edge set $E$, the process is as follows:
- Graph Nodes and Edges: Each face $f \in F$ becomes a node; adjacency is established by connecting nodes whose faces share a boundary edge in $E$.
- Adaptive UV Sampling: For each parametric face $f$, the UV sampling density is set adaptively as a function of the face area $A_f$, so larger faces receive proportionally more sample points. Each sampled point is mapped to a 10D feature vector comprising 3D position, surface normal, mean curvature, a visibility mask, a face-type label, and the normalized face area (3 + 3 + 1 + 1 + 1 + 1 dimensions).
- Edge Sampling: Each edge $e \in E$ is sampled with a density governed by its metric size, yielding 8D point features: 3D position, unit tangent, an edge-type label, and the normalized edge length (3 + 3 + 1 + 1 dimensions).
This sampling ensures that both fine and coarse geometric structures are represented proportionally to their metric importance within the Brep.
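The area-proportional sampling above can be sketched as follows. The exact density formula is not reproduced here, so the square-root scaling, the constants, and all function names are our illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def adaptive_uv_sample(face_area, base=16, scale=64.0, max_res=1024):
    """Illustrative area-adaptive UV grid: larger faces get a finer grid.
    The sqrt scaling and constants are assumptions, not the paper's formula."""
    n = int(np.clip(base + np.sqrt(face_area) * scale, base, max_res))
    u = np.linspace(0.0, 1.0, n)
    v = np.linspace(0.0, 1.0, n)
    uu, vv = np.meshgrid(u, v)
    return np.stack([uu.ravel(), vv.ravel()], axis=-1)  # (n*n, 2) UV coordinates

def face_point_features(xyz, normal, curvature, visible, face_type, norm_area):
    """Pack one sampled surface point into the 10D layout described above:
    position (3) + normal (3) + curvature + visibility + face type + area."""
    return np.concatenate([xyz, normal, [curvature, visible, face_type, norm_area]])

uv = adaptive_uv_sample(face_area=0.04)   # a small face yields a coarse grid
feat = face_point_features(np.zeros(3), np.array([0.0, 0.0, 1.0]), 0.1, 1.0, 2.0, 0.3)
```

A larger `face_area` produces strictly more UV samples, which is the property the closing sentence of this section relies on.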
2. Hierarchical BrepEncoder Architecture
The sampled Brep data are processed by a hierarchical BrepEncoder, comprising three parallel feature extraction branches for each face:
- Fine-Grained Face Features: PointTransformerV3 is applied to the per-face sampled point attributes to yield fine-grained geometric features for each face.
- Edge-Conditioned Face Features: An Edge Encoder processes sampled edge attributes, propagating them onto incident faces via an NNConv mechanism to yield edge-conditioned face features.
- Global Topology Features: 2D/1D CNNs embed faces/edges, and two Edge-conditioned Graph Attention (EGATConv) layers then produce globally topology-aware face features.
All three branch outputs are concatenated to yield a per-face node token, and a global token is produced by global-attention pooling over the node tokens. The result is a variable-length node-token sequence and a single global token for downstream processing.
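The fusion step above, concatenation of the three branch outputs followed by global-attention pooling, can be sketched as follows. The branch tensors are random stand-ins and all dimensions are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
F, d = 5, 8  # number of faces and per-branch feature dim (illustrative sizes)

# Stand-ins for the three branch outputs: fine-grained, edge-conditioned, topology
h_fine = rng.normal(size=(F, d))
h_edge = rng.normal(size=(F, d))
h_topo = rng.normal(size=(F, d))

# Per-face node tokens: concatenation of the three branches
node_tokens = np.concatenate([h_fine, h_edge, h_topo], axis=-1)  # (F, 3d)

# Global-attention pooling: a learned score per node, softmax-weighted sum
w = rng.normal(size=(3 * d,))            # stand-in for a learned scoring vector
scores = node_tokens @ w                 # (F,) one scalar score per face
attn = np.exp(scores - scores.max())
attn /= attn.sum()                       # attention weights sum to 1
global_token = attn @ node_tokens        # (3d,) pooled global token
```

This yields exactly the two artifacts the section names: a variable-length `node_tokens` sequence and one `global_token`.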
3. Cross-Modal Pre-training with Contrastive Alignment
To align the structured Brep modality with natural language, BrepLLM employs dual-tower contrastive pre-training, analogous to CLIP:
- Geometry Tower: The BrepEncoder's global token is projected to a fixed-dimensional embedding shared with the text tower.
- Text Tower: A frozen CLIP text encoder (ViT-L/14) produces the corresponding text embeddings.
- InfoNCE Loss: Cosine similarities between geometry and text embeddings across a batch of $N$ pairs are temperature-scaled and softmax-normalized in both directions, producing the symmetric loss
$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ji}/\tau)}\right],$$
where $s_{ij}$ is the cosine similarity between the $i$-th geometry embedding and the $j$-th text embedding, and $\tau$ is the temperature.
This symmetric loss encourages matched Brep-text pairs to be close in embedding space, while separating mismatched pairs. The BrepEncoder is thereby trained to produce representations semantically aligned with natural language descriptions.
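A minimal NumPy sketch of this CLIP-style symmetric InfoNCE loss follows; the demo batch and the fixed temperature value are our assumptions:

```python
import numpy as np

def log_softmax(s):
    s = s - s.max(axis=1, keepdims=True)          # numerical stability
    return s - np.log(np.exp(s).sum(axis=1, keepdims=True))

def symmetric_info_nce(geo, txt, tau=0.07):
    """CLIP-style symmetric InfoNCE over a batch of paired embeddings."""
    geo = geo / np.linalg.norm(geo, axis=1, keepdims=True)   # L2-normalise
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = geo @ txt.T / tau                       # (N, N) scaled cosine similarities
    diag = np.arange(sim.shape[0])
    loss_g2t = -log_softmax(sim)[diag, diag].mean()    # geometry -> text direction
    loss_t2g = -log_softmax(sim.T)[diag, diag].mean()  # text -> geometry direction
    return 0.5 * (loss_g2t + loss_t2g)

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 16))
loss_matched = symmetric_info_nce(emb, emb)                      # aligned pairs
loss_random = symmetric_info_nce(emb, rng.normal(size=(4, 16)))  # mismatched pairs
```

As the text states, matched pairs pull the loss toward zero while mismatched pairs keep it high, so `loss_matched` comes out well below `loss_random`.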
4. Multi-Stage LLM Fine-Tuning
Following cross-modal alignment, BrepLLM undergoes a three-tiered fine-tuning regimen to integrate Brep encoding with a text-generative LLM:
Stage I: Geometry-to-Vision Bridging
- The frozen BrepEncoder generates node token sequences, which are projected via a two-layer MLP to match the Q-Former embedding dimension.
- A BLIP-2-style Q-Former with 32 learnable queries aggregates node embeddings, linearly projecting the output into the LLM's input space.
- Only the projection MLP is trained; all other modules remain frozen.
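Stage I's bridge, an MLP projection of node tokens followed by cross-attention from a fixed set of learnable queries, can be sketched as below. All dimensions except the 32 queries are illustrative, and the single-head attention step is a simplification of the BLIP-2 Q-Former:

```python
import numpy as np

rng = np.random.default_rng(0)
F, d_node, d_q, n_query = 6, 24, 16, 32  # face count and dims (illustrative, except 32 queries)

node_tokens = rng.normal(size=(F, d_node))      # frozen BrepEncoder output (stand-in)

# Two-layer MLP projecting node tokens into the Q-Former embedding space
W1 = rng.normal(size=(d_node, 32))
W2 = rng.normal(size=(32, d_q))
proj = np.maximum(node_tokens @ W1, 0.0) @ W2   # (F, d_q), ReLU in between

# One cross-attention step: 32 learnable queries attend over the projected nodes
queries = rng.normal(size=(n_query, d_q))
attn = queries @ proj.T / np.sqrt(d_q)          # (32, F) scaled dot-product scores
attn = np.exp(attn - attn.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
q_out = attn @ proj                             # (32, d_q); a linear layer would then map to the LLM dim
```

Only `W1`/`W2` (the projection MLP) would be trainable in this stage; everything else stays frozen.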
Stage II: 3D–Language Alignment Fine-Tuning
- LoRA adapters are applied to the projection MLP, selected Q-Former sublayers, and a subset of LLM layers; BrepEncoder remains frozen.
- The standard autoregressive language-modeling objective is used, further aligning the Brep representations with text output.
Stage III: Mixture-of-Query Experts (MQE)
- Introduces lightweight residual query experts and a sparse router that selects the top-$k$ experts based on the aggregated node tokens.
- The final Q-Former query set adds the sparsely routed residual expert queries to the base queries; only the residual experts and the router are updated during training.
- MQE is ablated to confirm optimal placement and effectiveness in Stage III.
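A minimal sketch of the MQE mechanism, top-$k$ routing over residual query experts added to the base queries, is below. The routing function, gating scheme, and all sizes except the 32 queries are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_query, d_q, n_expert, k = 32, 16, 4, 2   # sizes are illustrative, except 32 queries

base_queries = rng.normal(size=(n_query, d_q))
# Lightweight residual experts: each contributes a small additive query offset
experts = rng.normal(size=(n_expert, n_query, d_q)) * 0.1

def route(summary, router_w, k):
    """Sparse top-k routing from an aggregated node-token summary (our sketch)."""
    logits = router_w @ summary                 # (n_expert,) one score per expert
    top = np.argsort(logits)[-k:]               # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # softmax gates over selected experts
    return top, gates

summary = rng.normal(size=(d_q,))               # stand-in for aggregated node tokens
router_w = rng.normal(size=(n_expert, d_q))
top, gates = route(summary, router_w, k)

# Final query set: base queries plus the gated residual expert queries
final_queries = base_queries + np.tensordot(gates, experts[top], axes=1)
```

Only `experts` and `router_w` would receive gradients in Stage III; the base queries and the rest of the pipeline stay as trained in earlier stages.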
This curriculum transfers vision–language priors and incorporates geometric diversity for robust Brep understanding and text generation.
5. Brep2Text Dataset for Instruction Tuning
BrepLLM is trained and evaluated on Brep2Text, the first large-scale Brep–language question–answer (QA) dataset:
- Construction: Based on 134,722 industrial Brep models from Text2CAD, with two question tiers automatically generated per model using Qwen-Max: high-level semantic questions and procedural modeling questions.
- Scale: Yields 269,444 QA pairs in total, with 200 Breps and 400 QA pairs held out as a test set.
- Quality Control: Automatic filtering for coherence and spot-checks for correctness.
Brep2Text enables direct instruction-tuning and evaluation of models on native Brep input rather than point-cloud or mesh surrogates.
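The two-tier structure described above might be organized per model roughly as follows. The field names, identifiers, and both sample QA texts are purely illustrative inventions, not the paper's actual schema or data:

```python
# Hypothetical Brep2Text-style record (schema and contents are our illustration only)
record = {
    "brep_id": "model_000001",          # made-up identifier
    "qa": [
        {"tier": "semantic",            # tier 1: high-level semantic question
         "question": "What kind of part does this model represent?",
         "answer": "A flanged cylindrical bracket with four bolt holes."},
        {"tier": "procedural",          # tier 2: procedural modeling question
         "question": "How could this part be modeled step by step?",
         "answer": "Sketch a circle, extrude it, add a flange, then pattern four holes."},
    ],
}
```

Two QA pairs per model matches the stated scale: 134,722 models × 2 questions = 269,444 QA pairs.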
6. Experimental Evaluation and Ablation
BrepLLM achieves state-of-the-art results on both object captioning and generative classification:
| Metric | BrepLLM (Brep, 2.9B) | MiniGPT-3D (point cloud, 7–13B) | Improvement |
|---|---|---|---|
| Qwen-Max (captioning) | 58.89 | 56.58 | +2.31 |
| SBERT similarity | 73.05 | 71.64 | +1.41 |
| SimCSE similarity | 74.46 | 73.13 | +1.33 |
| Human precision (caption) | 81.85% | — | — |
| Classification avg (%) | 57.05 | 54.90 | +2.15 |
Ablations reveal:
- Adaptive UV sampling yields +2.05% lift in Stage I, +0.64% end-to-end.
- Hierarchical features add +2.42% to +2.87% accuracy.
- Full three-stage fine-tuning curriculum is optimal (57.05% classification accuracy).
- MQE's effectiveness is maximized when introduced only at Stage III.
7. Distinction from Related Work and Significance
BrepLLM's native processing of Brep data contrasts with prior point-cloud or mesh-based pipelines, enabling fine-grained geometric and topological reasoning unavailable to indirect surrogates. The introduction of hierarchical geometric–topological encoding, modality-aligned contrastive learning, multi-stage LLM integration, and the Brep2Text dataset marks the first instance of end-to-end instruction tuning for native Brep understanding, setting a new performance baseline in 3D CAD reasoning and captioning (Deng et al., 18 Dec 2025).