Tooth-Level Point Cloud Modeling

Updated 2 January 2026
  • Tooth-level point cloud representation is defined as the encoding of 3D dental geometries using features such as coordinates, normals, and curvature to accurately model individual teeth and dentitions.
  • It employs advanced encoder-decoder architectures—including PointNet, Transformers, and prototype memory modules—to capture both local details and global dental structures.
  • Benchmark studies reveal improved reconstruction fidelity and segmentation accuracy, demonstrating its potential to advance digital orthodontics and clinical analysis.

Tooth-level point cloud representation refers to the mathematical and algorithmic handling of 3D point sets that describe the geometry of individual teeth or entire dentitions, supporting downstream tasks such as segmentation, classification, completion, landmark detection, arrangement, and generative modeling. Contemporary research has established a series of encoding strategies, neural architectures, evaluation metrics, and learning paradigms tailored to the anatomical regularities and clinical requirements of dental data. The following sections synthesize recent advances, highlighting canonical approaches, their motivation, technical specifics, and their performance characteristics.

1. Fundamental Representation Formats

3D tooth-level point clouds are commonly constructed from surface meshes obtained by optical intraoral scanning, cone-beam CT, or panoramic X-ray-based inference. Standard preprocessing includes normalization (to zero mean, unit scale), random sampling to a fixed point count (typically 512–16,000 per tooth or dentition), and sometimes orientation alignment (e.g., x-axis for neighbor alignment, y-axis for occlusal direction). Per-point features range from minimalist 3D coordinates $\mathbf{x}\in\mathbb{R}^3$ (Sun et al., 3 Dec 2025) to augmented descriptors incorporating surface normals, barycenters, local curvature, and manual or learned features. For instance, each mesh cell in (Jana et al., 2023) is reduced to a $(\mathbf{c},\mathbf{n})\in\mathbb{R}^6$ tuple (cell barycenter and averaged unit normal), supporting efficient, permutation-invariant input for geometric learning pipelines.
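The normalization and resampling steps above can be sketched as follows; this is a minimal illustration in which the function name, point budget, and sampling strategy are placeholders rather than the pipeline of any cited paper:

```python
import numpy as np

def preprocess_tooth_cloud(points, n_points=2048, rng=None):
    """Normalize a tooth point cloud to zero mean / unit scale and
    resample it to a fixed point count (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    # Zero mean: subtract the centroid.
    centered = points - points.mean(axis=0)
    # Unit scale: divide by the maximum distance from the centroid.
    scale = np.linalg.norm(centered, axis=1).max()
    normalized = centered / scale
    # Random resampling to a fixed count, with replacement if the
    # cloud has fewer points than the target budget.
    replace = points.shape[0] < n_points
    idx = rng.choice(points.shape[0], size=n_points, replace=replace)
    return normalized[idx]

cloud = np.random.rand(5000, 3) * 10.0
out = preprocess_tooth_cloud(cloud, n_points=2048)
print(out.shape)  # (2048, 3)
```

After this step every point lies within the unit ball, which is the precondition most of the encoders discussed below assume.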

The table below summarizes prevalent raw point attribute choices:

| Representation    | Attributes Per Point                     | Reference                |
|-------------------|------------------------------------------|--------------------------|
| Coordinates only  | $[x, y, z]$                              | (Sun et al., 3 Dec 2025) |
| Coord + Normal    | $[x, y, z, n_x, n_y, n_z]$               | (Jana et al., 2023)      |
| Extended Geometry | $[v_1, v_2, v_3, b, n_1, n_2, n_3, n_b]$ | (Jana et al., 2022)      |

A consistent finding is that explicit inclusion of per-point normals—and, where available, local curvature—substantially improves downstream learning, particularly for boundary and fine-structure recognition (Xiong et al., 2023, Jana et al., 2023).

2. Architectural Approaches: Encoders, Decoders, and Self-Organization

Encoder–decoder paradigms dominate 3D dental modeling, with PointNet, MLP-based, and Transformer backbones extracting global or per-point embeddings. In tooth completion and generative tasks, the encoder compresses the input point cloud $\mathcal{P}_{pi}\in\mathbb{R}^{N\times 3}$ into a global descriptor $\mathcal{F}_{pi}\in\mathbb{R}^d$, often after farthest-point sampling to enforce coverage (Sun et al., 3 Dec 2025, Ye et al., 2023). Decoder modules reconstruct complete or filled point clouds from these representations, with FoldingNet-style architectures mapping from 2D grid parameters to 3D predictions via concatenated latent codes (Sun et al., 3 Dec 2025, Ye et al., 2023).
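The FoldingNet-style decoding described above, concatenating a replicated global descriptor with 2D grid coordinates and mapping each row to a 3D point, can be sketched as below; the MLP weights are randomly initialized placeholders standing in for a trained network:

```python
import numpy as np

def folding_decoder(latent, grid_size=16, hidden=64, rng=None):
    """FoldingNet-style decoding sketch: concatenate a 2D grid with a
    replicated latent code and map each row to a 3D point through a
    small MLP (untrained placeholder weights)."""
    rng = np.random.default_rng(rng)
    d = latent.shape[0]
    # Regular 2D grid in [-1, 1]^2, one row per output point.
    u = np.linspace(-1.0, 1.0, grid_size)
    grid = np.stack(np.meshgrid(u, u), axis=-1).reshape(-1, 2)  # (G^2, 2)
    # Replicate the global descriptor and concatenate with grid coords.
    feats = np.concatenate([grid, np.tile(latent, (grid.shape[0], 1))], axis=1)
    # Two-layer MLP: (2 + d) -> hidden -> 3, ReLU in between.
    w1 = rng.standard_normal((2 + d, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, 3)) * 0.1
    return np.maximum(feats @ w1, 0.0) @ w2  # (G^2, 3) folded points

points = folding_decoder(np.random.rand(128), grid_size=16)
print(points.shape)  # (256, 3)
```

Because every grid cell sees the same latent code, the decoder "folds" a flat 2D sheet onto the target tooth surface.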

Notably, prototype memory architectures (Sun et al., 3 Dec 2025) introduce a bank of $K$ learnable vectors $\mathcal{M}=\{m_k\in\mathbb{R}^d\}_{k=1}^{K}$ acting as canonical shape priors. During inference, nearest-neighbor search in feature space retrieves the prototype $m_{i^*}$ most similar to the encoded partial (or ground-truth) tooth, and a confidence-gated fusion (via $\alpha=\sigma(\mathrm{C}([\mathcal{F}_{pi} \,\|\, m_{i^*}]))$) blends the query with this prior, resulting in a fused descriptor $f'$. This de-biases incomplete shape representations and enables the decoder to focus on local morphology (cusps, ridges, interproximal details), with the memory bank itself self-organizing into anatomically plausible modes without explicit label supervision.
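A minimal sketch of this retrieval-and-fusion step, assuming Euclidean nearest-neighbor lookup, a linear gating head in place of the learned module $\mathrm{C}(\cdot)$, and a convex blend as the fusion rule (all three are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prototype_fusion(f_query, memory, w_gate):
    """Prototype retrieval with confidence-gated fusion: find the
    nearest prototype in feature space, compute a scalar gate alpha
    from the concatenated pair, and blend query with prototype.
    `w_gate` stands in for the learned gating head C(.)."""
    # Nearest-neighbor prototype by Euclidean distance.
    dists = np.linalg.norm(memory - f_query, axis=1)
    m_star = memory[dists.argmin()]
    # Confidence gate: alpha = sigmoid(C([f_query || m_star])).
    alpha = sigmoid(np.concatenate([f_query, m_star]) @ w_gate)
    # Fused descriptor: gated blend of query and retrieved prior.
    return alpha * f_query + (1.0 - alpha) * m_star

d, K = 32, 8
rng = np.random.default_rng(0)
f = prototype_fusion(rng.standard_normal(d),
                     rng.standard_normal((K, d)),
                     rng.standard_normal(2 * d) * 0.1)
print(f.shape)  # (32,)
```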

Variational latent models such as VF-Net (Ye et al., 2023) explicitly model the distribution over shapes via a probabilistic encoder–decoder, supporting interpolation, sampling, and manipulation in latent space (e.g., traversing a "wear direction"). Unlike Chamfer-only objectives, these methods provide a normalized likelihood for generative modeling, with each input point mapped to a unique output via auxiliary embeddings $g_i\in[-1,1]^2$, yielding state-of-the-art distributional and point-wise reconstruction fidelity.
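The two latent-space operations mentioned here, stochastic encoding and directional traversal, reduce to a few lines; this is a generic variational-autoencoder sketch, not VF-Net's actual implementation:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick used by variational encoders:
    z = mu + sigma * eps, eps ~ N(0, I), keeping sampling
    differentiable with respect to (mu, log_var)."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def traverse_latent(z, direction, steps=5, scale=1.0):
    """Traverse the latent space along a semantic direction (e.g. a
    'wear direction'); each returned row is a latent code to decode."""
    t = np.linspace(-scale, scale, steps)[:, None]
    return z[None, :] + t * direction[None, :]

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(16), np.zeros(16), rng)
codes = traverse_latent(z, np.ones(16), steps=5)
print(codes.shape)  # (5, 16)
```

Decoding each row of `codes` would yield a sequence of tooth shapes morphing along the chosen semantic direction.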

3. Segmentation and Instance-Level Representations

Segmentation methodologies partition tooth-level point clouds either semantically (labeling every point as belonging to a tooth type or gingiva) or at the instance level (assigning points to distinct teeth, accommodating variable counts, missing, or malposed teeth). Dual-branch fusion networks (e.g., PointMLP + CurveNet) operate with complementary geometry and curve features: geometry branches capture local shape via shared MLPs and affine normalization, while curve branches aggregate surface patterns through local point walks and convolutional propagation (Jana et al., 2023).
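The dual-branch design can be illustrated schematically: per-point features from the geometry and curve branches are concatenated channel-wise and projected by a shared layer. `fuse_branches` and its weights are hypothetical placeholders, not the cited architecture:

```python
import numpy as np

def fuse_branches(f_geometry, f_curve, w_fuse):
    """Sketch of dual-branch per-point feature fusion: geometry-branch
    and curve-branch features are concatenated and passed through a
    shared linear layer with ReLU (`w_fuse` stands in for learned
    weights)."""
    fused = np.concatenate([f_geometry, f_curve], axis=1)  # (N, d_g + d_c)
    return np.maximum(fused @ w_fuse, 0.0)                 # (N, d_out)

rng = np.random.default_rng(0)
out = fuse_branches(rng.standard_normal((1024, 64)),   # geometry features
                    rng.standard_normal((1024, 32)),   # curve features
                    rng.standard_normal((96, 16)) * 0.1)
print(out.shape)  # (1024, 16)
```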

Transformer-based frameworks such as TSegFormer (Xiong et al., 2023) model long-range dependencies across the full dental arch, with attention layers capturing global spatial context. Enriched point-wise features (coordinates, normals, Gaussian curvature, and a novel "point curvature" estimate $m_i$) are deployed, with multi-task heads supporting semantic and auxiliary boundary predictions. Instance-aware frameworks (BATISNet, (Cai et al., 30 Dec 2025)) use proposal-free, query-driven segmentation: a bank of learnable queries interacts with backbone features (via dot-products or learned convolutions), generating soft masks for $M$ candidate teeth. Post-processing includes Hungarian matching for optimal prediction–ground-truth alignment and graph-cut smoothing to enforce consistent and anatomically plausible tooth boundaries.
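Hungarian matching pairs the $M$ predicted masks with ground-truth instances by minimizing a total assignment cost. The brute-force version below illustrates the objective on a toy cost matrix; production code would use a proper Hungarian solver such as `scipy.optimize.linear_sum_assignment`:

```python
import itertools
import numpy as np

def match_predictions(cost):
    """Optimal one-to-one matching of predicted masks to ground-truth
    instances by minimizing total cost. Brute force over permutations
    (fine for toy M; real pipelines use the Hungarian algorithm)."""
    m = cost.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(m)):
        c = cost[np.arange(m), perm].sum()  # cost of pred i -> gt perm[i]
        if c < best_cost:
            best_cost, best_perm = c, perm
    return list(best_perm), best_cost

# Toy cost matrix: entry (i, j) = cost of assigning prediction i to gt j.
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.6],
                 [0.5, 0.4, 0.05]])
perm, total = match_predictions(cost)
print(perm)  # [1, 0, 2]
```

In a segmentation pipeline the cost entries would combine mask-overlap and classification terms rather than raw numbers.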

Boundary-focused losses, often focal-style and curvature-weighted, sharpen separations where point distribution alone is ambiguous due to packed dentition or occlusion (Cai et al., 30 Dec 2025, Xiong et al., 2023).
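One plausible form of such a loss, the standard focal term scaled by a per-point curvature weight, is sketched below; the exact weighting scheme differs between the cited papers, so treat the weight formula as an assumption:

```python
import numpy as np

def boundary_focal_loss(p, y, curvature, gamma=2.0, eps=1e-7):
    """Curvature-weighted focal loss sketch for binary boundary
    prediction: the focal term (1 - p_t)^gamma * -log(p_t) is scaled
    by a per-point curvature weight so that high-curvature (boundary)
    regions dominate the gradient."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)          # prob. of the true class
    focal = -((1.0 - p_t) ** gamma) * np.log(p_t)
    weights = 1.0 + curvature / (curvature.mean() + eps)  # assumed form
    return float((weights * focal).mean())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)          # binary boundary labels
curv = rng.random(100)               # per-point curvature magnitudes
loss_good = boundary_focal_loss(np.where(y == 1, 0.9, 0.1), y, curv)
loss_bad = boundary_focal_loss(np.full(100, 0.5), y, curv)
print(loss_good < loss_bad)  # True: confident correct predictions score lower
```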

4. Landmark, Axis, and Structural Field Encoding

For orthodontic applications requiring detection of anatomical landmarks (cusps, contact points) and axes (buccal, mesial-distal), dense field representations encode these attributes over every point of an isolated tooth cloud (Wei et al., 2021). Landmarks are modeled as geodesic-heat fields, $D^{(j)}(p_i)=\exp\left(-G(p_i, f_{p_j})^2/(2\sigma^2)\right)$, distributing supervision smoothly beyond sparse anchor points. Tooth axes are expressed as per-point vectors pointing from a surface point to its orthogonal projection onto the target axis (i.e., $v_\alpha(p_i)=\mathrm{proj}_\alpha(p_i)-p_i$), and networks are trained to regress both field types, leveraging multi-scale attention and non-local context. This field-based regression mitigates ambiguity in smooth areas and adapts gracefully to inter-individual anatomical variability.
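Both field types follow directly from the definitions above. In this sketch, Euclidean distance stands in for the geodesic distance $G$, which in the paper requires mesh connectivity to compute:

```python
import numpy as np

def landmark_heat_field(points, landmark, sigma=0.5):
    """Dense landmark field D(p_i) = exp(-G(p_i, landmark)^2 / (2 sigma^2)).
    Euclidean distance is used as a stand-in for the geodesic G."""
    d = np.linalg.norm(points - landmark, axis=1)
    return np.exp(-d**2 / (2.0 * sigma**2))

def axis_field(points, origin, direction):
    """Per-point axis vectors: from each surface point to its
    orthogonal projection onto the axis through `origin` with unit
    vector `direction`, i.e. v(p_i) = proj(p_i) - p_i."""
    direction = direction / np.linalg.norm(direction)
    rel = points - origin
    proj = origin + np.outer(rel @ direction, direction)
    return proj - points

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
heat = landmark_heat_field(pts, np.zeros(3), sigma=1.0)
print(heat)     # 1.0 at the landmark, decaying with distance
vecs = axis_field(pts, np.zeros(3), np.array([0.0, 0.0, 1.0]))
print(vecs[0])  # a point already on the axis gets a zero vector
```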

5. Arrangement, Collision, and Pose Decoupling

Tooth-level point cloud representations are employed in digital orthodontic arrangement tasks, where the input is a set $\{P_l\}$ of isolated tooth clouds and the goal is to infer physically plausible, non-colliding poses reflecting clinical objectives. The DTAN framework (He et al., 2024) separates geometric (shape) and positional features within each hidden representation: $\phi^{Geo}$ encodes centered, shape-only point clouds, while $\phi^{Pos}$ processes absolute configurations. Transformer modules propagate dependencies, and prediction of motion parameters $(\overline{t}_l, \overline{q}_l)$ is accomplished via MLPs operating on fused codes. A differentiable collision loss is implemented by sampling 2D grids on inter-tooth planes and quantifying overlap as local penetrations, with well-defined subgradients for backpropagation, ensuring feasibility under clinical constraints. Feature-consistency (contrastive) losses further encourage stability across perturbed or multiple reference frames, a design directly transferable to general multi-part and articulated object tasks.
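Applying the predicted motion parameters $(\overline{t}_l, \overline{q}_l)$ to a centered tooth cloud is a standard rigid transform; the quaternion convention below, (w, x, y, z), is one common choice and not necessarily DTAN's:

```python
import numpy as np

def quat_to_matrix(q):
    """Rotation matrix from a quaternion (w, x, y, z), normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def apply_pose(points, t, q):
    """Apply predicted motion parameters: rotate the centered tooth
    cloud by quaternion q, then translate by t."""
    return points @ quat_to_matrix(q).T + t

# A 90-degree rotation about z maps (1, 0, 0) to (0, 1, 0).
q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
p = apply_pose(np.array([[1.0, 0.0, 0.0]]), np.zeros(3), q)
print(np.round(p, 6))  # approximately [[0, 1, 0]]
```

The collision and consistency losses are then computed on the transformed clouds that this function produces.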

6. Performance Benchmarking and Implications

Tooth-level point cloud methodologies are consistently benchmarked using metrics such as Chamfer distance, Earth Mover's Distance (EMD), Intersection over Union (IoU, after voxelization), Dice score, and instance/semantic segmentation accuracy. Prototype-augmented completion with Mem4Teeth (Sun et al., 3 Dec 2025) demonstrates a $15.7\%$ reduction in Chamfer distance over strong baselines, and the probabilistic VF-Net (Ye et al., 2023) achieves state-of-the-art reconstruction (Chamfer: $1.21\times 10^{-2}$) and classification (SVM: $96.80\%$). Segmentation approaches using simplified representations (barycenter and normal, 6-D input) retain or even surpass the accuracy of structurally richer (24-D) graph-based methods (Jana et al., 2023).
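As a reference point for these benchmarks, a symmetric Chamfer distance can be computed as follows; conventions differ across papers (some use unsquared distances or average the two directions), so this is one common variant rather than the definition every cited work uses:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and
    b (M, 3): mean squared nearest-neighbor distance in both
    directions. O(N*M) memory; fine for tooth-sized clouds."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) sq. dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a + np.array([0.0, 0.0, 1.0])   # same shape, shifted by one unit
print(chamfer_distance(a, a))  # 0.0
print(chamfer_distance(a, b))  # 2.0 (squared distance 1 in each direction)
```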

Single-instance and few-shot learning regimes highlight strong geometric priors and the feasibility of self-supervised or minimally supervised pipelines, with single-scan segmentation scoring as high as DSC $\approx 0.82$ (MeshSegNet+gco) compared to full-data performance of $0.94$ (Jana et al., 2022). This suggests substantial opportunities for data-scarce clinical settings.

7. Open Problems and Future Work

While prototype memory and dense field encoding have advanced robustness, challenges remain in generalizing across highly pathologic morphologies, managing cross-modal fusion (e.g., 2D–3D in panoramic X-ray-to-point-cloud pipelines (Ma et al., 2024)), and integrating anatomical consistency (joint modeling of teeth, gingiva, arch shape, and functional occlusion). Extensions may include hybrid implicit–explicit representations for continuous surface reconstruction, contrastive and cross-modality pre-training for enhanced transferability, and differentiable physics-informed constraints for broader biomechanical simulations.

Overall, tooth-level point cloud representation is an active subfield combining advancements in geometric machine learning, statistical modeling, and clinical dental informatics, with ongoing innovation in data-efficient, anatomically faithful, and functionally aware methods.
