
Amodal Ground Truth Masks

Updated 10 February 2026
  • Amodal ground truth masks are binary segmentation masks that capture the entire object silhouette, including hidden regions behind occluders.
  • Generation pipelines use 3D rendering, simulation, and compositing methods to produce large-scale, objective annotations with reduced subjectivity.
  • Their application enhances amodal segmentation, robotics grasp planning, and vision-language reasoning by accurately inferring occluded object structures.

Amodal ground truth masks are binary segmentation masks that delineate the full silhouette of an object in an image, including both the visible and occluded regions. Unlike modal segmentation, which restricts the mask to the observed (non-occluded) parts, amodal masks provide the "complete" 2D projection of the object as if no other object were obscuring it. These masks are central to the task of amodal segmentation, which targets robust scene understanding under occlusion by requiring models to predict not only the visible region but also infer the extents of hidden parts. Creating and using amodal ground truth masks is a key methodological and benchmarking challenge in computer vision and robotics, and recent work has emphasized pipelines for generating authentic, objective, and large-scale amodal annotations, often involving simulation, 3D geometry, or compositing approaches (Zhan et al., 2023).

1. Formal Definition and Geometric Construction

Amodal ground truth masks are formally defined at the pixel level. Let $O_i$ denote an object instance in a scene and let $\mathcal{I}$ be the corresponding RGB image. The modal mask $M_i$ indicates the set of pixels where $O_i$ is visible in $\mathcal{I}$; the amodal mask $A_i$ contains the full 2D projected region of $O_i$, as it would appear if other objects were absent.

Geometrically, given a set of per-object 3D meshes $\{O_1,\dots,O_n\}$ and calibrated camera parameters $(K, [R\,|\,t])$, one obtains the modal mask by rendering all objects with z-buffering (depth-based visibility), and amodal masks by rendering each mesh individually with z-buffering disabled:

  • $M_i = \Phi_{K,[R|t]}(O_1 \cup \cdots \cup O_n)[\mathrm{ID}=i]$
  • $A_i = \Phi_{K,[R|t]}(O_i)$

The occluded region $F_i$ is the set difference $F_i = A_i \setminus M_i$. Only instances meeting a minimum occlusion criterion, such as $|A_i| > 1.2 \cdot |M_i|$, are retained to ensure significant occlusion is present. This approach yields a set of amodal masks that are objective (tied to 3D shape and camera geometry) rather than subjectively drawn (Zhan et al., 2023, Moore et al., 1 Jul 2025).
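The two rendering passes and the occlusion filter described above can be sketched in NumPy, assuming per-object depth renders and full silhouettes are already available from a renderer. The function names and array conventions here are illustrative, not taken from any of the cited pipelines:

```python
import numpy as np

def modal_and_amodal_masks(depth_maps, silhouettes):
    """Derive modal masks from per-object renders via z-buffering.

    depth_maps:  (n, H, W) float array of per-object rendered depth,
                 np.inf where the object does not project.
    silhouettes: (n, H, W) bool array of per-object full projections;
                 these are already the amodal masks A_i.
    Returns (modal, amodal) as (n, H, W) bool arrays.
    """
    # Z-buffer: at each pixel, the visible object is the one with minimal depth.
    nearest = np.argmin(depth_maps, axis=0)        # (H, W) winning object index
    any_hit = np.isfinite(depth_maps).any(axis=0)  # pixels covered by some object
    n = depth_maps.shape[0]
    modal = np.stack(
        [(nearest == i) & any_hit & silhouettes[i] for i in range(n)]
    )
    return modal, silhouettes

def keep_significantly_occluded(amodal, modal, ratio=1.2):
    """Apply the occlusion criterion |A_i| > ratio * |M_i| per instance."""
    a = amodal.reshape(amodal.shape[0], -1).sum(axis=1)
    m = modal.reshape(modal.shape[0], -1).sum(axis=1)
    return a > ratio * m
```

The occluded region $F_i$ then falls out as `amodal[i] & ~modal[i]`.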

2. Automatic Generation Pipelines

State-of-the-art amodal mask datasets leverage scene reconstruction or synthetic pipelines for large-scale and accurate annotation.

3D-based generation: In the “MP3D-Amodal” benchmark, 3D meshes from MatterPort3D, with per-image camera calibration, allow deterministic rendering of modal and amodal masks using multi-pass rasterization. After geometric mask generation, a two-stage manual QA step is applied for curation, with crowd annotators and expert filters ensuring mask fidelity, leading to over 12,000 high-quality amodal masks derived from real images but grounded in physical 3D structure (Zhan et al., 2023).

Simulation/synthetic occlusion: Large-scale synthetic datasets such as MOVi-MC-AC utilize physically simulated scenes, dropping dozens of objects into a virtual space and rendering per-object amodal (no-occlusion render) and modal (full scene with z-buffer) masks automatically. This guarantees that the amodal ground truth corresponds exactly to the full object silhouette in each view, free from annotator subjectivity and scale limitations. Millions of object instances, along with amodal RGB content and optional depth, are provided in this format (Moore et al., 1 Jul 2025).

Compositing-based video protocols: For temporal segmentation, new benchmarks such as TABE-51 record two videos at matched viewpoints (“clean object” with no occluder, “occluder only”), extract segmentation masks using a strong pre-trained model, and alpha-composite to form occlusion sequences. The amodal mask for any frame is the known unoccluded mask mapped according to object motion, guaranteeing pixel-exact amodal supervision even under full occlusion (Hudson et al., 2024).
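A single frame of such a compositing protocol can be sketched as follows, under the simplifying assumption of matched static viewpoints so the clean-object mask serves directly as the amodal ground truth (TABE-51 additionally maps the mask by object motion). All names here are illustrative:

```python
import numpy as np

def composite_occlusion_frame(clean_rgb, clean_mask, occ_rgb, occ_alpha):
    """Alpha-composite an occluder-only frame over a clean-object frame.

    clean_rgb:  (H, W, 3) frame containing the unoccluded object.
    clean_mask: (H, W) bool segmentation of the object in clean_rgb;
                this mask IS the amodal ground truth for the composite.
    occ_rgb:    (H, W, 3) occluder-only frame.
    occ_alpha:  (H, W) float matte in [0, 1] for the occluder.
    Returns (composite_rgb, modal_mask, amodal_mask).
    """
    a = occ_alpha[..., None]
    composite = a * occ_rgb + (1.0 - a) * clean_rgb
    # Visible part of the object: its mask minus occluder coverage
    # (matte binarized at 0.5 for simplicity).
    modal = clean_mask & (occ_alpha < 0.5)
    return composite, modal, clean_mask
```

Because the unoccluded mask is known before compositing, the amodal supervision stays pixel-exact even when `modal` is empty (full occlusion).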

Synthetic image occlusion: Other workflows (e.g., Amodal-LVIS and pix2gestalt) yield paired occluded/unoccluded crops composited from a “complete object bank,” ensuring that for every occluded object, there exists an aligned, unoccluded amodal mask. This design is used to efficiently synthesize diverse occlusion configurations at scale (Tai et al., 8 Mar 2025, Ozguroglu et al., 2024).

3. Human Annotation and Verification Protocols

Early amodal segmentation benchmarks (COCOA, KINS, D2S amodal, COCOA-cls) relied on human tracings for the “full” object outline, with annotators instructed to mark both visible (modal) and invisible regions, usually in the form of polygons or per-pixel RLE masks (Follmann et al., 2018). This produces fine-grained amodal/visible/invisible mask triplets per instance. Human annotation provides significant coverage across diverse categories and natural scenes, but mask quality is limited by annotator consistency and subjectivity, especially in cases of severe or ambiguous occlusion.

To ensure geometric accuracy, some pipelines combine human QA with initial automatic (3D, synthetic, or compositional) annotation. For example, in MP3D-Amodal, a two-stage verification process first filters masks by majority crowd “Yes/No” responses to the question of completeness and correctness, then expert annotators prune remaining errors. This results in high inter-annotator agreement (IoU > 0.95 between human redraws and automatic projection), supporting the claim of “authentic” amodal ground truth (Zhan et al., 2023).
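The two-stage verification can be summarized as a simple filter: a crowd majority vote followed by an expert check. This is only a schematic of the protocol, with hypothetical names:

```python
def two_stage_verification(candidates, crowd_votes, expert_ok):
    """Keep masks passing a crowd majority vote, then expert review.

    candidates:  list of mask ids.
    crowd_votes: dict mapping id -> list of bool ("Yes" = mask is
                 complete and correct).
    expert_ok:   callable id -> bool, standing in for expert pruning.
    """
    # Stage 1: strict majority of "Yes" responses.
    majority = [m for m in candidates
                if 2 * sum(crowd_votes[m]) > len(crowd_votes[m])]
    # Stage 2: expert annotators prune remaining errors.
    return [m for m in majority if expert_ok(m)]
```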

4. Representation, File Formats, and Annotation Data Structures

Amodal ground truth masks are typically distributed as:

  • Pixelwise binary PNG masks, one per object instance per image.
  • Polygon or run-length encoded (RLE) segmentations in COCO-style JSONs, storing amodal, visible, and invisible masks as separate fields.
  • Index maps encoding instance-IDs (for large-scale simulation datasets), with amodal and modal instance segmentation encoded as labeled arrays (Moore et al., 1 Jul 2025, Caramia et al., 10 Dec 2025).
  • In some video or multi-view contexts, per-object amodal masks are provided for every frame and camera, indexed by (scene, camera, frame, objectID).
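For the COCO-style JSON case, an uncompressed RLE segmentation can be decoded to a binary mask with a few lines of NumPy. This sketch assumes the standard COCO convention of column-major run lengths starting with a background run; for the compressed string format one would use a library such as pycocotools instead:

```python
import numpy as np

def decode_uncompressed_rle(rle):
    """Decode a COCO-style uncompressed RLE segmentation to a binary mask.

    rle: {"size": [H, W], "counts": [c0, c1, ...]} where counts are run
         lengths in column-major (Fortran) order, starting with zeros.
    """
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0
    for run in rle["counts"]:
        flat[pos:pos + run] = val
        pos += run
        val = 1 - val               # runs alternate background/foreground
    return flat.reshape((h, w), order="F")  # column-major, per COCO
```

When amodal and visible masks are stored as separate fields, the invisible mask follows as the amodal mask minus the decoded visible mask.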

Amodal mask datasets often accompany each binary mask with:

  • Object class and instance ID.
  • Amodal and visible bounding box.
  • Optional auxiliary annotations: occlusion relations, layer order, appearance crops, depth.
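A per-instance record bundling these fields might look like the following dataclass. This is a hypothetical schema for illustration, not the layout of any particular dataset:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class AmodalAnnotation:
    """Hypothetical per-instance record for an amodal mask dataset."""
    instance_id: int
    category: str
    amodal_mask: np.ndarray                   # (H, W) bool, full silhouette
    visible_mask: np.ndarray                  # (H, W) bool, modal region
    amodal_bbox: Tuple[int, int, int, int]    # (x, y, w, h) of amodal_mask
    visible_bbox: Tuple[int, int, int, int]   # (x, y, w, h) of visible_mask
    occluder_ids: List[int] = field(default_factory=list)  # occlusion relations
    layer_order: Optional[int] = None                      # front-to-back order

    def invisible_mask(self) -> np.ndarray:
        """Occluded region: amodal minus visible."""
        return self.amodal_mask & ~self.visible_mask
```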

5. Evaluation Metrics for Amodal Mask Quality

Standard evaluation criteria for amodal mask prediction include:

  • Intersection-over-Union (IoU): For predicted amodal mask $\widehat{A}$ and ground truth $A^{gt}$, $\mathrm{IoU} = |A^{gt} \cap \widehat{A}| \,/\, |A^{gt} \cup \widehat{A}|$.
  • Occluded-region IoU: Intersection-over-Union computed only in the occluded subset, i.e., $|(\widehat{A}\setminus M)\cap(A^{gt}\setminus M)| \,/\, |(\widehat{A}\setminus M)\cup(A^{gt}\setminus M)|$ for ground-truth modal mask $M$.
  • Mean AP: COCO-style average precision for amodal detection and segmentation, typically across IoU thresholds.
  • Per-instance Amodal/Visible/Invisible AP: Precision-recall computed for each mask type, with some benchmarks requiring successful prediction of both visible and amodal masks simultaneously for a true positive (Follmann et al., 2018).
  • Boundary- and region-based metrics: Mean IoU, boundary IoU, and cumulative/global IoU are used for comparing mask edge accuracy and for fusing multiple predictions, respectively (Moore et al., 1 Jul 2025, Shih et al., 2 Jun 2025).
  • Human agreement: When possible, human-drawn amodal contours are compared to automatic projections. An average IoU above 0.95 is reported for MP3D-Amodal (Zhan et al., 2023).
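The first two metrics above can be sketched directly in NumPy (function names are illustrative):

```python
import numpy as np

def iou(pred, gt):
    """Standard mask IoU: |pred ∩ gt| / |pred ∪ gt| for bool arrays."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

def occluded_region_iou(pred_amodal, gt_amodal, gt_modal):
    """IoU restricted to the occluded subset A \\ M of each mask."""
    pred_occ = pred_amodal & ~gt_modal
    gt_occ = gt_amodal & ~gt_modal
    return iou(pred_occ, gt_occ)
```

A prediction that copies the modal mask scores zero on `occluded_region_iou` even though its plain `iou` can still be high, which is why the occluded-region variant is reported separately.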

6. Applications and Significance in Downstream Tasks

Amodal ground truth masks provide supervision for models tasked with amodal instance segmentation, amodal completion (inpainting), and reasoning about occluded scene geometry.

  • Amodal Segmentation and Completion: Models trained on authentic amodal ground truth outperform baselines that rely on pseudo-labels or purely modal data synthesis, especially on occluded portions (up to 20–30% mIoU improvement in heavily occluded regions on MOVi-MC-AC) (Moore et al., 1 Jul 2025).
  • Robotics and Grasping: In robotics, particularly for bin-picking and manipulation in clutter, access to amodal masks improves grasp planning and occluded-object retrieval efficiency. Integration into policy learning (e.g., OPG-Policy) enables targeted actions for objects only partially visible (Ding et al., 6 Mar 2025, Caramia et al., 10 Dec 2025).
  • Vision-Language Reasoning: New benchmarks (R2SM) test vision-language models on the dual task of mask-type selection (modal vs. amodal) from intent-rich instructions, relying on ground truth that is precisely paired with natural-language prompts and both types of mask (Shih et al., 2 Jun 2025).
  • Robust Scene Understanding: Large-scale amodal mask datasets are foundational for advancing generalization to unseen occlusions, multi-view/video completion, and intent-aware segmentation pipelines.

7. Limitations, Quality Control, and Future Directions

Traditional subjectivity in human-drawn amodal masks has limited reproducibility and consistency, especially in the presence of severe or ambiguous occlusions. Recent pipelines reduce subjectivity by using simulation, multi-view 3D modeling, or compositing approaches, validated by crowd and expert annotators or by human agreement metrics (Zhan et al., 2023, Hudson et al., 2024, Moore et al., 1 Jul 2025).

Nevertheless, controlling mask quality at object boundaries, under fine texture, and for deformable or highly articulated objects remains challenging. Some pipelines impose strict geometric checks such as modal-amodal consistency ($|(A_i \cap M_i)\,\Delta\,M_i| \approx 0$ across the dataset, i.e., the visible mask must lie inside the amodal mask) and filter out trivial or barely occluded cases with occlusion-ratio thresholds.
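Such a consistency check is cheap to run over a whole dataset; a minimal sketch (function name illustrative, tolerance in pixels):

```python
import numpy as np

def modal_amodal_consistent(amodal, modal, tol=0):
    """Check M ⊆ A: the symmetric difference |(A ∩ M) Δ M| should be ~0.

    amodal, modal: (H, W) bool masks for one instance.
    tol: maximum number of violating pixels tolerated (e.g. to absorb
         boundary rasterization noise).
    """
    # (A ∩ M) Δ M reduces to M \ A: visible pixels outside the silhouette.
    violation = np.logical_xor(amodal & modal, modal).sum()
    return violation <= tol
```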

Current research is expanding the scope of amodal mask benchmarking into multi-camera, multi-instance, vision-language, and real-time robotic contexts, enabled by exhaustively accurate, large-scale, and diverse amodal ground truth masking methodologies across synthetic and real domains.
