Object Removal Metrics

Updated 17 January 2026
  • Object removal metrics are quantitative evaluation tools that measure the success of algorithms in completely erasing target objects while maintaining scene plausibility.
  • They integrate class- or region-aware, reference-free, deep feature, and spatiotemporal approaches to overcome the limitations of traditional image inpainting benchmarks.
  • Applications span image, video, and 3D scene editing, using metrics like FID⋆, ReMOVE, and TokSim to align evaluation outcomes with human perceptual judgments.

Object removal metrics quantitatively assess the efficacy of algorithms that erase specified objects from visual data, whether images, videos, or 3D scenes, while preserving fidelity, continuity, and plausibility of the resulting scene. Traditional image inpainting benchmarks inadequately address this task due to distinctions between mere plausible infilling and the stringent requirements of object erasure—specifically, the need to suppress all object-specific features and to avoid semantic or geometric artifacts. State-of-the-art object removal metrics now target these challenges with reference-free, class- or region-aware, and human-aligned designs, operating over a continuum of data modalities and erasure goals.

1. Limitations of Conventional Evaluation Approaches

Traditional evaluation of object removal commonly relies on full-reference (FR) image quality measures (PSNR, SSIM, LPIPS, P-IDS), which require comparison to an unmodified “original” image. However, this paradigm is fundamentally unsuitable for object removal tasks because the original contains the very object that should disappear. As a result:

  • Small-mask removals are incorrectly favored: FR metrics remain artificially high when little is changed (object incompletely erased), and penalize large, necessary erasures as being “too different” from the original.
  • Unpaired metrics (e.g., FID, U-IDS) fail if the reference set contains the object class in question; imperfect removals remain close to this “real” set distribution, reducing discriminative power.
  • Reference-free image quality assessment (IQA) techniques trained on “normal” images also struggle to distinguish failed from successful removals, since their priors often include the target object class.
  • Empirical results demonstrate that FR or standard unpaired metrics systematically disagree with both ground-truth-removal-based evaluation and human perceptual judgments, especially in challenging real-world and simulated scenarios (Oh et al., 2024).
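The small-mask bias is easy to reproduce: a full-reference metric such as PSNR scores a failed (no-op) removal as perfect and penalizes a genuine edit for differing from the original. A minimal NumPy illustration with synthetic images and a hypothetical 32×32 object region:

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio between two same-shape uint8 images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

# "Failed" removal: the object region is left completely untouched.
failed = original.copy()

# "Successful" removal: the object region (top-left 32x32) is replaced
# with different, plausibly background-like content.
succeeded = original.copy()
succeeded[:32, :32] = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)

# The FR metric rewards doing nothing: the unedited copy scores infinitely
# well, while the genuine edit is penalized as "too different".
assert psnr(original, failed) > psnr(original, succeeded)
```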

2. Class-wise Object Removal Metrics

A recent advance involves class-wise, unpaired metrics that leverage semantic filtering of datasets to form genuinely object-free reference distributions. Let X_c denote the set of inpainted (object-removed) images for class c, and X_{\neg c} the set of natural images containing no instance of c:

  • FID⋆ (“FID star”): Computes the Fréchet distance between distributions of deep features from X_c and X_{\neg c}. Explicitly,

\mathrm{FID}^*(X_{\neg c}, X_c) = \|\mu_{\neg c} - \mu_c\|_2^2 + \mathrm{Tr}\left(\Sigma_{\neg c} + \Sigma_c - 2 (\Sigma_{\neg c} \Sigma_c)^{1/2}\right)

Higher scores suggest insufficient erasure; low values indicate generated images are indistinguishable from truly object-free scenes.

  • U-IDS⋆ (“Unpaired IDS star”): Fits an SVM to separate deep features of the object-free and object-removed sets and measures their separability. A lower U-IDS⋆ implies that generated removals are statistically closer to the natural object-free reference (Oh et al., 2024).

These metrics require only segmentation annotations to build the mask and filter reference samples—no ground-truth edited images are needed. Empirical evaluations on CARLA (with ground-truth-removal images) and COCO (with human annotators) demonstrate that FID⋆ and U-IDS⋆ select the same optimum mask sizes as ground-truth metrics and human votes, respectively, whereas standard metrics routinely misrank these (Oh et al., 2024).
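The FID⋆ formula can be computed directly from the two sets' feature statistics. A minimal NumPy sketch, using the identity Tr((AB)^{1/2}) = Tr((A^{1/2} B A^{1/2})^{1/2}) for PSD covariances so that only symmetric eigendecompositions are needed; the deep feature extractor itself is assumed to be external:

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def fid_star(feats_ref, feats_gen):
    """FID* between object-free reference features (X_not_c) and
    object-removed features (X_c), each of shape (num_images, dim)."""
    mu_r, mu_g = feats_ref.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_ref, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) via the symmetric product trick.
    r_half = _sqrtm_psd(cov_r)
    tr_covmean = np.trace(_sqrtm_psd(r_half @ cov_g @ r_half))
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g) - 2.0 * tr_covmean)
```

Identical feature sets score near zero; distributions that still contain object evidence drift away from the object-free reference and score higher.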

3. Reference-Free Deep Feature Metrics

Reference-free, patch-based deep feature comparisons have emerged to address scenarios where ground-truth removals are unavailable. The ReMOVE metric is prototypical:

  • Extract patch embeddings inside and outside the mask using a segmentation-trained Vision Transformer (ViT, e.g., SAM-backbone).
  • Compute the cosine similarity between the mean masked-region embedding and the mean background-region embedding:

\mathrm{ReMOVE}(x, M) = \frac{\bar z_m \cdot \bar z_u}{\|\bar z_m\| \, \|\bar z_u\|}

where \bar z_m and \bar z_u are the mean embeddings over masked and unmasked patches, respectively.

ReMOVE robustly distinguishes between complete erasure, replacement-with-other-objects, and poor background synthesis. It correlates with LPIPS in ground-truth benchmarks and aligns more closely with human preferences than other no-reference metrics (e.g., CLIPScore), particularly due to its penalization of semantic mismatches and its capacity to highlight undesirable object insertions (Chandrasekar et al., 2024).
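Given patch embeddings and a patch-level mask, the ReMOVE score reduces to a single cosine similarity. A sketch assuming the embeddings have already been extracted (e.g., by a SAM-backbone ViT):

```python
import numpy as np

def remove_score(patch_embeddings, mask_flags):
    """ReMOVE-style score: cosine similarity between the mean embedding of
    patches inside the (former) object mask and the mean embedding of
    background patches.

    patch_embeddings: (num_patches, dim) array from a ViT encoder.
    mask_flags: boolean array, True where the patch lies in the removal mask.
    """
    z_m = patch_embeddings[mask_flags].mean(axis=0)   # masked-region mean
    z_u = patch_embeddings[~mask_flags].mean(axis=0)  # background mean
    return float(z_m @ z_u / (np.linalg.norm(z_m) * np.linalg.norm(z_u)))
```

A score near 1 indicates the filled region is semantically indistinguishable from its background; residual object features or inserted objects pull the masked-region mean away from the background mean and lower the score.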

4. Metrics for Context Coherence and Hallucination

To further dissect removal performance, the Context-Aware Feature Deviation (CFD) metric evaluates both (a) context blending and (b) object hallucination:

  • Context term: Measures how closely the deep features (DINOv2) of the inpainted region align with those of the immediate bounding-box context.
  • Hallucination penalty: Identifies “object-like” segments within the inpainted region using SAM-ViT-H and, for each detected segment, compares its features with those of spatially adjacent background segments. The final score sums the context difference term and an area-weighted hallucination penalty:

\mathrm{CFD}(I', M) = d_{\text{context}} + d_{\text{hall}}

Lower CFD values imply higher plausibility and fewer perceptible artifacts.

CFD is reference-free, only requiring the inpainted image and mask. Benchmarks indicate superior alignment with human judgments of seamlessness and hallucination absence compared to pixelwise and previous feature-based metrics (Yu et al., 11 Mar 2025).
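A simplified sketch of the two CFD terms, assuming pooled feature vectors (e.g., DINOv2) are supplied externally; the Euclidean distance and the area weighting here follow the structure described above, not necessarily the paper's exact formulation:

```python
import numpy as np

def cfd_score(region_feat, context_feat, segment_feats, neighbor_feats, areas):
    """Simplified CFD sketch: context-deviation term plus an area-weighted
    hallucination penalty over object-like segments found inside the mask.

    region_feat / context_feat: pooled features of the inpainted region and
    its bounding-box context.
    segment_feats / neighbor_feats: per-detected-segment features paired with
    the features of their adjacent background segments.
    areas: pixel areas of the detected segments (weights the penalty).
    """
    d_context = float(np.linalg.norm(region_feat - context_feat))
    total_area = float(sum(areas)) or 1.0  # avoid 0/0 when nothing is detected
    d_hall = sum((a / total_area) * float(np.linalg.norm(s - n))
                 for s, n, a in zip(segment_feats, neighbor_feats, areas))
    return d_context + d_hall
```

With no detected segments the score collapses to the pure context term, so a clean, seamless fill yields a value near zero.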

5. Video and Spatiotemporal Object-Removal Metrics

Video object-removal introduces a temporal dimension: clean erasure must be temporally stable, spatially coherent, and distinguish objects from effects (e.g., shadows, reflections). TokSim is a patch-token–level metric for videos:

  • Temporal consistency: Consistency of foreground token embeddings across consecutive frames indicates stability.
  • Dissimilarity: Output foreground tokens must diverge from their pre-removal inputs, evidencing actual erasure.
  • Spatial coherence: Output foreground tokens should blend with their spatially adjacent background tokens.

Each patch’s contribution is the product of these three terms. TokSim averages this over all foreground patches and frame pairs, yielding a score that tracks human judgments of clean, temporally-coherent object removal (Kushwaha et al., 10 Jan 2026).
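The per-patch product can be sketched with cosine similarities standing in for the token-comparison terms (an assumption; only the three-way structure is taken from the description above):

```python
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def toksim_patch(out_t, out_t1, in_t, bg_t):
    """TokSim-style contribution of one foreground patch token.

    out_t, out_t1: output tokens at frames t and t+1.
    in_t: pre-removal input token at frame t.
    bg_t: mean of spatially adjacent background tokens at frame t.
    """
    temporal = _cos(out_t, out_t1)      # stable across consecutive frames
    dissim = 1.0 - _cos(out_t, in_t)    # diverges from the input object token
    spatial = _cos(out_t, bg_t)         # blends with the local background
    return temporal * dissim * spatial
```

Averaging this product over all foreground patches and frame pairs yields the clip-level score: leaving the object in place zeroes the dissimilarity term, while flickering fills depress the temporal term.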

6. 3D Object-Removal Metrics and Residual Quantification

3D scene editing, especially for privacy, requires confirming the absence of semantic and geometric “residuals” of removed objects. Metrics include:

  • IoU_drop: Decrease in segmentation mask overlap (IoU) for the object class before and after removal; gauges whether segmentation algorithms still detect the (supposedly) erased object.
  • Recognition accuracy (acc_{\text{seg}}): Percentage of views where the segmentation does not recognize the object.
  • SAM-mask similarity (sim_{\text{SAM}}): Change in instance segmentation regions (via prompt-free SAM) between pre- and post-removal renderings.
  • Spatial-depth consistency (acc_{\Delta\text{depth}}): Fraction of object pixels exhibiting significant depth change; safeguards against geometric residuals (Kocour et al., 21 Mar 2025).

User studies confirm that these metrics reliably predict perceptual detection of object remnants, validating their utility in privacy-aware workflows and system certification.
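The first of these checks, IoU_drop, is straightforward to compute from binary masks. A minimal sketch for a single view, where the segmenter's pre- and post-removal masks are compared against the ground-truth object mask:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum()) / float(union)

def iou_drop(gt_mask, seg_before, seg_after):
    """Drop in segmentation IoU for the removed class.

    Values near 1 mean the segmenter found the object before removal but no
    longer detects it afterwards; values near 0 indicate semantic residuals.
    """
    return iou(gt_mask, seg_before) - iou(gt_mask, seg_after)
```

In practice this would be averaged over rendered views of the edited 3D scene.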

7. Comparative Performance, Reliability, and Best Practices

Experimental results reveal that novel, class-aware and reference-free metrics (FID⋆, U-IDS⋆, ReMOVE, CFD, TokSim) outperform standard image quality and unpaired generation metrics on both simulated and real-world datasets. Their decisions are robust to dataset size (relative standard deviation below 1% above certain sample-size thresholds), require no ground-truth removals, and exhibit strong alignment with human selection of “best” removals (Oh et al., 2024, Chandrasekar et al., 2024, Yu et al., 11 Mar 2025, Kushwaha et al., 10 Jan 2026, Kocour et al., 21 Mar 2025).

For precise deployment:

  • Prefer class/region-aware comparison sets filtered by segmentation labels.
  • For small masks, apply cropping to balance patch statistics.
  • Use feature extractors pretrained for segmentation (e.g., SAM-ViT, DINOv2) to increase sensitivity to object remnants.
  • Combine multiple metrics (semantic, instance, geometric) to robustly guard against residuals and artifacts, especially critical in privacy contexts.

These modern object removal metrics facilitate reproducible, reference-free benchmarking, prioritize human-aligned perceptual qualities, and natively distinguish erasure from substitution, establishing rigorous quantitative standards for object-oriented image, video, and 3D-editing research.
