Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting
Abstract: This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both physical and geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.
Summary
- The paper defines Visual Jenga, a task to reveal object dependencies, and proposes a training-free method leveraging counterfactual inpainting asymmetry.
- The method quantifies pairwise object dependencies by measuring the asymmetric difficulty of removing one object via inpainting while attempting to preserve the other.
- The training-free method leverages pre-trained models to infer structural dependencies, offering practical insights for robotics, AR, and scene editing.
The paper "Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting" (arXiv:2503.21770) introduces a novel scene understanding task aimed at uncovering the dependency structure between objects within a single static image. The core idea is analogous to the game Jenga: identifying which objects can be sequentially removed from a scene while maintaining its plausibility, thereby revealing underlying physical and semantic support relationships. This task moves beyond simple object detection or segmentation towards a deeper understanding of scene composition and inter-object relationships.
The Visual Jenga Task
The Visual Jenga task is formally defined as follows: given a single RGB image containing multiple objects, determine a valid sequence of object removals such that at each step, removing the selected object results in a physically and geometrically coherent scene configuration. The process continues until only the background remains. A successful execution of this task inherently requires reasoning about factors like occlusion, physical support (e.g., gravity), and semantic context (e.g., a monitor typically sits on a desk). Unlike traditional scene graphs that might represent spatial relationships (e.g., "above", "next to"), Visual Jenga aims to capture functional or structural dependencies – which objects rely on others for their presence or position within the scene's context. The output is an ordered list representing the removal sequence, implicitly encoding a dependency hierarchy.
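As a toy illustration of the task's output (our example, not the paper's), support relations can be encoded as a directed graph, and a valid removal sequence read off by repeatedly deleting objects that nothing still in the scene rests on:

```python
from collections import defaultdict

def removal_order(supports):
    """supports: list of (a, b) pairs meaning 'a rests on b'.
    Returns an order in which objects can be removed: an object is
    removable once nothing remaining rests on it (reverse of the
    support hierarchy, i.e. a topological removal order)."""
    supported_by = defaultdict(set)  # b -> set of objects resting on b
    objects = set()
    for a, b in supports:
        supported_by[b].add(a)
        objects.update((a, b))
    order = []
    remaining = set(objects)
    while remaining:
        # Objects with nothing (still present) on top of them are removable
        free = sorted(o for o in remaining if not (supported_by[o] & remaining))
        if not free:
            raise ValueError("cyclic support relation")
        order.extend(free)
        remaining -= set(free)
    return order

# "cup rests on book", "book rests on table": cup goes first, table last
print(removal_order([("cup", "book"), ("book", "table")]))
# -> ['cup', 'book', 'table']
```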
Methodology: Counterfactual Inpainting and Asymmetry
The authors propose a training-free approach to address the Visual Jenga task, leveraging the capabilities of large pre-trained generative models, specifically inpainting models. The central hypothesis is that the dependency between two objects, A and B, exhibits asymmetry when considering their removal. If object A depends on object B (e.g., A is sitting on B), then removing A and inpainting the resulting void might be relatively straightforward for a powerful inpainting model, resulting in a plausible scene where B remains. However, removing object B and attempting to inpaint the void while keeping A might be significantly harder, potentially leading to incoherent or physically implausible results (e.g., A floating in mid-air). This difference in inpainting difficulty, or the quality of the counterfactual scene generated, quantifies the dependency asymmetry.
The proposed method involves the following steps:
- Object Segmentation: First, an off-the-shelf instance segmentation model (e.g., SAM) is used to identify and delineate all distinct object masks {O_1, O_2, ..., O_n} in the input image I.
- Pairwise Counterfactual Generation: For every ordered pair of objects (O_i, O_j), a counterfactual image is generated. To assess the dependency of O_i on O_j, object O_i is removed (masked out) from the image, and an inpainting model is employed to fill the void, conditioned on the remaining image content (including O_j). Let the resulting inpainted image be I'_{i∖j} (denoting O_i removed, O_j present).
- Asymmetry Quantification: The core idea is to measure the "cost" or "difficulty" of removing O_i while O_j is present, versus removing O_j while O_i is present. This cost is evaluated from the quality or plausibility of the generated counterfactual images I'_{i∖j} and I'_{j∖i}. A dependency score S(O_i → O_j) is computed, representing how much O_i depends on O_j. A high score suggests that O_i strongly depends on O_j: removing O_i is easy, while removing O_j and keeping O_i coherent is difficult. The exact scoring function can vary, but it should capture this asymmetry. For example, it could be based on the realism of the inpainted region, the consistency of O_j after inpainting away O_i, or the difference in reconstruction error/likelihood reported by the inpainting model.
- Dependency Graph Construction: The pairwise scores S(O_i → O_j) are used to construct a directed graph in which nodes represent objects and edges represent dependencies. An edge from O_i to O_j with weight S(O_i → O_j) indicates the degree to which O_i depends on O_j.
- Removal Sequence Determination: Based on the dependency graph, a valid removal sequence is determined. Objects with low incoming dependency scores (little else depends on them), or with high outgoing scores (they depend heavily on others), are candidates for earlier removal. Intuitively, objects that do not support other objects, or are themselves heavily supported, should be removed first. The algorithm iteratively selects the object that is least depended upon by the remaining objects, removes it, and updates the dependencies until all objects are removed. This can be framed as finding a topological sort, or a weighted variation thereof, on the dependency graph.
The authors emphasize the effectiveness of this simple, data-driven approach without requiring task-specific training, relying solely on the implicit knowledge captured within large pre-trained segmentation and inpainting models.
Implementation Details
Implementing this approach requires careful consideration of several components:
- Segmentation Model: The quality of the initial object segmentation is critical. Models like the Segment Anything Model (SAM) provide a strong foundation, but errors in segmentation (missed objects, merged objects, inaccurate boundaries) will propagate through the pipeline. Fine-tuning SAM or using alternative panoptic/instance segmentation models might be necessary depending on the domain.
- Inpainting Model: A high-resolution, context-aware inpainting model is essential. Diffusion models with inpainting support (e.g., Stable Diffusion inpainting) or dedicated inpainting networks such as LaMa are suitable candidates. The choice of model impacts the quality of the counterfactuals and the computational cost. The model must be capable of generating plausible content for potentially large masked regions corresponding to removed objects.
- Scoring Function: Defining the dependency score S(O_i → O_j) is key. Potential implementations include:
- Inpainting Realism Score: Using a discriminator model (e.g., from a GAN) or a perceptual metric (e.g., LPIPS) to evaluate the realism of the inpainted region in I'_{i∖j}. A less realistic result when removing O_j than when removing O_i indicates that O_i depends on O_j.
- CLIP Score Consistency: Evaluating the CLIP similarity between the original image crop of O_j and the corresponding region in the inpainted image I'_{i∖j}. A significant drop in similarity suggests the inpainting process struggled to keep O_j consistent once O_i was removed.
- Inpainting Likelihood/Error: If the inpainting model exposes a likelihood or reconstruction error, this can quantify the difficulty directly. A higher error when removing O_j (inpainting while preserving O_i) than when removing O_i implies that O_i depends on O_j.
- The paper suggests using the asymmetry in pairwise relationships, implying a comparison like Cost(Remove(O_i) | O_j) − Cost(Remove(O_j) | O_i).
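To make the scoring concrete, here is a deliberately crude, self-contained stand-in for such a scorer (our toy example, not the paper's metric): the cost of a counterfactual is the mean absolute pixel difference over the region that should have been preserved, and the dependency score is the cost difference above. A real implementation would swap in LPIPS, a CLIP consistency score, or a model likelihood.

```python
def preservation_cost(original, inpainted, preserved_mask):
    """Toy stand-in for a perceptual metric: mean absolute pixel
    difference over the region that should have been preserved.
    Images are 2D lists of grayscale values; the mask is a 2D 0/1 list."""
    total, count = 0.0, 0
    for orig_row, inp_row, mask_row in zip(original, inpainted, preserved_mask):
        for o, p, m in zip(orig_row, inp_row, mask_row):
            if m:
                total += abs(o - p)
                count += 1
    return total / count if count else 0.0

def dependency_score(cost_no_i, cost_no_j):
    """S(Oi -> Oj): positive when removing Oj disturbs Oi more than
    removing Oi disturbs Oj, i.e. Oi depends on Oj."""
    return cost_no_j - cost_no_i

# 2x2 grayscale toy scene: top row = "cup" (Oi), bottom row = "table" (Oj)
original   = [[10, 10], [200, 200]]
after_rm_i = [[12, 11], [200, 200]]  # cup inpainted away; table intact
after_rm_j = [[90, 80], [0, 0]]      # table inpainted away; cup region disturbed
cup_mask   = [[1, 1], [0, 0]]
table_mask = [[0, 0], [1, 1]]

cost_no_i = preservation_cost(original, after_rm_i, table_mask)  # table preserved: 0.0
cost_no_j = preservation_cost(original, after_rm_j, cup_mask)    # cup disturbed: 75.0
print(dependency_score(cost_no_i, cost_no_j))  # positive -> the cup depends on the table
```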
- Sequence Generation Algorithm: A simple greedy approach can work:
1. Compute all pairwise dependency scores S(O_i → O_j).
2. Calculate the total dependency on each object k (the support it provides): D_on(O_k) = Σ_{i≠k} S(O_i → O_k).
3. Calculate the total dependency of each object k on the rest: D_of(O_k) = Σ_{j≠k} S(O_k → O_j).
4. Select the object O* with the minimum D_on(O_k), i.e., the object providing the least support to the remaining objects (minimum "support provided" is the intuitive criterion for removal; objects with high D_of, which lean on others without being leaned on, go first).
5. Add O* to the removal sequence.
6. Remove O* and its associated edges from the graph/score calculation.
7. Repeat steps 4-6 until all objects are removed.
Below is pseudocode illustrating the core pairwise scoring logic:
```python
def calculate_dependency_score(image, mask_i, mask_j, inpainter, scorer):
    """Calculates the dependency score S(Oi -> Oj)."""
    # Counterfactual 1: remove Oi, keep Oj
    inpainted_image_no_i = inpainter.inpaint(image, mask_i)
    # Score how well Oj is preserved / how plausible the result is
    cost_no_i = scorer.evaluate(original_image=image,
                                inpainted_image=inpainted_image_no_i,
                                removed_mask=mask_i,
                                preserved_mask=mask_j)

    # Counterfactual 2: remove Oj, keep Oi
    inpainted_image_no_j = inpainter.inpaint(image, mask_j)
    # Score how well Oi is preserved / how plausible the result is
    cost_no_j = scorer.evaluate(original_image=image,
                                inpainted_image=inpainted_image_no_j,
                                removed_mask=mask_j,
                                preserved_mask=mask_i)

    # Asymmetry: a higher score means Oi depends more on Oj.
    # Assumes lower cost is better (lower reconstruction error, higher realism).
    return cost_no_j - cost_no_i

objects = segmenter.get_objects(image)  # list of segmented objects with .id and .mask
dependency_matrix = {}
for idx, oi in enumerate(objects):
    for oj in objects[idx + 1:]:  # score each unordered pair once
        score = calculate_dependency_score(image, oi.mask, oj.mask, inpainter, scorer)
        dependency_matrix[(oi.id, oj.id)] = score
        dependency_matrix[(oj.id, oi.id)] = -score  # S is antisymmetric under this definition

removal_sequence = determine_removal_sequence(dependency_matrix, objects)
```
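The helper determine_removal_sequence is left abstract above. A minimal, self-contained greedy sketch, assuming the convention that a high S(Oi -> Oj) means Oi depends on Oj (so Oj supports Oi), might look like:

```python
def determine_removal_sequence(dependency_matrix, object_ids):
    """Greedy ordering: repeatedly remove the object that provides the
    least total support to the objects still present in the scene.

    dependency_matrix: {(i, j): score} where a high score for (i, j)
    means object i depends on object j (so j supports i).
    """
    remaining = set(object_ids)
    sequence = []
    while remaining:
        def support_provided(k):
            # Sum of how much every other remaining object depends on k
            return sum(dependency_matrix.get((i, k), 0.0)
                       for i in remaining if i != k)
        # Sorting first makes tie-breaking deterministic
        target = min(sorted(remaining), key=support_provided)
        sequence.append(target)
        remaining.remove(target)
    return sequence

# Toy scene: the cup depends on the book, the book depends on the table
scores = {("cup", "book"): 1.0, ("book", "table"): 1.0}
print(determine_removal_sequence(scores, ["cup", "book", "table"]))
# -> ['cup', 'book', 'table']
```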
Practical Considerations
- Computational Cost: The primary bottleneck is the repeated use of the inpainting model. For n objects, O(n^2) pairs exist, requiring O(n^2) inpainting operations. This can be computationally intensive for scenes with many objects.
- Model Dependency: The performance heavily relies on the capabilities of the chosen segmentation and inpainting models. Failure modes of these models (e.g., inability to segment correctly, unrealistic inpainting) directly impact the resulting dependency structure. The approach implicitly assumes the inpainting model possesses common-sense physical and geometric understanding.
- Ambiguity and Subjectivity: Scene interpretation can be subjective. The notion of "dependency" might be ambiguous (e.g., semantic vs. physical support). The results reflect the biases and knowledge encoded within the pre-trained models.
- Limitations: The method might struggle with complex non-pairwise interactions, transparent/reflective objects, or highly cluttered scenes where segmentation is challenging. The definition of "coherence" is implicitly defined by the scoring function and the inpainting model's capabilities. It primarily captures pairwise relationships.
- Applications: This task and methodology could be valuable for robotics (understanding object manipulation affordances), augmented reality (realistic object removal/insertion), scene editing, and improving generative model controllability by explicitly modeling structural constraints. Understanding object dependencies is crucial for reasoning about scene stability and potential interaction outcomes.
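One simple way to mitigate the quadratic inpainting cost noted above (our suggestion, not from the paper) is to prune pairs whose objects are too far apart to plausibly support each other, before running any inpainting. A sketch using normalized bounding boxes:

```python
def candidate_pairs(boxes, max_gap=0.05):
    """boxes: {obj_id: (x0, y0, x1, y1)} in normalized image coordinates.
    Keeps only ordered pairs whose boxes are within max_gap of touching,
    on the heuristic that distant objects rarely support one another."""
    def gap(a, b):
        dx = max(b[0] - a[2], a[0] - b[2], 0.0)  # horizontal separation
        dy = max(b[1] - a[3], a[1] - b[3], 0.0)  # vertical separation
        return max(dx, dy)
    ids = list(boxes)
    return [(i, j) for i in ids for j in ids
            if i != j and gap(boxes[i], boxes[j]) <= max_gap]

# Hypothetical three-object scene: cup on table, lamp elsewhere on the floor
boxes = {"cup":   (0.40, 0.20, 0.50, 0.35),
         "table": (0.10, 0.35, 0.90, 0.60),
         "lamp":  (0.05, 0.70, 0.15, 0.95)}
print(candidate_pairs(boxes))  # only cup/table pairs survive the pruning
```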
Conclusion
The "Visual Jenga" paper introduces an intriguing task for probing the structural understanding of scenes by sequentially removing objects based on inferred dependencies. The proposed training-free approach, leveraging counterfactual generation via inpainting and quantifying dependency through asymmetry, offers a practical method to estimate these relationships without task-specific annotations or training. While it relies on the performance of the underlying large models and is computationally intensive, the approach provides a novel direction for analyzing scene composition beyond standard recognition tasks, focusing instead on the functional and physical relationships between objects.