Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting
Abstract: This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both physical and geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.
Summary
- The paper defines Visual Jenga, a task to reveal object dependencies, and proposes a training-free method leveraging counterfactual inpainting asymmetry.
- The method quantifies pairwise object dependencies by measuring the asymmetric difficulty of removing one object via inpainting while attempting to preserve the other.
- The training-free method leverages pre-trained models to infer structural dependencies, offering practical insights for robotics, AR, and scene editing.
The paper "Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting" (arXiv:2503.21770) introduces a novel scene understanding task aimed at uncovering the dependency structure between objects within a single static image. The core idea is analogous to the game Jenga: identifying which objects can be sequentially removed from a scene while maintaining its plausibility, thereby revealing underlying physical and semantic support relationships. This task moves beyond simple object detection or segmentation towards a deeper understanding of scene composition and inter-object relationships.
The Visual Jenga Task
The Visual Jenga task is formally defined as follows: given a single RGB image containing multiple objects, determine a valid sequence of object removals such that at each step, removing the selected object results in a physically and geometrically coherent scene configuration. The process continues until only the background remains. A successful execution of this task inherently requires reasoning about factors like occlusion, physical support (e.g., gravity), and semantic context (e.g., a monitor typically sits on a desk). Unlike traditional scene graphs that might represent spatial relationships (e.g., "above", "next to"), Visual Jenga aims to capture functional or structural dependencies – which objects rely on others for their presence or position within the scene's context. The output is an ordered list representing the removal sequence, implicitly encoding a dependency hierarchy.
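As a toy illustration of the task's output (our example, not the paper's), support relations can be encoded as a directed graph, and a valid removal sequence read off by repeatedly deleting objects that nothing still in the scene rests on:

```python
from collections import defaultdict

def removal_order(supports):
    """supports: list of (a, b) pairs meaning 'a rests on b'.
    Returns an order in which objects can be removed: an object is
    removable once nothing remaining rests on it (reverse of the
    support hierarchy, i.e. a topological removal order)."""
    supported_by = defaultdict(set)  # b -> set of objects resting on b
    objects = set()
    for a, b in supports:
        supported_by[b].add(a)
        objects.update((a, b))
    order = []
    remaining = set(objects)
    while remaining:
        # Objects with nothing (still present) on top of them are removable
        free = sorted(o for o in remaining if not (supported_by[o] & remaining))
        if not free:
            raise ValueError("cyclic support relation")
        order.extend(free)
        remaining -= set(free)
    return order

# "cup rests on book", "book rests on table": cup goes first, table last
print(removal_order([("cup", "book"), ("book", "table")]))
# -> ['cup', 'book', 'table']
```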
Methodology: Counterfactual Inpainting and Asymmetry
The authors propose a training-free approach to address the Visual Jenga task, leveraging the capabilities of large pre-trained generative models, specifically inpainting models. The central hypothesis is that the dependency between two objects, A and B, exhibits asymmetry when considering their removal. If object A depends on object B (e.g., A is sitting on B), then removing A and inpainting the resulting void might be relatively straightforward for a powerful inpainting model, resulting in a plausible scene where B remains. However, removing object B and attempting to inpaint the void while keeping A might be significantly harder, potentially leading to incoherent or physically implausible results (e.g., A floating in mid-air). This difference in inpainting difficulty, or the quality of the counterfactual scene generated, quantifies the dependency asymmetry.
The proposed method involves the following steps:
- Object Segmentation: First, an off-the-shelf instance segmentation model (e.g., SAM) is used to identify and delineate all distinct object masks {O_1, O_2, ..., O_n} in the input image I.
- Pairwise Counterfactual Generation: For every ordered pair of objects (O_i, O_j), a counterfactual image is generated. To assess the dependency of O_i on O_j, object O_i is removed (masked out) from the image, and an inpainting model is employed to fill the void, conditioned on the remaining image content (including O_j). Let the resulting inpainted image be I'_{i∖j} (denoting O_i removed, O_j present).
- Asymmetry Quantification: The core idea is to measure the "cost" or "difficulty" of removing O_i while O_j is present, versus removing O_j while O_i is present. This cost is evaluated from the quality or plausibility of the generated counterfactual images I'_{i∖j} and I'_{j∖i}. A dependency score S(O_i → O_j) is computed, representing how much O_i depends on O_j. A high score suggests that O_i strongly depends on O_j: removing O_i is easy, while removing O_j and keeping O_i coherent is difficult. The exact scoring function can vary, but it should capture this asymmetry. For example, it could be based on the realism of the inpainted region, the consistency of O_j after inpainting away O_i, or the difference in reconstruction error/likelihood reported by the inpainting model.
- Dependency Graph Construction: The pairwise scores S(O_i → O_j) are used to construct a directed graph in which nodes represent objects and edges represent dependencies. An edge from O_i to O_j with weight S(O_i → O_j) indicates the degree to which O_i depends on O_j.
- Removal Sequence Determination: Based on the dependency graph, a valid removal sequence is determined. Objects with low incoming dependency scores (little else depends on them), or with high outgoing scores (they depend heavily on others), are candidates for earlier removal. Intuitively, objects that do not support other objects, or are themselves heavily supported, should be removed first. The algorithm iteratively selects the object that is least depended upon by the remaining objects, removes it, and updates the dependencies until all objects are removed. This can be framed as finding a topological sort, or a weighted variation thereof, on the dependency graph.
The authors emphasize the effectiveness of this simple, data-driven approach without requiring task-specific training, relying solely on the implicit knowledge captured within large pre-trained segmentation and inpainting models.
Implementation Details
Implementing this approach requires careful consideration of several components:
- Segmentation Model: The quality of the initial object segmentation is critical. Models like the Segment Anything Model (SAM) provide a strong foundation, but errors in segmentation (missed objects, merged objects, inaccurate boundaries) will propagate through the pipeline. Fine-tuning SAM or using alternative panoptic/instance segmentation models might be necessary depending on the domain.
- Inpainting Model: A high-resolution, context-aware inpainting model is essential. Diffusion models with inpainting support (e.g., Stable Diffusion inpainting) or dedicated inpainting networks such as LaMa are suitable candidates. The choice of model impacts the quality of the counterfactuals and the computational cost. The model must be capable of generating plausible content for potentially large masked regions corresponding to removed objects.
- Scoring Function: Defining the dependency score S(O_i → O_j) is key. Potential implementations include:
- Inpainting Realism Score: Using a discriminator model (e.g., from a GAN) or a perceptual metric (e.g., LPIPS) to evaluate the realism of the inpainted region in I'_{i∖j}. A less realistic result when removing O_j than when removing O_i indicates that O_i depends on O_j.
- CLIP Score Consistency: Evaluating the CLIP similarity between the original image crop of O_j and the corresponding region in the inpainted image I'_{i∖j}. A significant drop in similarity suggests the inpainting process struggled to keep O_j consistent once O_i was removed.
- Inpainting Likelihood/Error: If the inpainting model exposes a likelihood or reconstruction error, this can quantify the difficulty directly. A higher error when removing O_j (inpainting while preserving O_i) than when removing O_i implies that O_i depends on O_j.
- The paper suggests using the asymmetry in pairwise relationships, implying a comparison like Cost(Remove(O_i) | O_j) − Cost(Remove(O_j) | O_i).
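To make the scoring concrete, here is a deliberately crude, self-contained stand-in for such a scorer (our toy example, not the paper's metric): the cost of a counterfactual is the mean absolute pixel difference over the region that should have been preserved, and the dependency score is the cost difference above. A real implementation would swap in LPIPS, a CLIP consistency score, or a model likelihood.

```python
def preservation_cost(original, inpainted, preserved_mask):
    """Toy stand-in for a perceptual metric: mean absolute pixel
    difference over the region that should have been preserved.
    Images are 2D lists of grayscale values; the mask is a 2D 0/1 list."""
    total, count = 0.0, 0
    for orig_row, inp_row, mask_row in zip(original, inpainted, preserved_mask):
        for o, p, m in zip(orig_row, inp_row, mask_row):
            if m:
                total += abs(o - p)
                count += 1
    return total / count if count else 0.0

def dependency_score(cost_no_i, cost_no_j):
    """S(Oi -> Oj): positive when removing Oj disturbs Oi more than
    removing Oi disturbs Oj, i.e. Oi depends on Oj."""
    return cost_no_j - cost_no_i

# 2x2 grayscale toy scene: top row = "cup" (Oi), bottom row = "table" (Oj)
original   = [[10, 10], [200, 200]]
after_rm_i = [[12, 11], [200, 200]]  # cup inpainted away; table intact
after_rm_j = [[90, 80], [0, 0]]      # table inpainted away; cup region disturbed
cup_mask   = [[1, 1], [0, 0]]
table_mask = [[0, 0], [1, 1]]

cost_no_i = preservation_cost(original, after_rm_i, table_mask)  # table preserved: 0.0
cost_no_j = preservation_cost(original, after_rm_j, cup_mask)    # cup disturbed: 75.0
print(dependency_score(cost_no_i, cost_no_j))  # positive -> the cup depends on the table
```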
- Sequence Generation Algorithm: A simple greedy approach can work:
1. Compute all pairwise dependency scores S(O_i → O_j).
2. Calculate the total dependency on each object k (the support it provides): D_on(O_k) = Σ_{i≠k} S(O_i → O_k).
3. Calculate the total dependency of each object k on the rest: D_of(O_k) = Σ_{j≠k} S(O_k → O_j).
4. Select the object O* with the minimum D_on(O_k), i.e., the object providing the least support to the remaining objects (minimum "support provided" is the intuitive criterion for removal; objects with high D_of, which lean on others without being leaned on, go first).
5. Add O* to the removal sequence.
6. Remove O* and its associated edges from the graph/score calculation.
7. Repeat steps 4-6 until all objects are removed.
Below is pseudocode illustrating the core pairwise scoring logic:
```python
def calculate_dependency_score(image, mask_i, mask_j, inpainter, scorer):
    """Calculates the dependency score S(Oi -> Oj)."""
    # Counterfactual 1: remove Oi, keep Oj
    inpainted_image_no_i = inpainter.inpaint(image, mask_i)
    # Score how well Oj is preserved / how plausible the result is
    cost_no_i = scorer.evaluate(original_image=image,
                                inpainted_image=inpainted_image_no_i,
                                removed_mask=mask_i,
                                preserved_mask=mask_j)

    # Counterfactual 2: remove Oj, keep Oi
    inpainted_image_no_j = inpainter.inpaint(image, mask_j)
    # Score how well Oi is preserved / how plausible the result is
    cost_no_j = scorer.evaluate(original_image=image,
                                inpainted_image=inpainted_image_no_j,
                                removed_mask=mask_j,
                                preserved_mask=mask_i)

    # Asymmetry: a higher score means Oi depends more on Oj.
    # Assumes lower cost is better (lower reconstruction error, higher realism).
    return cost_no_j - cost_no_i

objects = segmenter.get_objects(image)  # list of segmented objects with .id and .mask
dependency_matrix = {}
for idx, oi in enumerate(objects):
    for oj in objects[idx + 1:]:  # score each unordered pair once
        score = calculate_dependency_score(image, oi.mask, oj.mask, inpainter, scorer)
        dependency_matrix[(oi.id, oj.id)] = score
        dependency_matrix[(oj.id, oi.id)] = -score  # S is antisymmetric under this definition

removal_sequence = determine_removal_sequence(dependency_matrix, objects)
```
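The helper determine_removal_sequence is left abstract above. A minimal, self-contained greedy sketch, assuming the convention that a high S(Oi -> Oj) means Oi depends on Oj (so Oj supports Oi), might look like:

```python
def determine_removal_sequence(dependency_matrix, object_ids):
    """Greedy ordering: repeatedly remove the object that provides the
    least total support to the objects still present in the scene.

    dependency_matrix: {(i, j): score} where a high score for (i, j)
    means object i depends on object j (so j supports i).
    """
    remaining = set(object_ids)
    sequence = []
    while remaining:
        def support_provided(k):
            # Sum of how much every other remaining object depends on k
            return sum(dependency_matrix.get((i, k), 0.0)
                       for i in remaining if i != k)
        # Sorting first makes tie-breaking deterministic
        target = min(sorted(remaining), key=support_provided)
        sequence.append(target)
        remaining.remove(target)
    return sequence

# Toy scene: the cup depends on the book, the book depends on the table
scores = {("cup", "book"): 1.0, ("book", "table"): 1.0}
print(determine_removal_sequence(scores, ["cup", "book", "table"]))
# -> ['cup', 'book', 'table']
```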
Practical Considerations
- Computational Cost: The primary bottleneck is the repeated use of the inpainting model. For n objects, O(n^2) pairs exist, requiring O(n^2) inpainting operations. This can be computationally intensive for scenes with many objects.
- Model Dependency: The performance heavily relies on the capabilities of the chosen segmentation and inpainting models. Failure modes of these models (e.g., inability to segment correctly, unrealistic inpainting) directly impact the resulting dependency structure. The approach implicitly assumes the inpainting model possesses common-sense physical and geometric understanding.
- Ambiguity and Subjectivity: Scene interpretation can be subjective. The notion of "dependency" might be ambiguous (e.g., semantic vs. physical support). The results reflect the biases and knowledge encoded within the pre-trained models.
- Limitations: The method might struggle with complex non-pairwise interactions, transparent/reflective objects, or highly cluttered scenes where segmentation is challenging. The definition of "coherence" is implicitly defined by the scoring function and the inpainting model's capabilities. It primarily captures pairwise relationships.
- Applications: This task and methodology could be valuable for robotics (understanding object manipulation affordances), augmented reality (realistic object removal/insertion), scene editing, and improving generative model controllability by explicitly modeling structural constraints. Understanding object dependencies is crucial for reasoning about scene stability and potential interaction outcomes.
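One simple way to mitigate the quadratic inpainting cost noted above (our suggestion, not from the paper) is to prune pairs whose objects are too far apart to plausibly support each other, before running any inpainting. A sketch using normalized bounding boxes:

```python
def candidate_pairs(boxes, max_gap=0.05):
    """boxes: {obj_id: (x0, y0, x1, y1)} in normalized image coordinates.
    Keeps only ordered pairs whose boxes are within max_gap of touching,
    on the heuristic that distant objects rarely support one another."""
    def gap(a, b):
        dx = max(b[0] - a[2], a[0] - b[2], 0.0)  # horizontal separation
        dy = max(b[1] - a[3], a[1] - b[3], 0.0)  # vertical separation
        return max(dx, dy)
    ids = list(boxes)
    return [(i, j) for i in ids for j in ids
            if i != j and gap(boxes[i], boxes[j]) <= max_gap]

# Hypothetical three-object scene: cup on table, lamp elsewhere on the floor
boxes = {"cup":   (0.40, 0.20, 0.50, 0.35),
         "table": (0.10, 0.35, 0.90, 0.60),
         "lamp":  (0.05, 0.70, 0.15, 0.95)}
print(candidate_pairs(boxes))  # only cup/table pairs survive the pruning
```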
Conclusion
The "Visual Jenga" paper introduces an intriguing task for probing the structural understanding of scenes by sequentially removing objects based on inferred dependencies. The proposed training-free approach, leveraging counterfactual generation via inpainting and quantifying dependency through asymmetry, offers a practical method to estimate these relationships without task-specific annotations or training. While it relies on the performance of the underlying large models and is computationally intensive, the approach provides a novel direction for analyzing scene composition beyond standard recognition tasks, focusing instead on the functional and physical relationships between objects.