ReplaceAnything3D: Text-Guided 3D Scene Editing with Compositional Neural Radiance Fields

Published 31 Jan 2024 in cs.CV, cs.AI, and cs.GR | arXiv:2401.17895v1

Abstract: We introduce the ReplaceAnything3D model (RAM3D), a novel text-guided 3D scene editing method that enables the replacement of specific objects within a scene. Given multi-view images of a scene, a text prompt describing the object to replace, and a text prompt describing the new object, our Erase-and-Replace approach can effectively swap objects in the scene with newly generated content while maintaining 3D consistency across multiple viewpoints. We demonstrate the versatility of ReplaceAnything3D by applying it to various realistic 3D scenes, showcasing results of modified foreground objects that are well-integrated with the rest of the scene without affecting its overall integrity.


Summary

  • The paper introduces RAM3D, which leverages text prompts for targeted object detection, removal, and replacement in 3D scenes.
  • It employs a multi-stage erase-and-replace method combining text-guided inpainting and neural radiance fields to maintain consistent views.
  • RAM3D demonstrates robustness across varied scene types and supports custom asset integration, marking a significant advancement in 3D scene editing.

Introduction

One of the burgeoning challenges in the 3D content creation domain is the ability to edit and manipulate 3D scenes post-reconstruction. While significant strides have been made in the field of 3D reconstruction and generation, efficient and intuitive techniques for 3D content editing lag behind. The ReplaceAnything3D model (RAM3D) situates itself as a pioneering advancement in this area. RAM3D is a novel method that leverages text guidance to identify, erase, and replace objects within 3D scenes while ensuring consistency and realism across multiple viewpoints.

Erase-and-Replace Approach

RAM3D's core methodology revolves around an Erase-and-Replace paradigm executed in several stages. First, object detection and segmentation are driven by natural language prompts through the LangSAM framework, isolating the target object to be removed. Next, a text-guided 3D inpainting technique fills in the background vacated by the erased object. The third stage applies a similar technique to generate a new object matching the provided text description. Finally, the generated object is seamlessly composited into the scene, and a neural radiance field (NeRF) is trained on the edited multi-view images, yielding a 3D scene representation that can be rendered from novel viewpoints.
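The final re-rendering step relies on standard NeRF volume rendering: a pixel color is accumulated along a ray as C = Σᵢ Tᵢ αᵢ cᵢ with αᵢ = 1 − exp(−σᵢ δᵢ) and transmittance Tᵢ = Πⱼ₍ⱼ₍ᵢ₎ (1 − αⱼ). A minimal, self-contained illustration (the densities and colors below are hand-picked numbers, not outputs of a trained network):

```python
import numpy as np

# Minimal NeRF-style volume rendering along a single ray. The two
# high-density samples in the middle act as an opaque white surface.
sigma = np.array([0.0, 0.0, 5.0, 5.0, 0.0])   # density samples along the ray
color = np.array([0.0, 0.0, 1.0, 1.0, 0.0])   # per-sample grayscale color
delta = np.full(5, 0.5)                        # spacing between samples

alpha = 1.0 - np.exp(-sigma * delta)           # per-sample opacity
trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])  # T_i
weights = trans * alpha                        # contribution of each sample
pixel = (weights * color).sum()
print(round(pixel, 3))                         # 0.993
```

Because the weights sum to less than one, light that passes through all samples contributes background color in a full renderer; here it is simply lost.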

The proposed model excels at object replacement within 3D scenes, overcoming the multi-view consistency challenges that hinder traditional 2D methods. By leveraging the strengths of image diffusion models and learned 3D scene representations, alongside HiFA's text-to-3D distillation approach, RAM3D introduces a compositional structure that significantly enhances the visual coherence of the edited scenes.
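The distillation idea can be sketched in miniature. In score distillation sampling (the mechanism behind DreamFusion and refined by HiFA), the gradient of the loss with respect to the rendered image is approximated by w(t)(ε̂ − ε), where ε̂ is a diffusion model's noise prediction on a noised rendering. The "denoiser" below is a stand-in that knows the target appearance, so the example runs without a real diffusion model; it is illustrative only, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.full(16, 0.8)   # what the text prompt "wants" the pixels to be
x = rng.random(16)          # current rendering (flattened toy pixels)

def fake_denoiser(x_noisy, alpha, target):
    # A real model predicts the noise from x_noisy and the text prompt;
    # this stand-in derives it from the known target so the loop runs.
    return (x_noisy - np.sqrt(alpha) * target) / np.sqrt(1.0 - alpha)

for step in range(200):
    a = rng.uniform(0.1, 0.9)                       # random diffusion time
    eps = rng.standard_normal(16)
    x_noisy = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps
    eps_hat = fake_denoiser(x_noisy, a, target)
    grad = eps_hat - eps                            # SDS gradient, w(t) = 1
    x -= 0.05 * grad                                # update "scene parameters"

print(np.abs(x - target).max() < 0.1)               # True
```

Note that the injected noise ε cancels exactly in the gradient here; with a real denoiser it only cancels in expectation, which is why SDS updates are averaged over many random timesteps.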

Methodology

The ReplaceAnything3D framework introduces a unique pipeline:

  1. The Erase stage employs a novel text-guided 3D inpainting technique for background restoration, optimizing parameters to implicitly represent an accurately repainted scene behind the removed element.
  2. During the Replace stage, new objects prescribed by text prompts are generated and composited over the repainted background, employing pre-trained inpainting diffusion models.
  3. The final step involves creating a modified training dataset using the edited views to train a new NeRF, thereby synthesizing the modified scene from unexplored viewpoints.
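The three stages above can be condensed into a data-flow skeleton. Every function here is a trivial numpy stand-in for the real component (LangSAM segmentation, diffusion-based inpainting, text-to-3D generation, NeRF training); only the pipeline shape is faithful to the paper:

```python
import numpy as np

def segment(view, prompt):
    # Stand-in for LangSAM: treat the brighter-than-average region as
    # the prompted object.
    return view > view.mean()

def inpaint_background(view, mask):
    # Stand-in for text-guided inpainting: fill masked pixels with the
    # mean of the unmasked background.
    out = view.copy()
    out[mask] = view[~mask].mean()
    return out

def generate_object(prompt, shape):
    # Stand-in for text-driven object generation: a constant blob.
    return np.full(shape, 0.9)

def composite(background, obj, mask):
    # Composite the new object over the repainted background.
    out = background.copy()
    out[mask] = obj[mask]
    return out

rng = np.random.default_rng(0)
views = [rng.random((8, 8)) for _ in range(3)]   # toy multi-view "images"

edited = []
for v in views:
    mask = segment(v, "old object")
    bg = inpaint_background(v, mask)
    obj = generate_object("new object", v.shape)
    edited.append(composite(bg, obj, mask))

# The edited views would then serve as training data for a fresh NeRF.
print(len(edited), edited[0].shape)              # 3 (8, 8)
```

In the actual method the Erase and Replace stages each optimize an implicit 3D representation via distillation, rather than operating on views independently, which is what provides multi-view consistency.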

Results and Contributions

In evaluating RAM3D's capabilities, the paper outlines distinct contributions. The model performs localized edits, replacing user-specified objects in high-resolution scenes. Furthermore, it can remove or add multiple objects within a 3D scene, demonstrating robustness across scene types, including both forward-facing and 360-degree captures.

Through extensive experimentation, RAM3D has exhibited quantitatively impressive results. The model showcases its prowess not only in replacing objects but also in adept object removal and addition to scenes. An innovative feature of RAM3D is the capability for users to integrate personalized assets into scenes—by fine-tuning a diffusion model with images of an object, RAM3D is able to incorporate or replace objects with custom content.
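The personalization feature can also be sketched in miniature. The paper fine-tunes an inpainting diffusion model on a few photos of a user's object (in the spirit of DreamBooth) so that a prompt can invoke that specific asset. Below, the "model" is just a learnable embedding fitted by gradient descent to a handful of subject "images" (random vectors); only the fine-tuning loop shape is illustrative, and none of this is the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
subject_images = rng.random((4, 8))   # a few views of the custom asset
token_embedding = np.zeros(8)         # learnable personalization weights

for _ in range(300):
    # Full-batch gradient of the mean squared reconstruction error.
    grad = 2.0 * (token_embedding - subject_images).mean(axis=0)
    token_embedding -= 0.05 * grad

# The embedding converges to the subject's mean appearance.
print(np.allclose(token_embedding, subject_images.mean(axis=0)))  # True
```

In the real system the fine-tuned diffusion weights then drive the same Replace-stage distillation, so the generated 3D object resembles the user's asset rather than a generic one.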

Conclusion and Future Directions

ReplaceAnything3D emerges as a significant leap forward in the arena of text-guided 3D scene editing. Its multi-stage approach provides remarkable flexibility, enabling users to perform intricate edits that were previously challenging. Looking ahead, the paper identifies opportunities for extending RAM3D to handle other scene representations, further refine editing controls, and expedite the editing process. RAM3D thus sets the stage for substantial future advancements in 3D content creation and manipulation, promising new horizons in VR/MR, gaming, and digital media.
