TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts

Published 26 Jan 2024 in cs.CV | (2401.14828v3)

Abstract: Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still lack accurate control of the specified appearance and location of the editing result due to the inherent limitations of the text description. To this end, we propose a 3D scene editing framework, TIP-Editor, that accepts both text and image prompts and a 3D bounding box to specify the editing region. With the image prompt, users can conveniently specify the detailed appearance/style of the target content in complement to the text description, enabling accurate control of the appearance. Specifically, TIP-Editor employs a stepwise 2D personalization strategy to better learn the representation of the existing scene and the reference image, in which a localization loss is proposed to encourage correct object placement as specified by the bounding box. Additionally, TIP-Editor utilizes explicit and flexible 3D Gaussian splatting as the 3D representation to facilitate local editing while keeping the background unchanged. Extensive experiments have demonstrated that TIP-Editor conducts accurate editing following the text and image prompts in the specified bounding box region, consistently outperforming the baselines in editing quality and in alignment to the prompts, both qualitatively and quantitatively.

Summary

The paper introduces a 3D editing framework that accepts text and image prompts with a specified 3D bounding box for precise user-defined edits.
The approach uses a stepwise 2D personalization strategy that combines an attention-based localization loss with LoRA layers fitted to the reference image.
Extensive evaluations across varied scenarios demonstrate TIP-Editor’s superior editing quality and accuracy compared to existing methods.

Introduction

TIP-Editor is a 3D scene editing framework that accepts both a text prompt and an image prompt, together with a 3D bounding box that specifies the editing region. The image prompt lets users supply detailed appearance and style cues from a reference image that a text description alone cannot convey, which gives more accurate control over the result.

Accurate Personalization

TIP-Editor uses a stepwise 2D personalization strategy with two parts. First, the existing scene is personalized with an attention-based localization loss that encourages the new content to be placed inside the user-defined bounding box. Second, the novel content is personalized with LoRA layers fitted to the reference image. Together, these steps give precise control over both the appearance and the location of the edit.
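The idea behind an attention-based localization loss can be illustrated with a small sketch. This is not the paper's exact formulation: the function name, the weight lam, and the specific form (reward a strong attention peak inside the box, penalize squared attention outside it) are assumptions chosen for illustration. The attention map stands in for the cross-attention of the object token in a rendered view, and the mask stands in for the 2D projection of the 3D bounding box.

```python
from typing import List


def localization_loss(attn: List[List[float]],
                      box_mask: List[List[bool]],
                      lam: float = 0.1) -> float:
    """Illustrative localization loss (assumed form, not the paper's code).

    attn     -- 2D attention map for the object token, values in [0, 1]
    box_mask -- True where a pixel lies inside the projected bounding box
    lam      -- hypothetical weight on the outside-the-box penalty
    """
    inside, outside = [], []
    for attn_row, mask_row in zip(attn, box_mask):
        for value, in_box in zip(attn_row, mask_row):
            (inside if in_box else outside).append(value)
    if not inside:
        raise ValueError("bounding box mask selects no pixels")

    # Term 1: low when some pixel inside the box receives strong attention.
    peak_term = 1.0 - max(inside)
    # Term 2: penalizes any attention that leaks outside the box.
    leak_term = sum(v * v for v in outside)
    return peak_term + lam * leak_term


mask = [[False, False],
        [False, True]]
well_placed = [[0.0, 0.0],
               [0.0, 1.0]]   # attention concentrated inside the box
misplaced = [[0.9, 0.0],
             [0.0, 0.1]]     # attention mostly outside the box

print(localization_loss(well_placed, mask))
print(localization_loss(misplaced, mask))
```

In this toy case the well-placed map yields a lower loss than the misplaced one, which is the behavior such a loss needs in order to push generated content toward the specified region.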

Flexible 3D Representation

TIP-Editor uses 3D Gaussian splatting as its 3D representation because it is efficient and well suited to local editing. Because the scene is stored as an explicit set of Gaussians, edits can be restricted to the region of interest while the background is left unchanged.
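A minimal sketch shows why an explicit representation makes local editing straightforward. It assumes an axis-aligned bounding box and treats each Gaussian as just its center position; the function names and the plain gradient step are hypothetical and do not reproduce TIP-Editor's actual optimization.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def inside_box(center: Sequence[float],
               box_min: Sequence[float],
               box_max: Sequence[float]) -> bool:
    """True if the center lies within the axis-aligned box (inclusive)."""
    return all(lo <= c <= hi for c, lo, hi in zip(center, box_min, box_max))


def editable_mask(centers: List[Point],
                  box_min: Point,
                  box_max: Point) -> List[bool]:
    """Mark which Gaussians may be modified by the edit."""
    return [inside_box(c, box_min, box_max) for c in centers]


def masked_update(centers: List[Point],
                  grads: List[Point],
                  editable: List[bool],
                  lr: float) -> List[Point]:
    """Apply a gradient step only to editable Gaussians.

    Gaussians outside the box are returned unchanged, so the
    background is preserved exactly.
    """
    updated = []
    for center, grad, can_edit in zip(centers, grads, editable):
        if can_edit:
            center = tuple(c - lr * g for c, g in zip(center, grad))
        updated.append(center)
    return updated


centers = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]
grads = [(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]
mask = editable_mask(centers, (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
print(mask)
print(masked_update(centers, grads, mask, lr=0.1))
```

Only the first Gaussian lies inside the box, so only it moves; the second is untouched. An implicit representation such as a neural field has no comparable per-primitive handle, which is one reason an explicit point-based structure suits this kind of editing.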

Comprehensive Evaluation

TIP-Editor is evaluated across a range of scenarios, including objects, human faces, and outdoor scenes. Qualitative and quantitative assessments show that it outperforms existing methods in editing quality, in fidelity to both the text and image prompts, and in user satisfaction. The framework captures the specific characteristics of the reference image, which improves the controllability and practicality of 3D scene editing.

Contributions and Capabilities

The main contributions of TIP-Editor are a versatile framework that supports multiple kinds of editing operations, a novel stepwise personalization approach that enables accurate control over the editing process, and the adoption of 3D Gaussian splatting, which makes precise local editing possible. Together, these allow the editor to produce user-specified results with accurate appearance and placement.