Soft-Masked SDS Loss: Localized 3D Editing

Updated 23 December 2025

Soft-masked SDS Loss is a loss formulation that utilizes view-consistent soft segmentation masks to enable precise part-level edits in 3D Gaussian splatting representations.
It integrates additional regularization terms, including an L1 anchor loss and Gaussian prior removal, to ensure stability and restrict gradient flow to targeted regions.
Implemented in the RoMaP framework, this approach overcomes the global editing bias of vanilla SDS by confining modifications to specific semantic parts.

Soft-masked Score Distillation Sampling (SDS) loss is a loss formulation designed for localized 3D editing within the framework of 3D Gaussian Splatting, with applications in precise part-level modifications of 3D scene representations. It integrates a soft part-based mask into the SDS loss, restricts gradient flow to specific semantic regions, employs additional regularization terms, and leverages a robust, view-consistent soft segmentation mask generated by a geometry-aware module. This approach addresses the instability and global-editing bias typical of vanilla SDS when applied to complex 3D objects, enabling robust, controlled, and visually consistent edits to targeted parts while preserving context. The primary implementation is found in the RoMaP framework, which demonstrates state-of-the-art local 3D editing on reconstructed and generated Gaussian scenes and objects (Kim et al., 15 Jul 2025).

1. Standard Score Distillation Sampling (SDS) Loss

Score Distillation Sampling (SDS) is a text-driven loss first introduced in 3D generative modeling to propagate semantic feedback from pretrained 2D diffusion models (such as Stable Diffusion) into 3D representations. Let $g_\theta$ denote a differentiable 3D renderer parameterized by the Gaussian splat representation $\theta$, and let $x_0$ be the image it renders. For a diffusion model's noise-prediction network $\epsilon_\phi(x_t; t, c)$, where $c$ is the conditioning prompt, the forward noising process for the image $x_0$ is written as

$x_t = \alpha_t\,x_0 + \sigma_t\,\epsilon,\quad \epsilon\sim\mathcal{N}(0,I).$

The conventional SDS loss is

$L_{\rm SDS}(\theta) = \mathbb{E}_{t,\epsilon}\left[w(t)\left\|\epsilon_\phi(x_t; t, c) - \epsilon\right\|_2^2\right],$

where $w(t)$ is a timestep-dependent weight (often unity). The loss provides a gradient signal that is back-propagated via the differentiable renderer into the parameters governing the 3D scene.
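As a minimal illustration of the two formulas above, the following sketch noises a flattened toy image and evaluates the squared residual for a single $(t, \epsilon)$ sample. The names `add_noise`, `sds_residual_loss`, and `toy_noise_predictor` are hypothetical, and the toy predictor is only a stand-in for a real diffusion network $\epsilon_\phi$; the numeric values are arbitrary.

```python
import random

def add_noise(x0, eps, alpha_t, sigma_t):
    """Forward noising x_t = alpha_t * x0 + sigma_t * eps, elementwise on flat lists."""
    return [alpha_t * x + sigma_t * e for x, e in zip(x0, eps)]

def sds_residual_loss(eps_pred, eps, w_t=1.0):
    """w(t) * ||eps_pred - eps||_2^2 for one (t, eps) sample."""
    return w_t * sum((p - e) ** 2 for p, e in zip(eps_pred, eps))

def toy_noise_predictor(x_t, alpha_t, sigma_t, preferred):
    """Hypothetical stand-in for eps_phi: the noise that would explain x_t
    if the clean image were `preferred` (playing the role of the prompt c)."""
    return [(x - alpha_t * g) / sigma_t for x, g in zip(x_t, preferred)]

rng = random.Random(0)
x0 = [0.2, 0.5, 0.8, 0.1]        # flattened toy rendering
alpha_t, sigma_t = 0.8, 0.6      # assumed noise-schedule values for one timestep
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = add_noise(x0, eps, alpha_t, sigma_t)

# Predictor that already "agrees" with the rendering: residual vanishes.
loss_match = sds_residual_loss(toy_noise_predictor(x_t, alpha_t, sigma_t, x0), eps)
# Predictor that prefers a different image: residual is non-zero at every pixel.
loss_off = sds_residual_loss(toy_noise_predictor(x_t, alpha_t, sigma_t, [0.9] * 4), eps)
```

The second case shows the behaviour discussed next: every pixel contributes to the residual, so nothing in the plain loss restricts where the edit lands.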

Applying standard SDS directly to 3D Gaussians computes the loss uniformly across all pixels, so the edit changes the appearance of the whole object rather than a focused, controllable semantic part (Kim et al., 15 Jul 2025).

2. View-Consistent Soft 3D Mask Generation via 3D-GALP

To enable precise part-level editing, RoMaP introduces 3D-Geometry Aware Label Prediction (3D-GALP), which generates view-consistent soft segmentation masks for target parts. Each 3D Gaussian $\Omega_i$ is augmented with a spherical harmonics (SH) color parameter $\mathbf{r}_i$. For any viewing direction $\phi$ (a view index here, unrelated to the network subscript in $\epsilon_\phi$), the SH parameter produces the per-Gaussian, view-dependent “segmentation-SH” image via

$\mathbf{r}_i^\phi = SH(\mathbf{r}_i, \phi)\,, \quad \mathbf{R}^\phi = \mathcal{D}\left\{ \mathbf{r}_i^\phi \right\}_{i=1}^N,$

with $\mathcal{D}$ as the splatting rasterizer. The parameters $\{\mathbf{r}_i\}$ are fit by matching the aggregated SH-projected images $\mathbf{R}^\phi$ against multi-view attention maps for the target part, minimizing the $\ell_1$ difference between them.

Averaging over all views produces a mean feature vector $\bar{\mathbf{r}}_i$, which, in combination with cosine similarity against label embeddings $\{l_k\}$, yields the soft-label probability distribution for each Gaussian:

$p_{ik} = \frac{\exp\left(\bar{\mathbf{r}}_i \cdot l_k / (\|\bar{\mathbf{r}}_i\|\,\|l_k\|)\right)}{\sum_{k'} \exp\left(\bar{\mathbf{r}}_i \cdot l_{k'} / (\|\bar{\mathbf{r}}_i\|\,\|l_{k'}\|)\right)}.$

For a target label $l_j$, the 3D mask value is $M_{3D}(i) = p_{ij}$. Projecting this 3D mask into each rendered 2D view $\phi$ produces a soft mask:

$M^\phi(u) = \sum_i \mathcal{D}_{u,i} M_{3D}(i),$

where $\mathcal{D}_{u,i}$ is the contribution of Gaussian $i$ to pixel $u$. This soft mask modulates the SDS-induced gradient flow so that optimization is spatially localized to the part of interest.
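The two mask formulas can be sketched as follows, assuming the mean features $\bar{\mathbf{r}}_i$, the label embeddings $l_k$, and the per-pixel contribution weights $\mathcal{D}_{u,i}$ are already available as plain lists. The function names and the toy values are hypothetical and are not taken from RoMaP.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def soft_labels(r_bar, label_embeddings):
    """p_ik: softmax over labels k of the cosine similarity cos(r_bar_i, l_k)."""
    probs = []
    for r in r_bar:
        sims = [cosine(r, l) for l in label_embeddings]
        top = max(sims)                              # subtract max for numerical stability
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def project_mask(contrib, mask_3d):
    """M^phi(u) = sum_i D[u][i] * M_3D(i), with one row of `contrib` per pixel u."""
    return [sum(d * m for d, m in zip(row, mask_3d)) for row in contrib]

# Three Gaussians with 2-D mean features, two labels; label index 0 is the target part.
r_bar = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
labels = [[1.0, 0.0], [0.0, 1.0]]
p = soft_labels(r_bar, labels)
mask_3d = [row[0] for row in p]                      # M_3D(i) = p_ij with j = 0

# Assumed contribution weights D[u][i] for three pixels.
contrib = [[1.0, 0.0, 0.0],
           [0.0, 0.5, 0.5],
           [0.0, 0.0, 1.0]]
mask_2d = project_mask(contrib, mask_3d)
```

In this toy setup the Gaussian whose feature is equidistant from both labels receives a mask value of exactly 0.5, which illustrates why the mask is soft rather than binary.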

3. Regularized Soft-masked SDS Loss Formulation

RoMaP’s soft-masked SDS loss combines three main terms: (1) a mask-restricted SDS term, (2) an $\ell_1$ “anchor” loss employing a part-edited image as target, and (3) a Gaussian prior removal term for local context control:

$L_{\rm total} = \lambda_s\,\hat{L}_{\rm SDS}(c^\phi_{\rm pr}, p_{\rm edit}) + \lambda_a\,\hat{L}_1(c^\phi_{\rm pr}, \mathrm{SLaMP}(c^\phi_{\rm pr})) + \lambda_g L_{\rm prior},$

where $\hat{L}_{\rm SDS}$ and $\hat{L}_1$ are masked by $M^\phi$, $p_{\rm edit}$ is the edit prompt, and $c^\phi_{\rm pr}$ is the 2D rendering after prior removal.

The masked SDS term evaluates

$\hat{L}_{\rm SDS} = \mathbb{E}_{t,\epsilon}\left[ \sum_u M^\phi(u) \left\| \epsilon_\phi(x_t(u); t, p_{\rm edit}) - \epsilon(u) \right\|^2 \right]$

Each pixel's contribution is weighted by the soft mask, so pixels with $M^\phi(u) \approx 0$ contribute almost nothing and the semantic guidance stays localized. Gradient steps are taken only for Gaussians whose mask value exceeds a set threshold; the remaining Gaussians are frozen, confining change to the target region.
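A small worked instance of the masked term, for one $(t, \epsilon)$ sample on a three-pixel image, may help make the weighting concrete. The helper names and the numbers are illustrative assumptions; `eps_pred` stands in for the output of $\epsilon_\phi$.

```python
def masked_sds_loss(mask, eps_pred, eps):
    """sum_u M(u) * (eps_pred(u) - eps(u))^2 for a single (t, eps) sample."""
    return sum(m * (p - e) ** 2 for m, p, e in zip(mask, eps_pred, eps))

def masked_residual(mask, eps_pred, eps):
    """Per-pixel masked residual M(u) * (eps_pred(u) - eps(u))."""
    return [m * (p - e) for m, p, e in zip(mask, eps_pred, eps)]

mask = [1.0, 0.5, 0.0]           # inside the part, on its soft boundary, outside
eps_pred = [0.3, -0.2, 2.0]      # stand-in values for the network prediction
eps = [0.1, 0.2, -1.0]           # sampled noise

loss = masked_sds_loss(mask, eps_pred, eps)         # 1*0.04 + 0.5*0.16 + 0*9.0
residual = masked_residual(mask, eps_pred, eps)
```

The third pixel has by far the largest raw residual, yet it contributes nothing because its mask value is zero.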

4. Auxiliary Regularization: Anchor Loss and Gaussian Prior Removal

Two additional regularizers are integrated to maintain edit locality and improve edit stability:

Anchor loss (masked $\ell_1$ ):

$\hat{L}_1 = \sum_u M^\phi(u) \left| c^\phi_{\rm pr}(u) - \mathrm{SLaMP}(c^\phi_{\rm pr})(u) \right|$

This term enforces that Gaussians inside the mask produce part-edited renderings consistent with those generated by Scheduled Latent Mixing and Part (SLaMP) editing, thus anchoring part modifications.
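The anchor term can be sketched on RGB pixels as below, assuming the SLaMP output is already available as an image of the same size. The name `masked_l1` and the pixel values are hypothetical, and the absolute difference is summed over color channels, which is one possible reading of $|\cdot|$.

```python
def masked_l1(mask, render, anchor):
    """sum_u M(u) * |render(u) - anchor(u)|, with |.| summed over color channels."""
    return sum(
        m * sum(abs(r - a) for r, a in zip(px_render, px_anchor))
        for m, px_render, px_anchor in zip(mask, render, anchor)
    )

mask = [1.0, 0.25, 0.0]
render = [(0.5, 0.5, 0.5), (0.2, 0.2, 0.2), (0.9, 0.1, 0.1)]   # c_pr for one view
anchor = [(0.7, 0.5, 0.4), (0.2, 0.6, 0.2), (0.0, 0.0, 0.0)]   # stand-in for SLaMP(c_pr)

anchor_loss = masked_l1(mask, render, anchor)   # 1.0*0.3 + 0.25*0.4 + 0.0*1.1
```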

Gaussian prior removal:

The rendered color is kept inside the mask and blended toward a neutral color (e.g., gray or white) outside it:

$c^\phi_{\rm pr}(u) = M^\phi(u) c^\phi(u) + (1 - M^\phi(u)) c_{\rm neutral}.$

Gaussians with negligible mask activation ( $M_{3D}(i)\approx 0$ ) are frozen and receive no updates, which acts as a hard barrier against error propagation or drift outside the intended region.
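The compositing formula and the freezing rule can be illustrated on scalar pixel values as follows. The gray level `neutral=0.5` and the threshold `tau=0.1` are assumed values chosen for the example, and both function names are hypothetical.

```python
def remove_prior(mask, render, neutral=0.5):
    """c_pr(u) = M(u) * c(u) + (1 - M(u)) * c_neutral, per pixel."""
    return [m * c + (1.0 - m) * neutral for m, c in zip(mask, render)]

def trainable_flags(mask_3d, tau=0.1):
    """True for Gaussians that may be updated, False for frozen ones (M_3D < tau)."""
    return [m >= tau for m in mask_3d]

mask_2d = [1.0, 0.5, 0.0]
render = [0.9, 0.1, 0.3]
c_pr = remove_prior(mask_2d, render)      # inside kept, boundary blended, outside gray

mask_3d = [0.8, 0.05, 0.3]
flags = trainable_flags(mask_3d)          # the second Gaussian is frozen
```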

This combination of soft masking, anchor, and prior-removal regularization explicitly stabilizes edits and constrains transformations to the spatial target provided by the 3D-GALP-generated mask (Kim et al., 15 Jul 2025).

5. Scheduled Latent Mixing and Part (SLaMP) Editing

SLaMP is an auxiliary procedure for generating high-quality, part-modified anchor images, which form the reference for the anchor loss. At each diffusion timestep $t$, SLaMP interpolates latent codes such that only the target region changes:

$\mathbf{z}_{t+1} = \mathbf{z}_t \left[1 - \mathcal{F}_t (1 - M_{2D})\right] + \mathbf{z}_{t, \rm orig}\left[ \mathcal{F}_t (1 - M_{2D}) \right],$

where $\mathcal{F}_t$ is a sharp schedule (low for $t < t_s$, high for $t > t_s$). While $\mathcal{F}_t$ is low, the whole latent evolves freely; once it is high, latents outside the mask are reset toward the original $\mathbf{z}_{t, \rm orig}$, so the non-target region remains unedited after the warmup period. Decoding the final latent yields $\mathrm{SLaMP}(c^\phi_{\rm pr})$.
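One mixing step of this update can be sketched on flattened latents as shown below. The sigmoid form of `schedule`, its `sharpness` parameter, and the treatment of $t$ as normalized progress in $[0, 1]$ are assumptions made for illustration; the source only states that $\mathcal{F}_t$ switches sharply from low to high at $t_s$.

```python
import math

def schedule(t, t_s, sharpness=50.0):
    """Hypothetical sharp schedule F_t: near 0 for t < t_s, near 1 for t > t_s."""
    return 1.0 / (1.0 + math.exp(-sharpness * (t - t_s)))

def slamp_mix(z_t, z_orig, mask_2d, f_t):
    """z_next = z_t * (1 - F_t * (1 - M)) + z_orig * (F_t * (1 - M)), elementwise."""
    mixed = []
    for z, z0, m in zip(z_t, z_orig, mask_2d):
        reset = f_t * (1.0 - m)          # how strongly this element is pulled back
        mixed.append(z * (1.0 - reset) + z0 * reset)
    return mixed

z_t = [1.0, 1.0, 1.0]        # edited latent
z_orig = [0.0, 0.0, 0.0]     # original latent at the same step
mask = [1.0, 0.5, 0.0]       # inside the part, on the boundary, outside

early = slamp_mix(z_t, z_orig, mask, f_t=0.0)   # warmup: the edit is kept everywhere
late = slamp_mix(z_t, z_orig, mask, f_t=1.0)    # afterwards: outside is fully reset
```

With `f_t=1.0` the element outside the mask returns to its original value while the element inside the mask keeps the edit, which is the behaviour the formula is meant to produce.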

Through the anchor loss, SLaMP prevents drift outside the part: only the part is edited and background consistency is preserved, mitigating artifacts common in unrestricted SDS adaptations.

6. Operational Workflow and Empirical Comparison to Vanilla SDS

The practical optimization workflow per iteration and view $\phi$ is as follows:

Render $c^\phi$ using the current Gaussian parameters.
Project the 3D-GALP mask into the view to obtain the soft mask $M^\phi$.
Apply Gaussian prior removal with $M^\phi$ to yield $c^\phi_{\rm pr}$.
Sample $(t, \epsilon)$ and form noised input $x_t$ .
Compute diffusion model prediction $\epsilon_\phi(x_t; t, p_{\rm edit})$ .
Formulate and back-propagate the masked SDS residual $[M^\phi \cdot (\epsilon_\phi - \epsilon)]$ .
Generate the part-edited anchor image via SLaMP, compute masked $\ell_1$ loss to $c^\phi_{\rm pr}$ , back-propagate.
Freeze gradients for Gaussians where $M_{3D}(i) < \tau$ .
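The loop below is a toy rendition of these steps, not the RoMaP implementation. It assumes one scalar color per Gaussian, an identity rasterizer (so $M^\phi$ equals $M_{3D}$), and a hypothetical predictor `toy_eps_pred` that pulls every pixel toward a target value; the SLaMP anchor step is left out. The mask is applied before prior removal because the compositing formula needs it, and the masked residual is used directly as the gradient on the rendering, as SDS implementations commonly do.

```python
import random

def toy_eps_pred(x_t, alpha, sigma, target):
    """Hypothetical stand-in for eps_phi(x_t; t, p_edit): the noise implied
    if every pixel of the clean image had the value `target`."""
    return [(x - alpha * target) / sigma for x in x_t]

def edit_step(params, mask, rng, target=0.6, neutral=0.5,
              alpha=0.8, sigma=0.6, lr=0.1, tau=0.1):
    render = list(params)                                  # render c^phi (identity rasterizer)
    c_pr = [m * c + (1.0 - m) * neutral                    # Gaussian prior removal
            for m, c in zip(mask, render)]
    eps = [rng.gauss(0.0, 1.0) for _ in c_pr]              # sample noise
    x_t = [alpha * c + sigma * e for c, e in zip(c_pr, eps)]
    eps_pred = toy_eps_pred(x_t, alpha, sigma, target)     # diffusion prediction
    residual = [m * (p - e)                                # masked SDS residual
                for m, p, e in zip(mask, eps_pred, eps)]
    grads = [m * r for m, r in zip(mask, residual)]        # chain rule: d c_pr / d c = M
    return [p - lr * g if m >= tau else p                  # freeze where M_3D < tau
            for p, g, m in zip(params, grads, mask)]

rng = random.Random(0)
mask = [1.0, 0.5, 0.05]        # target part, soft boundary, background
params = [0.2, 0.2, 0.2]
for _ in range(200):
    params = edit_step(params, mask, rng)
```

After the loop, the fully masked Gaussian has reached the target color, the boundary Gaussian has moved only part of the way, and the background Gaussian is untouched because its mask value falls below `tau`.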

In contrast, vanilla SDS would transmit semantic gradients indiscriminately, leading to unintended edits and drift throughout the entire model, eroding local geometry and appearance. The soft-masked approach, modulating both SDS and anchor loss terms, ensures edit locality and prevents interference with non-targeted regions (Kim et al., 15 Jul 2025).

7. Context, Significance and Implications

Soft-masked SDS loss as formalized in RoMaP resolves a fundamental challenge in 3D generative editing: the ability to control, confine, and stabilize semantic edits to arbitrary object parts within Gaussian Splatting-based 3D representations. By introducing a robust, view-consistent soft mask via 3D-GALP, augmenting the SDS loss with prior-removal and anchor-based regularization, and leveraging SLaMP for faithful anchor supervision, this approach avoids both the ambiguity and global artifacts inherent in prior unconstrained SDS workflows. The methodology provides a direct and practical means to achieve local, high-fidelity part editing and is extendable to a variety of instance-level 3D generative and editing applications (Kim et al., 15 Jul 2025).


References (1)

Kim et al. (2025). Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling.
