
RemEdit Diffusion Framework

Updated 1 February 2026
  • The paper introduces a novel diffusion-based framework that leverages Riemannian manifold navigation and dual-SLERP blending to achieve high-fidelity and controllable image edits.
  • It employs goal-aware prompt enrichment via a vision-language model and task-specific attention pruning to enhance semantic consistency and accelerate inference.
  • Empirical benchmarks on datasets like CelebA-HQ and LSUN-Church demonstrate superior accuracy (S_dir up to 0.1982) and rapid runtimes, affirming its practical effectiveness.

RemEdit is a diffusion-based image editing framework designed to reconcile the trade-off between semantic fidelity and inference speed in controllable generative AI. It achieves this through a synergistic integration of Riemannian geometry for latent space navigation, dual spherical interpolation (SLERP), grounded prompt enrichment, and task-specific attention pruning. RemEdit demonstrates state-of-the-art editing accuracy and real-time performance benchmarks across various datasets, substantiating its practicality and robustness for high-fidelity image manipulation (Adhikarla et al., 25 Jan 2026).

1. Riemannian Manifold Navigation in Latent Space

RemEdit models the U-Net bottleneck feature space ("h-space") as a Riemannian manifold $(\mathcal M, g)$ of dimension $N = C \times H \times W$. The metric tensor $g_{ij}(h) = \langle \partial_i, \partial_j \rangle$ induces inner products on the tangent space $T_h\mathcal M$, enabling the computation of geodesics, paths that follow the data distribution rather than mere straight Euclidean offsets. The framework learns a non-trivial affine connection $\nabla$ parameterized by Christoffel symbols $\Gamma^k_{ij}(h)$, which underlie geodesic calculations in feature space.

A lightweight "Mamba" network $f_\theta$ takes the concatenated feature vector $y_0 = \mathrm{concat}(h,\;\mathrm{timestep})$ and predicts the Christoffel symbols. In parallel, a tangent-vector predictor $v_\phi$ outputs an initial velocity $v_0 = v_\phi(y_0)$, constrained to the unit ball via tanh-based retraction. Geodesic endpoints are obtained by integrating the following ODE system over $t \in [0, 1]$:

$$\begin{cases} \dot{p}(t) = q(t) \\ \dot{q}(t) = -\Gamma(p(t))[q(t), q(t)] \end{cases}$$

yielding $\exp_h(v_0) = \gamma(1)$ and the geodesic edit $\Delta h_{\rm geo} = \exp_h(v_0) - h$. The model is trained to match example edits $\Delta h_{\rm gt}$ using the objective

$$\mathcal L_{\mathrm{geo}} = \mathbb E_{h,\,\Delta h_{\rm gt}}\Bigl\|\exp_h(v_0;\,\Gamma_\theta) - (h + \Delta h_{\rm gt})\Bigr\|^2$$

This Riemannian approach preserves the semantics of edits by respecting the intrinsic structure of the latent distribution.
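The geodesic integration above can be sketched with a fixed-step Euler solver. In the following minimal NumPy example, `christoffel` is a stand-in for the paper's Mamba-based predictor $f_\theta$, and the choice of Euler integration is an assumption (the paper does not specify its solver):

```python
import numpy as np

def integrate_geodesic(h, v0, christoffel, n_steps=100):
    """Integrate the geodesic ODE  p' = q,  q' = -Γ(p)[q, q]
    with fixed-step Euler over t in [0, 1], returning exp_h(v0) ≈ γ(1).

    `christoffel(p)` is assumed to return an (N, N, N) array of
    Christoffel symbols Γ^k_{ij} at point p (a placeholder for the
    learned predictor f_θ)."""
    p, q = h.copy(), v0.copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        gamma = christoffel(p)                        # Γ^k_{ij}(p)
        accel = -np.einsum("kij,i,j->k", gamma, q, q)  # -Γ(p)[q, q]
        p = p + dt * q
        q = q + dt * accel
    return p

# Sanity check with a flat connection (Γ ≡ 0): the geodesic reduces
# to a straight line, so exp_h(v0) = h + v0 exactly.
h = np.zeros(3)
v0 = np.array([1.0, 2.0, -0.5])
flat = lambda p: np.zeros((3, 3, 3))
endpoint = integrate_geodesic(h, v0, flat)
```

With a non-zero connection the curvature term bends the path, and $\Delta h_{\rm geo}$ is then `endpoint - h`.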

2. Dual-SLERP Blending for Edit and Identity Control

RemEdit employs a dual-SLERP approach, hierarchically interpolating between the original and edited states in both feature and noise spaces. Given two unit vectors $u$ and $v$ with angle $\Omega = \arccos\langle u, v\rangle$, the SLERP is defined as

$$\mathrm{SLERP}(u, v; \alpha) = \frac{\sin((1-\alpha)\Omega)}{\sin\Omega}\,u + \frac{\sin(\alpha\Omega)}{\sin\Omega}\,v$$

The inner SLERP applies this to $h$ (the original latent) and $h_{\rm geo}$ (the geodesic endpoint), enabling continuous modulation of edit strength:

$$h' = \mathrm{SLERP}(h, h_{\rm geo}; \alpha_{\rm inner})$$

The outer SLERP operates in noise space after the U-Net forward passes. It separates semantic change (the identity-orthogonal component) from fidelity, fusing the predictions as

$$x_0' = \mathrm{SLERP}(x_0^{\rm fid}, o; \alpha_{\rm outer}), \quad o = x_0^{\rm sem} - \frac{\langle x_0^{\rm sem}, x_0^{\rm fid}\rangle}{\|x_0^{\rm fid}\|^2}\,x_0^{\rm fid}$$

This dual mechanism enables explicit, fine-grained control over semantic transformation and identity retention.
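A minimal sketch of the SLERP formula and the outer projection-then-blend step, assuming plain NumPy vectors (in the actual framework these operations run on latent and noise tensors):

```python
import numpy as np

def slerp(u, v, alpha):
    """SLERP(u, v; α) = sin((1-α)Ω)/sin(Ω) · u + sin(αΩ)/sin(Ω) · v,
    where Ω is the angle between the normalized inputs."""
    un, vn = u / np.linalg.norm(u), v / np.linalg.norm(v)
    omega = np.arccos(np.clip(np.dot(un, vn), -1.0, 1.0))
    if omega < 1e-8:                      # nearly parallel: fall back to lerp
        return (1 - alpha) * u + alpha * v
    return (np.sin((1 - alpha) * omega) * u
            + np.sin(alpha * omega) * v) / np.sin(omega)

def outer_blend(x_fid, x_sem, alpha_outer):
    """Outer SLERP: remove the x_fid component from x_sem (leaving the
    identity-orthogonal part o), then blend fidelity with o."""
    o = x_sem - (np.dot(x_sem, x_fid) / np.dot(x_fid, x_fid)) * x_fid
    return slerp(x_fid, o, alpha_outer)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
mid = slerp(u, v, 0.5)   # midpoint of the arc between orthogonal unit vectors
```

The lerp fallback guards against division by $\sin\Omega \approx 0$ when the two vectors are nearly parallel, a standard numerical precaution for SLERP.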

3. Goal-Aware Semantic Prompt Enrichment

Textual edit prompts are typically under-specified for targeted manipulations (e.g., "face → face with makeup"). RemEdit enriches prompts by extracting a fine-grained caption $c_{\rm enrich}$ from the input image $x_0$ using a pretrained vision-language model (Qwen2-VL):

$$c_{\rm enrich} = \mathrm{Qwen2VL}(x_0)$$

The semantic edit direction $d_{\rm edit}$ is then computed in text-embedding space (e.g., via CLIP):

$$d_{\rm edit} = E_{\rm text}(c_{\rm target}) - E_{\rm text}(c_{\rm enrich})$$

This grounds edits in the actual image content, enhancing consistency and specificity without additional training: prompt enrichment is a single forward pass through the VLM.
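The direction computation itself is a simple embedding difference. The sketch below uses a toy deterministic embedding purely to show the assembly; `embed_text` is a hypothetical placeholder for a CLIP text encoder, and the enriched caption would in practice come from Qwen2-VL:

```python
import hashlib
import numpy as np

def embed_text(text, dim=8):
    """Toy deterministic unit-norm text embedding (a placeholder for a
    real CLIP text encoder E_text; NOT the actual model)."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    e = np.random.default_rng(seed).standard_normal(dim)
    return e / np.linalg.norm(e)

def edit_direction(c_target, c_enrich):
    """d_edit = E_text(c_target) - E_text(c_enrich)."""
    return embed_text(c_target) - embed_text(c_enrich)

# `c_enrich` would be the fine-grained caption of the input image.
d = edit_direction("face with makeup", "a close-up photo of a face")
```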

4. Task-Specific Attention Pruning

RemEdit introduces a novel attention pruning mechanism to accelerate inference while maintaining semantic fidelity. Each self-attention block's feature map $X \in \mathbb R^{B \times C \times H \times W}$ is reshaped into tokens $T \in \mathbb R^{B \times N \times C}$, with $N = H \cdot W$. An MLP pruner $\mathcal P_\theta$ processes $(T, d_{\rm edit})$ to produce soft importance scores $S \in [0, 1]^{B \times N}$. During inference, a pruning rate $\rho$ determines the top $k = \lfloor N(1-\rho)\rfloor$ tokens to retain; attention is computed only over these:

$$A = \mathrm{Softmax}\!\left(\frac{Q_{\rm kept}K_{\rm kept}^\top}{\sqrt C}\right)V_{\rm kept}$$

The pruner is trained to optimize

$$\mathcal L = \|X_{\rm out}^{\rm pruned} - X_{\rm out}^{\rm full}\|^2 + \lambda_{\rm sparsity}\sum_{i} S_i$$

enabling effective acceleration without degrading semantic edit quality.
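The top-$k$ selection and restricted attention can be sketched as follows. This is a simplification assuming identity $Q$, $K$, $V$ projections and a single batch element; the real blocks use learned projection matrices and an MLP pruner conditioned on $d_{\rm edit}$:

```python
import numpy as np

def pruned_self_attention(T, scores, rho):
    """Self-attention restricted to the top-k tokens by importance score.

    T:      (N, C) token matrix (batch dimension omitted for clarity)
    scores: (N,) soft importance scores S from the pruner
    rho:    pruning rate; k = floor(N * (1 - rho)) tokens are kept."""
    N, C = T.shape
    k = int(np.floor(N * (1 - rho)))
    kept = np.argsort(scores)[::-1][:k]   # indices of the top-k tokens
    Tk = T[kept]                          # (k, C) retained tokens
    logits = Tk @ Tk.T / np.sqrt(C)       # Q_kept K_kept^T / sqrt(C)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ Tk, kept             # (k, C) attended output

rng = np.random.default_rng(0)
T = rng.standard_normal((16, 4))
S = rng.random(16)
out, kept = pruned_self_attention(T, S, rho=0.5)   # keeps 8 of 16 tokens
```

At $\rho = 0.5$ the attention matrix shrinks from $N \times N$ to $(N/2) \times (N/2)$, which is where the quadratic-cost savings come from.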

5. Empirical Performance and Benchmarks

RemEdit demonstrates state-of-the-art results on several benchmark datasets. Quantitative metrics on CelebA-HQ ($256 \times 256$) include directionality score ($S_{\rm dir}$), segmentation consistency, Fréchet Inception Distance (FID), and runtime:

| Method | $S_{\rm dir}$ ↑ | Seg. Consistency ↑ | FID* | Time (s) ↓ |
|---|---|---|---|---|
| Asyrp (h-space) | 0.1900 | 87.9% | 24.3 | 28.9 |
| LEdits++ | 0.1820 | 89.7% | 22.5 | 20.1 |
| RemEdit (full) | 0.1982 | 92.4% | 19.8 | 2.8 |

(*FID evaluated on 250 edited samples vs. held-out real images.)

On LSUN-Church, RemEdit similarly outperforms Diffusion-CLIP and BoundaryDiffusion in both $S_{\rm dir}$ and segmentation consistency. Ablation experiments (CelebA-HQ, "smiling" edit) further show that geodesic navigation and dual-SLERP blending deliver superior semantic accuracy, and that up to 50% attention pruning ($\rho = 0.5$) preserves most of the fidelity while reducing runtime to approximately 2.31 s.

6. End-to-End Algorithmic Structure

RemEdit’s computational workflow integrates the above components into a coherent editing pipeline:

c_enrich = Qwen2-VL(x)
d_edit   = E_text(T_target) - E_text(c_enrich)

h = DDIM_invert(x; t, S_for)

y = concat(h, timestep_embedding)
v = Retract(v_φ(y))
Γ  = f_θ(y)
γ(1) = integrate_geodesic(h, v, Γ)
Δh_geo = γ(1) - h

h_geo = h + Δh_geo
h    = SLERP(h, h_geo; α_in)

for each self-attention block:
    T = reshape_features_to_tokens(h)
    S = P_θ(T, d_edit)
    k = floor(N * (1 - ρ))
    I = top_k_indices(S, k)
    h = sparse_self_attention(h, I)

x^fid = U-Net_forward(h)
x^sem = U-Net_forward(h + Δh_geo)

o  = x^sem - proj_{x^fid}(x^sem)
x' = SLERP(x^fid, o; α_out)

return x'

This architecture demonstrates the synthesis of geometric, linguistic, and computational efficiency innovations, each contributing distinctly to controllable, high-speed image editing.

7. Significance and Implications

RemEdit's unified approach (learned Riemannian manifold navigation, dual-stage SLERP blending, VLM-based prompt enrichment, and semantic attention pruning) addresses core limitations of prior diffusion editors. It sets new standards in edit fidelity ($S_{\rm dir}$ up to 0.1982 on CelebA-HQ) and runtime ($\sim 2.3$ s at 50% pruning), evidencing both superior semantic accuracy and practical deployment feasibility. A plausible implication is that future generative editing frameworks may increasingly rely on data-driven geometric structures and task-aware resource allocation to reconcile interpretability, controllability, and efficiency (Adhikarla et al., 25 Jan 2026).
