
RemEdit Diffusion Framework

Updated 1 February 2026
  • The paper introduces a novel diffusion-based framework that leverages Riemannian manifold navigation and dual-SLERP blending to achieve high-fidelity and controllable image edits.
  • It employs goal-aware prompt enrichment via a vision-language model and task-specific attention pruning to enhance semantic consistency and accelerate inference.
  • Empirical benchmarks on datasets like CelebA-HQ and LSUN-Church demonstrate superior accuracy (S_dir up to 0.1982) and rapid runtimes, affirming its practical effectiveness.

RemEdit is a diffusion-based image editing framework designed to reconcile the trade-off between semantic fidelity and inference speed in controllable generative AI. It achieves this through a synergistic integration of Riemannian geometry for latent space navigation, dual spherical interpolation (SLERP), grounded prompt enrichment, and task-specific attention pruning. RemEdit demonstrates state-of-the-art editing accuracy and real-time performance benchmarks across various datasets, substantiating its practicality and robustness for high-fidelity image manipulation (Adhikarla et al., 25 Jan 2026).

1. Riemannian Manifold Navigation in Latent Space

RemEdit models the U-Net bottleneck feature space ("h-space") as a Riemannian manifold $(\mathcal M, g)$ of dimension $N = C \times H \times W$. The metric tensor $g_{ij}(h) = \langle \partial_i, \partial_j \rangle$ induces inner products on the tangent space $T_h\mathcal M$, enabling the computation of geodesics, paths that follow the data distribution rather than mere straight Euclidean offsets. The framework learns a non-trivial affine connection $\nabla$ parameterized by Christoffel symbols $\Gamma^k_{ij}(h)$, which underlie geodesic calculations in feature space.

A lightweight "Mamba" network $f_\theta$ takes the concatenated feature vector $y_0 = \mathrm{concat}(h,\;\mathrm{timestep})$ and predicts the Christoffel symbols. In parallel, a tangent-vector predictor $v_\phi$ outputs an initial velocity $v_0 = v_\phi(y_0)$, constrained to the unit ball via tanh-based retraction. Geodesic endpoints are obtained by integrating the following ODE system over $t \in [0, 1]$:

$$\begin{cases} \dot{p}(t) = q(t) \\ \dot{q}(t) = -\Gamma(p(t))[q(t), q(t)] \end{cases}$$

yielding $\exp_h(v_0) = \gamma(1)$ and the geodesic edit $\Delta h_{\rm geo} = \exp_h(v_0) - h$. The model is trained to match example edits $\Delta h_{\rm gt}$ using the objective

$$\mathcal L_{\mathrm{geo}} = \mathbb E_{h,\,\Delta h_{\rm gt}}\Bigl\|\exp_h(v_0;\,\Gamma_\theta) - (h + \Delta h_{\rm gt})\Bigr\|^2$$

This Riemannian approach preserves the semantics of edits by respecting the intrinsic structure of the latent distribution.
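The geodesic integration above can be sketched with a fixed-step Euler solver. In the following minimal NumPy example, `christoffel` is a stand-in for the paper's Mamba-based predictor $f_\theta$, and the choice of Euler integration is an assumption (the paper does not specify its solver):

```python
import numpy as np

def integrate_geodesic(h, v0, christoffel, n_steps=100):
    """Integrate the geodesic ODE  p' = q,  q' = -Γ(p)[q, q]
    with fixed-step Euler over t in [0, 1], returning exp_h(v0) ≈ γ(1).

    `christoffel(p)` is assumed to return an (N, N, N) array of
    Christoffel symbols Γ^k_{ij} at point p (a placeholder for the
    learned predictor f_θ)."""
    p, q = h.copy(), v0.copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        gamma = christoffel(p)                        # Γ^k_{ij}(p)
        accel = -np.einsum("kij,i,j->k", gamma, q, q)  # -Γ(p)[q, q]
        p = p + dt * q
        q = q + dt * accel
    return p

# Sanity check with a flat connection (Γ ≡ 0): the geodesic reduces
# to a straight line, so exp_h(v0) = h + v0 exactly.
h = np.zeros(3)
v0 = np.array([1.0, 2.0, -0.5])
flat = lambda p: np.zeros((3, 3, 3))
endpoint = integrate_geodesic(h, v0, flat)
```

With a non-zero connection the curvature term bends the path, and $\Delta h_{\rm geo}$ is then `endpoint - h`.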

2. Dual-SLERP Blending for Edit and Identity Control

RemEdit employs a dual-SLERP approach, hierarchically interpolating between the original and edited states in both feature and noise spaces. Given two unit vectors $u$ and $v$ with angle $\Omega = \arccos\langle u, v\rangle$, the SLERP is defined as

$$\mathrm{SLERP}(u, v; \alpha) = \frac{\sin((1-\alpha)\Omega)}{\sin\Omega}\,u + \frac{\sin(\alpha\Omega)}{\sin\Omega}\,v$$

The inner SLERP applies this to $h$ (the original latent) and $h_{\rm geo}$ (the geodesic endpoint), enabling continuous modulation of edit strength:

$$h' = \mathrm{SLERP}(h, h_{\rm geo}; \alpha_{\rm inner})$$

The outer SLERP operates in noise space after the U-Net forward passes. It separates semantic change (the identity-orthogonal component) from fidelity, fusing the predictions as

$$x_0' = \mathrm{SLERP}(x_0^{\rm fid}, o; \alpha_{\rm outer}), \quad o = x_0^{\rm sem} - \frac{\langle x_0^{\rm sem}, x_0^{\rm fid}\rangle}{\|x_0^{\rm fid}\|^2}\,x_0^{\rm fid}$$

This dual mechanism enables explicit, fine-grained control over semantic transformation and identity retention.
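A minimal sketch of the SLERP formula and the outer projection-then-blend step, assuming plain NumPy vectors (in the actual framework these operations run on latent and noise tensors):

```python
import numpy as np

def slerp(u, v, alpha):
    """SLERP(u, v; α) = sin((1-α)Ω)/sin(Ω) · u + sin(αΩ)/sin(Ω) · v,
    where Ω is the angle between the normalized inputs."""
    un, vn = u / np.linalg.norm(u), v / np.linalg.norm(v)
    omega = np.arccos(np.clip(np.dot(un, vn), -1.0, 1.0))
    if omega < 1e-8:                      # nearly parallel: fall back to lerp
        return (1 - alpha) * u + alpha * v
    return (np.sin((1 - alpha) * omega) * u
            + np.sin(alpha * omega) * v) / np.sin(omega)

def outer_blend(x_fid, x_sem, alpha_outer):
    """Outer SLERP: remove the x_fid component from x_sem (leaving the
    identity-orthogonal part o), then blend fidelity with o."""
    o = x_sem - (np.dot(x_sem, x_fid) / np.dot(x_fid, x_fid)) * x_fid
    return slerp(x_fid, o, alpha_outer)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
mid = slerp(u, v, 0.5)   # midpoint of the arc between orthogonal unit vectors
```

The lerp fallback guards against division by $\sin\Omega \approx 0$ when the two vectors are nearly parallel, a standard numerical precaution for SLERP.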

3. Goal-Aware Semantic Prompt Enrichment

Textual edit prompts are typically under-specified for targeted manipulations (e.g., "face → face with makeup"). RemEdit enriches prompts by extracting a fine-grained caption $c_{\rm enrich}$ from the input image $x_0$ using a pretrained vision-language model (Qwen2-VL):

$$c_{\rm enrich} = \mathrm{Qwen2VL}(x_0)$$

The semantic edit direction $d_{\rm edit}$ is then computed in text-embedding space (e.g., via CLIP):

$$d_{\rm edit} = E_{\rm text}(c_{\rm target}) - E_{\rm text}(c_{\rm enrich})$$

This grounds edits in the actual image content, enhancing consistency and specificity without additional training: prompt enrichment is a single forward pass through the VLM.
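The direction computation itself is a simple embedding difference. The sketch below uses a toy deterministic embedding purely to show the assembly; `embed_text` is a hypothetical placeholder for a CLIP text encoder, and the enriched caption would in practice come from Qwen2-VL:

```python
import hashlib
import numpy as np

def embed_text(text, dim=8):
    """Toy deterministic unit-norm text embedding (a placeholder for a
    real CLIP text encoder E_text; NOT the actual model)."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    e = np.random.default_rng(seed).standard_normal(dim)
    return e / np.linalg.norm(e)

def edit_direction(c_target, c_enrich):
    """d_edit = E_text(c_target) - E_text(c_enrich)."""
    return embed_text(c_target) - embed_text(c_enrich)

# `c_enrich` would be the fine-grained caption of the input image.
d = edit_direction("face with makeup", "a close-up photo of a face")
```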

4. Task-Specific Attention Pruning

RemEdit introduces a novel attention pruning mechanism to accelerate inference while maintaining semantic fidelity. Each self-attention block's feature map $X \in \mathbb R^{B \times C \times H \times W}$ is reshaped into tokens $T \in \mathbb R^{B \times N \times C}$, with $N = H \cdot W$. An MLP pruner $\mathcal P_\theta$ processes $(T, d_{\rm edit})$ to produce soft importance scores $S \in [0, 1]^{B \times N}$. During inference, a pruning rate $\rho$ determines the top $k = \lfloor N(1-\rho)\rfloor$ tokens to retain; attention is computed only over these:

$$A = \mathrm{Softmax}\!\left(\frac{Q_{\rm kept}K_{\rm kept}^\top}{\sqrt C}\right)V_{\rm kept}$$

The pruner is trained to optimize

$$\mathcal L = \|X_{\rm out}^{\rm pruned} - X_{\rm out}^{\rm full}\|^2 + \lambda_{\rm sparsity}\sum_{i} S_i$$

enabling effective acceleration without degrading semantic edit quality.
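The top-$k$ selection and restricted attention can be sketched as follows. This is a simplification assuming identity $Q$, $K$, $V$ projections and a single batch element; the real blocks use learned projection matrices and an MLP pruner conditioned on $d_{\rm edit}$:

```python
import numpy as np

def pruned_self_attention(T, scores, rho):
    """Self-attention restricted to the top-k tokens by importance score.

    T:      (N, C) token matrix (batch dimension omitted for clarity)
    scores: (N,) soft importance scores S from the pruner
    rho:    pruning rate; k = floor(N * (1 - rho)) tokens are kept."""
    N, C = T.shape
    k = int(np.floor(N * (1 - rho)))
    kept = np.argsort(scores)[::-1][:k]   # indices of the top-k tokens
    Tk = T[kept]                          # (k, C) retained tokens
    logits = Tk @ Tk.T / np.sqrt(C)       # Q_kept K_kept^T / sqrt(C)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ Tk, kept             # (k, C) attended output

rng = np.random.default_rng(0)
T = rng.standard_normal((16, 4))
S = rng.random(16)
out, kept = pruned_self_attention(T, S, rho=0.5)   # keeps 8 of 16 tokens
```

At $\rho = 0.5$ the attention matrix shrinks from $N \times N$ to $(N/2) \times (N/2)$, which is where the quadratic-cost savings come from.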

5. Empirical Performance and Benchmarks

RemEdit demonstrates state-of-the-art results on several benchmark datasets. Quantitative metrics on CelebA-HQ ($256 \times 256$) include directionality score ($S_{\rm dir}$), segmentation consistency, Fréchet Inception Distance (FID), and runtime:

| Method | $S_{\rm dir}$ ↑ | Seg. Consistency ↑ | FID* | Time (s) ↓ |
|---|---|---|---|---|
| Asyrp (h-space) | 0.1900 | 87.9% | 24.3 | 28.9 |
| LEdits++ | 0.1820 | 89.7% | 22.5 | 20.1 |
| RemEdit (full) | 0.1982 | 92.4% | 19.8 | 2.8 |

(*FID evaluated on 250 edited samples vs. held-out real images.)

On LSUN-Church, RemEdit similarly outperforms Diffusion-CLIP and BoundaryDiffusion in both $S_{\rm dir}$ and segmentation consistency. Ablation experiments (CelebA-HQ, "smiling" edit) further show that geodesic navigation and dual-SLERP blending deliver superior semantic accuracy, and that up to 50% attention pruning ($\rho = 0.5$) preserves most of the fidelity while reducing runtime to approximately 2.31 s.

6. End-to-End Algorithmic Structure

RemEdit’s computational workflow integrates the above components into a coherent editing pipeline:

c_enrich = Qwen2-VL(x)
d_edit   = E_text(T_target) - E_text(c_enrich)

h = DDIM_invert(x; t, S_for)

y = concat(h, timestep_embedding)
v = Retract(v_φ(y))
Γ  = f_θ(y)
γ(1) = integrate_geodesic(h, v, Γ)
Δh_geo = γ(1) - h

h_geo = h + Δh_geo
h    = SLERP(h, h_geo; α_in)

for each self-attention block:
    T = reshape_features_to_tokens(h)
    S = P_θ(T, d_edit)
    k = floor(N * (1 - ρ))
    I = top_k_indices(S, k)
    h = sparse_self_attention(h, I)

x^fid = U-Net_forward(h)
x^sem = U-Net_forward(h + Δh_geo)

o  = x^sem - proj_{x^fid}(x^sem)
x' = SLERP(x^fid, o; α_out)

return x'

This architecture demonstrates the synthesis of geometric, linguistic, and computational efficiency innovations, each contributing distinctly to controllable, high-speed image editing.

7. Significance and Implications

RemEdit's unified approach (learned Riemannian manifold navigation, dual-stage SLERP blending, VLM-based prompt enrichment, and semantic attention pruning) addresses core limitations of prior diffusion editors. It sets new standards in edit fidelity ($S_{\rm dir}$ up to 0.1982 on CelebA-HQ) and runtime ($\sim 2.3$ s at 50% pruning), evidencing both superior semantic accuracy and practical deployment feasibility. A plausible implication is that future generative editing frameworks may increasingly rely on data-driven geometric structures and task-aware resource allocation to reconcile interpretability, controllability, and efficiency (Adhikarla et al., 25 Jan 2026).
