RemEdit Diffusion Framework
- The paper introduces a novel diffusion-based framework that leverages Riemannian manifold navigation and dual-SLERP blending to achieve high-fidelity and controllable image edits.
- It employs goal-aware prompt enrichment via a vision-language model and task-specific attention pruning to enhance semantic consistency and accelerate inference.
- Empirical benchmarks on datasets like CelebA-HQ and LSUN-Church demonstrate superior accuracy (S_dir up to 0.1982) and rapid runtimes, affirming its practical effectiveness.
RemEdit is a diffusion-based image editing framework designed to reconcile the trade-off between semantic fidelity and inference speed in controllable generative AI. It achieves this through a synergistic integration of Riemannian geometry for latent space navigation, dual spherical linear interpolation (SLERP), grounded prompt enrichment, and task-specific attention pruning. RemEdit demonstrates state-of-the-art editing accuracy and fast inference across various benchmark datasets, substantiating its practicality and robustness for high-fidelity image manipulation (Adhikarla et al., 25 Jan 2026).
1. Riemannian Manifold Navigation in Latent Space
RemEdit models the U-Net bottleneck feature space ("h-space") as a Riemannian manifold $\mathcal{M}$ of dimension $d$. The metric tensor induces inner products on the tangent space $T_h\mathcal{M}$, enabling the computation of geodesics: paths following the data distribution instead of mere straight Euclidean offsets. The framework learns a non-trivial affine connection parameterized by Christoffel symbols $\Gamma^k_{ij}$, which underlie geodesic calculations in feature space.
A lightweight "Mamba" network takes the concatenated feature vector $y_0 = \mathrm{concat}(h, \tau_t)$ (bottleneck features plus a timestep embedding) and predicts the Christoffel symbols. In parallel, a tangent-vector predictor outputs an initial velocity $v_0$, constrained to the unit ball via a tanh-based retraction. Geodesic endpoints are obtained by integrating the geodesic ODE system over $t \in [0, 1]$,

$$\dot{\gamma}^k(t) = v^k(t), \qquad \dot{v}^k(t) = -\Gamma^k_{ij}\, v^i(t)\, v^j(t),$$

yielding $\gamma(1)$ and the geodesic edit $\Delta h_{\mathrm{geo}} = \gamma(1) - h$. The model is trained with a matching objective that aligns the predicted geodesic edit with example edits.
This Riemannian approach preserves the semantics of edits by respecting the intrinsic structure of the latent distribution.
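To make the geodesic step concrete, here is a minimal NumPy sketch of the ODE integration using an explicit Euler scheme. The function `integrate_geodesic` and the flat-connection example are illustrative assumptions, not the paper's implementation (which predicts the Christoffel symbols with a Mamba network and may use a different integrator):

```python
import numpy as np

def integrate_geodesic(h, v0, christoffel, steps=16):
    """Euler-integrate the geodesic ODE  dh/dt = v,
    dv^k/dt = -Gamma^k_ij v^i v^j  over t in [0, 1].

    h           : (d,) start point in h-space
    v0          : (d,) initial velocity (tangent vector)
    christoffel : callable h -> (d, d, d) array Gamma[k, i, j]
    """
    dt = 1.0 / steps
    pos, vel = h.astype(float), v0.astype(float)
    for _ in range(steps):
        gamma = christoffel(pos)                        # (d, d, d)
        acc = -np.einsum("kij,i,j->k", gamma, vel, vel)  # geodesic acceleration
        pos = pos + dt * vel
        vel = vel + dt * acc
    return pos                                          # gamma(1)

# Sanity check: a flat connection (all Christoffel symbols zero)
# recovers a straight line, endpoint = h + v0.
h = np.zeros(3)
v0 = np.array([1.0, 0.0, 0.0])
flat = lambda p: np.zeros((3, 3, 3))
endpoint = integrate_geodesic(h, v0, flat)
```

With a learned, non-flat connection, the same loop bends the trajectory to follow the data manifold rather than the straight Euclidean offset.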
2. Dual-SLERP Blending for Edit and Identity Control
RemEdit employs a dual-SLERP approach, hierarchically interpolating between the original and edited states in both feature and noise spaces. Given two unit vectors $a$ and $b$ with angle $\theta = \arccos(a \cdot b)$, the SLERP is defined as

$$\mathrm{SLERP}(a, b; \alpha) = \frac{\sin((1-\alpha)\theta)}{\sin\theta}\, a + \frac{\sin(\alpha\theta)}{\sin\theta}\, b.$$

The inner SLERP applies this to $h$ (original latent) and $h_{\mathrm{geo}}$ (geodesic endpoint), facilitating continuous modulation of edit strength: $h' = \mathrm{SLERP}(h, h_{\mathrm{geo}}; \alpha_{\mathrm{in}})$. The outer SLERP operates in noise space after the U-Net forward passes. It separates semantic change (the identity-orthogonal component) from fidelity, fusing the predictions as

$$o = \hat{x}_0^{\mathrm{sem}} - \mathrm{proj}_{\hat{x}_0^{\mathrm{fid}}}\!\left(\hat{x}_0^{\mathrm{sem}}\right), \qquad x_0' = \mathrm{SLERP}\!\left(\hat{x}_0^{\mathrm{fid}}, o; \alpha_{\mathrm{out}}\right).$$

This dual mechanism enables explicit, fine-grained control over semantic transformation and identity retention.
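A minimal implementation of the standard SLERP formula above (the linear-interpolation fallback for near-parallel vectors is a common numerical safeguard, assumed here rather than taken from the paper):

```python
import numpy as np

def slerp(a, b, alpha, eps=1e-7):
    """Spherical linear interpolation between vectors a and b at weight alpha."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    cos_theta = np.clip(np.dot(a, b), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < eps:                       # nearly parallel: fall back to lerp
        return (1 - alpha) * a + alpha * b
    s = np.sin(theta)
    return (np.sin((1 - alpha) * theta) / s) * a + (np.sin(alpha * theta) / s) * b

# Midpoint between orthogonal unit vectors stays on the unit sphere,
# unlike plain linear interpolation (which would have norm ~0.707).
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(x, y, 0.5)
```

Staying on the sphere is exactly why SLERP is preferred over linear blending in noise space: diffusion latents are approximately norm-preserving, and linear interpolation would shrink them.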
3. Goal-Aware Semantic Prompt Enrichment
Textual edit prompts are typically under-specified for targeted manipulations (e.g., "face" → "face with makeup"). RemEdit enriches prompts by extracting a fine-grained caption $c_{\mathrm{enrich}}$ from the input image using a pretrained vision-language model (Qwen2-VL). The semantic edit direction is then computed in text-embedding space (e.g., via CLIP) as

$$d_{\mathrm{edit}} = E_{\mathrm{text}}(T_{\mathrm{target}}) - E_{\mathrm{text}}(c_{\mathrm{enrich}}).$$

This approach grounds edits in actual image content, enhancing consistency and specificity without the need for additional training; prompt enrichment is a single forward pass through the VLM.
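The direction computation reduces to a normalized difference of text embeddings. A toy sketch, using a hypothetical bag-of-words encoder in place of CLIP's text encoder:

```python
import numpy as np

def edit_direction(embed, caption, target):
    """d_edit = E_text(target) - E_text(caption), L2-normalized.
    `embed` stands in for a real text encoder such as CLIP's."""
    d = embed(target) - embed(caption)
    return d / np.linalg.norm(d)

# Hypothetical bag-of-words "encoder" for illustration only:
vocab = ["face", "with", "makeup", "smiling"]
def toy_embed(text):
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

d = edit_direction(toy_embed, "face", "face with makeup")
```

In RemEdit the caption argument would be the VLM-enriched description of the actual input image rather than a generic source prompt, which is what grounds the direction in image content.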
4. Task-Specific Attention Pruning
RemEdit introduces a novel attention pruning mechanism to accelerate inference while maintaining semantic fidelity. Each self-attention block's feature map is reshaped into tokens $T \in \mathbb{R}^{N \times C}$, with $N = H \times W$. An MLP pruner processes the tokens, conditioned on the edit direction $d_{\mathrm{edit}}$, to produce soft importance scores $s \in [0, 1]^N$. During inference, a pruning rate $\rho$ determines the top $\lceil (1-\rho)N \rceil$ tokens to retain; attention is computed only over these tokens. The pruner is trained to preserve edit quality under pruning, enabling effective acceleration without degrading semantic edit quality.
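The token-selection step can be sketched as follows; the scores here are toy values standing in for the MLP pruner's output, and the helper name is an assumption:

```python
import numpy as np

def prune_tokens(tokens, scores, rho):
    """Keep the ceil((1 - rho) * N) highest-scoring tokens.

    tokens : (N, C) flattened feature map
    scores : (N,)  soft importance scores from the pruner
    rho    : pruning rate in [0, 1)
    """
    n = tokens.shape[0]
    k = int(np.ceil((1 - rho) * n))
    keep = np.sort(np.argsort(scores)[::-1][:k])   # top-k indices, in order
    return tokens[keep], keep

# N=4 tokens, C=2 channels; with rho=0.5 the two highest-scoring
# tokens (indices 0 and 2) survive.
tokens = np.arange(8, dtype=float).reshape(4, 2)
scores = np.array([0.9, 0.1, 0.8, 0.2])
kept, idx = prune_tokens(tokens, scores, rho=0.5)
```

Since self-attention cost scales quadratically in the token count, retaining half the tokens cuts the attention FLOPs roughly fourfold, which is where the runtime savings come from.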
5. Empirical Performance and Benchmarks
RemEdit demonstrates state-of-the-art results on several benchmark datasets. Quantitative metrics on CelebA-HQ include the directionality score (S_dir), segmentation consistency, Fréchet Inception Distance (FID), and runtime:
| Method | S_dir ↑ | Seg. Consistency ↑ | FID ↓ | Time (s) ↓ |
|---|---|---|---|---|
| Asyrp (h-space) | 0.1900 | 87.9% | 24.3 | 28.9 |
| LEdits++ | 0.1820 | 89.7% | 22.5 | 20.1 |
| RemEdit (full) | 0.1982 | 92.4% | 19.8 | 2.8 |
(FID evaluated on 250 edited samples vs. held-out real images.)
On LSUN-Church, RemEdit similarly outperforms Diffusion-CLIP and BoundaryDiffusion in both S_dir and segmentation consistency. Ablation experiments (CelebA-HQ, "smiling" edit) further show that geodesic navigation and dual-SLERP deliver superior semantic accuracy, and that up to 50% attention pruning (ρ = 0.5) preserves most fidelity while reducing runtime to approximately 2.31 s.
6. End-to-End Algorithmic Structure
RemEdit’s computational workflow integrates the above components into a coherent editing pipeline:
```
c_enrich = Qwen2-VL(x₀)
d_edit = E_text(T_target) - E_text(c_enrich)
h = DDIM_invert(x₀; t₀, S_for)
y₀ = concat(h, timestep_embedding)
v₀ = Retract(v_φ(y₀))
Γ = f_θ(y₀)
γ(1) = integrate_geodesic(h, v₀, Γ)
Δh_geo = γ(1) - h
h_geo = h + Δh_geo
h′ = SLERP(h, h_geo; α_in)
for each self-attention block:
    T = reshape_features_to_tokens(h′)
    S = P_θ(T, d_edit)
    I = top_k_indices(S, 1−ρ)
    h′ = sparse_self_attention(h′, I)
x₀^fid = U-Net_forward(h′)
x₀^sem = U-Net_forward(h′ + Δh_geo)
o = x₀^sem − proj_{x₀^fid}(x₀^sem)
x₀' = SLERP(x₀^fid, o; α_out)
return x₀'
```
This architecture demonstrates the synthesis of geometric, linguistic, and computational efficiency innovations, each contributing distinctly to controllable, high-speed image editing.
7. Significance and Implications
RemEdit's unified approach, combining learned Riemannian manifold navigation, dual-stage SLERP blending, prompt enrichment via a VLM, and semantic attention pruning, addresses core limitations of prior diffusion editors. It sets new standards in edit fidelity (S_dir up to 0.198 on CelebA-HQ) and runtime (approximately 2.31 s at 50% pruning), evidencing not only superior semantic accuracy but also practical deployment feasibility. A plausible implication is that future generative editing frameworks may increasingly rely on data-driven geometric structures and task-aware resource allocation to reconcile interpretability, controllability, and efficiency (Adhikarla et al., 25 Jan 2026).