Hierarchical Proxy-Based Image Representation
- The paper presents a framework that disentangles semantic, geometric, and textural elements, enabling precise and controllable image reconstruction and editing.
- It employs adaptive Bézier curve fitting and multi-scale meshing to construct a hierarchical proxy geometry capturing both boundaries and interior structures.
- The approach integrates disentangled texture embedding with spatial feature indexing to support interactive, physics-driven editing and efficient image compression.
A hierarchical proxy-based parametric image representation is a framework for decomposing and encoding images into independent, disentangled parameter spaces—semantic, geometric, and textural—using a multi-level hierarchy of proxy nodes. Such a representation provides direct, fine-grained correspondence between latent variables and semantic instances or object parts, supporting high-fidelity reconstruction, physically-plausible interactive editing, and compact parametrization. The approach is characterized by semantic-aware decomposition, hierarchical proxy geometry construction (using adaptive Bézier boundaries and multi-scale meshing), implicit texture coding on distributed proxies, and spatial feature indexing for regularity. The disentanglement enables intuitive editing and physics-driven animation, overcoming key weaknesses of both traditional explicit (e.g., raster, Gaussian primitives) and implicit (e.g., latent fields) image representations (Chen et al., 2 Feb 2026, Chen et al., 14 Oct 2025).
1. Foundations and Motivations
Prevailing image representation schemes are divided between explicit formats (raster images, Gaussian primitives) and implicit neural fields (e.g., SIREN-based, NeRF variants). Explicit methods encode geometry and appearance directly but suffer from redundancy, hindering editing. Implicit schemes offer continuous representations but lack interpretable, semantically-aligned latent spaces, limiting controllable manipulation. Hierarchical proxy-based parametric image representations address these issues by:
- Segmenting the image into semantic layers using pretrained instance segmentation models (e.g., SAM) and sorting by instance depth (Depth Anything), yielding independent semantic “layers” for background and foreground objects (Chen et al., 2 Feb 2026).
- Constructing a hierarchy of proxy nodes per layer, capturing both boundary and interior structure at multiple scales via adaptive curve fitting and spatial refinement (Chen et al., 2 Feb 2026, Chen et al., 14 Oct 2025).
- Disentangling geometry (proxy positions/codings) and texture (per-proxy or grid-based codes), ensuring that geometric edits do not entangle with appearance (Chen et al., 14 Oct 2025).
This framework supports manipulation at the instance or part level, reduces redundancy, and yields continuous, editable representations.
2. Hierarchical Proxy Geometry Construction
The hierarchical proxy framework organizes geometric representation into layers and scales:
- Semantic Layering: The initial color image is decomposed into semantic instance masks, with background and foreground layers identified by their mean instance depth (Chen et al., 2 Feb 2026).
- Boundary Fitting (Adaptive Bézier Curves): Each layer’s exterior shape is modeled by a set of adjacent Bézier curve segments, initialized with a modest segment count and then iteratively refined. Each segment $B_i$ is defined by its control points $\{P_{i,k}\}$ via the Bernstein form

$$B_i(t) = \sum_{k=0}^{n} \binom{n}{k} (1-t)^{n-k}\, t^{k}\, P_{i,k}, \qquad t \in [0, 1],$$

with refinement driven by the Chamfer distance between the Bézier approximation and the binary mask contour, adaptively splitting segments whose error exceeds a threshold (Chen et al., 2 Feb 2026).
- Interior Meshing (Multi-scale Proxies): The mask interior is meshed adaptively. An initial mesh (via TriWild) produces vertices and triangles; triangles exhibiting high average image gradient are subdivided recursively (to separate maximum levels for foreground and background layers), producing a multi-scale mesh (Chen et al., 2 Feb 2026).
- Proxy Hierarchy Extension: Additional proxies are introduced in sparsely covered regions: wherever the maximal distance to existing proxies exceeds a threshold, a new node is placed and triangulation is repeated, enforcing fine spatial coverage (Chen et al., 14 Oct 2025).
The result is a distributed set of 2D proxy coordinates encoding both the boundary and nuanced internal geometry of each semantic instance.
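The adaptive boundary-fitting loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the error threshold `tau`, the sampling density, and the per-segment error measure are assumptions, and the control-point re-optimization that would follow each split is omitted.

```python
import numpy as np

def bezier_points(ctrl, n=32):
    """Sample a cubic Bezier segment (4x2 control points) at n parameters."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])

def split_bezier(ctrl):
    """Exact de Casteljau subdivision of a cubic segment at t = 0.5."""
    p01, p12, p23 = (ctrl[0] + ctrl[1]) / 2, (ctrl[1] + ctrl[2]) / 2, (ctrl[2] + ctrl[3]) / 2
    p012, p123 = (p01 + p12) / 2, (p12 + p23) / 2
    mid = (p012 + p123) / 2
    return (np.array([ctrl[0], p01, p012, mid]),
            np.array([mid, p123, p23, ctrl[3]]))

def refine(segments, contour, tau=1.0, max_iter=5):
    """Split segments whose fitting error to the mask contour exceeds tau.

    Error here is the one-sided (curve -> contour) term of the Chamfer
    distance; re-optimizing control points after each split is omitted.
    """
    for _ in range(max_iter):
        new, split_any = [], False
        for ctrl in segments:
            pts = bezier_points(ctrl)
            # Mean distance from curve samples to their nearest contour points.
            err = np.linalg.norm(
                pts[:, None] - contour[None, :], axis=-1).min(axis=1).mean()
            if err > tau:
                new.extend(split_bezier(ctrl))
                split_any = True
            else:
                new.append(ctrl)
        segments = new
        if not split_any:
            break
    return segments
```

Because de Casteljau subdivision is shape-preserving, splitting alone does not reduce the error; it only creates shorter segments whose control points can then be fitted more tightly to the contour.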
3. Parametric Texture Embedding and Feature Indexing
Texture information is associated with proxy nodes at multiple levels:
- Boundary and Interior Features: Each boundary proxy (on the fitted Bézier curves) and interior vertex (in the multi-scale mesh) is linked to a learnable feature vector. Boundary features are interpolated at a pixel by inverse-distance weighting; interior features are interpolated barycentrically within the triangle containing the pixel (Chen et al., 2 Feb 2026).
- MLP Decoding: The interpolated feature for a pixel $p$ is concatenated with a fixed high-frequency (sinusoidal/Fourier) encoding of $p$ and passed through an MLP to predict the RGB color:

$$\hat{c}(p) = \mathrm{MLP}\big([\gamma(p);\, f(p)]\big),$$

where $\gamma$ denotes the positional encoding and $f(p)$ the interpolated feature.
- Locality-Adaptive Feature Indexing: To further regularize and compress the representation, a low-resolution learnable feature grid is associated with each layer. Proxy node features are bilinearly interpolated from the four nearest grid cells, enforcing spatial smoothness and compactness independent of node count (Chen et al., 2 Feb 2026).
This redundant-yet-factorized structure supports continuous, high-fidelity synthesis, semantic editing, and robust background completion.
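The per-pixel decoding path (barycentric interpolation, fixed sinusoidal encoding, small MLP head) can be sketched as below. All shapes, the number of frequencies, and the two-layer randomly initialized head are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def barycentric(p, tri):
    """Barycentric coordinates of 2D point p in triangle tri (3x2)."""
    a, b, c = tri
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])

def fourier_encode(p, n_freq=4):
    """Fixed high-frequency (sin/cos) encoding of a 2D position."""
    freqs = 2.0 ** np.arange(n_freq) * np.pi
    angles = np.outer(freqs, p).ravel()          # (n_freq * 2,)
    return np.concatenate([np.sin(angles), np.cos(angles)])

def decode_pixel(p, tri, vert_feats, W1, b1, W2, b2):
    """Interpolate vertex features at p and decode RGB with a 2-layer MLP."""
    bc = barycentric(p, tri)                     # (3,) interpolation weights
    feat = bc @ vert_feats                       # (F,) interpolated feature
    x = np.concatenate([fourier_encode(p), feat])
    h = np.maximum(0.0, x @ W1 + b1)             # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid -> RGB in [0, 1]
```

In a trained model the MLP weights and the per-vertex features are the optimized parameters; the encoding and the interpolation scheme stay fixed.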
4. Disentanglement, Loss Functions, and Training
Crucial to this approach is explicit separation of semantic/geometry (proxy coordinates, boundary control points) and texture (proxy or grid features):
- Disentangled Spaces: Geometric edits operate solely on control points or mesh vertices, leaving texture features invariant. Texture modification is achieved by altering or swapping proxy codes, or by explicit optimization (e.g., Score Distillation Sampling for generative editing) (Chen et al., 2 Feb 2026, Chen et al., 14 Oct 2025).
- Reconstruction Loss: Supervised training minimizes the pixel-wise error between the original and reconstructed image:

$$\mathcal{L}_{\text{rec}} = \sum_{p} \big\lVert \hat{c}(p) - c(p) \big\rVert^{2}.$$
- Background Completion and Regularization: Holes left by editing or removal are filled via the background feature grid, smoothed by total variation (TV) regularization on the grid features:

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{TV}}\, \mathcal{L}_{\text{TV}},$$

with $\lambda_{\text{TV}}$ a regularization coefficient (Chen et al., 2 Feb 2026).
- Generative Editing: Texture features can be further refined by optimizing an SDS loss on feature codes under a text/image conditioning prompt (Chen et al., 2 Feb 2026).
No cross-coupling occurs; geometry and appearance remain independently manipulable throughout all stages.
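A minimal numpy sketch of the combined objective, reconstruction error plus TV regularization on a per-layer feature grid, follows. The squared-error form and the default `lambda_tv` value are assumptions for illustration; the paper's exact norm and coefficient are not reproduced here.

```python
import numpy as np

def reconstruction_loss(pred, target):
    """Mean squared pixel-wise error between rendered and ground-truth image."""
    return np.mean((pred - target) ** 2)

def tv_loss(grid):
    """Total variation on an (H, W, F) feature grid: penalize neighbor jumps."""
    dh = np.sum((grid[1:, :, :] - grid[:-1, :, :]) ** 2)
    dw = np.sum((grid[:, 1:, :] - grid[:, :-1, :]) ** 2)
    return (dh + dw) / grid.size

def total_loss(pred, target, grid, lambda_tv=1e-3):
    """Combined objective: reconstruction term plus weighted TV smoothness."""
    return reconstruction_loss(pred, target) + lambda_tv * tv_loss(grid)
```

Because the TV term acts only on the feature grid and the reconstruction term only on rendered pixels, gradients with respect to geometry and texture parameters remain cleanly separated.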
5. Algorithms and Computational Procedures
Key algorithms specify adaptive geometry extraction, mesh refinement, and feature embedding:
| Algorithm | Inputs | Key Operations |
|---|---|---|
| Adaptive Bézier Fitting | Layer mask, initial segment count, error threshold | Optimize control points via Chamfer loss; split/refine segments as needed |
| Multi-scale Meshing | Layer mask, image gradients, base mesh resolution | Generate initial mesh; recursively subdivide triangles on gradient criterion |
| Texture Embedding/Training | Proxy set, feature grids, MLP decoder | Assign features, interpolate per-pixel code, train via reconstruction and TV losses |
Procedures guarantee coverage, reuse, and smooth interpolation suitable for image (and when extended, video) parameterizations (Chen et al., 2 Feb 2026, Chen et al., 14 Oct 2025).
6. Empirical Performance and Practical Applications
When evaluated on standard image reconstruction and editing benchmarks, hierarchical proxy-based representations offer compelling advantages:
- Image Reconstruction (ImageNet 512×512): Achieves high-fidelity reconstruction with roughly 60K parameters and 2.6 minutes of optimization on an RTX 3090 (Chen et al., 2 Feb 2026).
- Image Compression (DIV2K): Achieves competitive rate-distortion performance with 4 fps decoding.
- Semantic and Texture Editing: Outperforms or matches best explicit/implicit schemes (e.g., GaussianImage, SIREN-5) in geometric (HumanEdit) and appearance (OIR-Bench) manipulations.
- Image Animation: When coupled with position-based dynamics solvers, supports real-time, physically-plausible animation (e.g., 4 fps, FID = 52.5 on an anime test set), with temporal consistency ensured by proxy-linked textures (Chen et al., 2 Feb 2026).
Applications include interactive editing (drag-and-drop, semantic rearrangement), text-driven generative retouching, and physically-inspired animations.
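The physics-driven animation path can be illustrated with a minimal position-based dynamics (PBD) step over proxy vertices: predict positions under gravity, then iteratively project distance constraints between connected proxies. Because texture features stay attached to the proxies, appearance follows the deformation automatically. The constants, equal-mass assumption, and simple edge-constraint model are illustrative, not the paper's solver.

```python
import numpy as np

def pbd_step(pos, vel, edges, rest_len, dt=1.0 / 30.0, iters=8,
             gravity=np.array([0.0, -9.8])):
    """Advance proxy positions one PBD step with distance constraints.

    pos, vel: (N, 2) arrays; edges: list of (i, j) index pairs;
    rest_len: rest length per edge.
    """
    pred = pos + dt * (vel + dt * gravity)       # explicit position prediction
    for _ in range(iters):                       # Gauss-Seidel projection
        for (i, j), L in zip(edges, rest_len):
            d = pred[j] - pred[i]
            dist = np.linalg.norm(d)
            if dist < 1e-9:
                continue
            corr = 0.5 * (dist - L) * d / dist   # equal-mass correction
            pred[i] += corr
            pred[j] -= corr
    new_vel = (pred - pos) / dt                  # velocity from position delta
    return pred, new_vel
```

Per frame, the updated proxy positions re-pose the mesh, and the unchanged texture codes are re-decoded at the deformed geometry, which is what keeps the animation temporally consistent.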
7. Connections to Vectorized and Video Proxy Frameworks
The methodology is closely linked to hierarchical proxy-based video representations, which generalize the principle to spatio-temporally consistent proxy nodes spanning frames (Chen et al., 14 Oct 2025). In single-image adaptations:
- Semantic layering and progressive proxy coverage (contour, high-gradient, coverage augmentation) mirror the mesh-and-code approach of ProxyImg.
- Geometry and texture remain strictly disentangled, with spatial barycentric interpolation for feature decoding.
- Absence of temporal propagation mandates explicit geometric coverage and regularization, but yields a fully differentiable, edit-friendly parametric form.
A plausible implication is that future integrations across image, video, and 3D modalities may further harmonize semantic, geometric, and textural disentanglement with hierarchical proxy frameworks.
References:
- ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding (Chen et al., 2 Feb 2026)
- Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding (Chen et al., 14 Oct 2025)