HTB-SR: Region-Aware Portrait Restoration
- HTB-SR is a region-aware super-resolution approach that restores human head, torso, and background together to eliminate boundary artifacts and preserve identity.
- It integrates unified diffusion models with LoRA adapters, mask conditioning, and parallel branch architectures for specialized enhancement of each region.
- Extensive experiments using metrics like PSNR, SSIM, and FID demonstrate improved fidelity, realistic transitions, and seamless fusion across diverse portrait elements.
Head-Torso-Background Super-Resolution (HTB-SR) refers to region-aware methods for restoring and upscaling portrait images or video frames containing a human head, an adjoining torso, and a natural background. Unlike generic image super-resolution (ISR) and isolated facial super-resolution, HTB-SR addresses holistic tasks where fidelity, sharpness, and identity preservation must be balanced across these semantically distinct regions—often under region-specific constraints such as seamless fusion and dataset-specific supervision.
1. Technical Problem Formulation and Unique Challenges
The HTB-SR paradigm generalizes portrait super-resolution by treating the head, torso, and background as distinct but interrelated regions within the restoration process. Traditional approaches typically deploy separate models for face and non-face areas, followed by ad-hoc blending. These pipelines incur boundary artifacts at the face–torso seam due to divergent optimization objectives and domain gaps between regions, an effect clearly observed in prior “blending” pipelines (Li et al., 10 Oct 2025).
HTB-SR aims to resolve:
- Seamless synthesis around highly perceptually sensitive head regions.
- Restoration of natural torso proportions and movement (especially for talking-portrait videos).
- Background inpainting and replacement without visible bleed-through or discontinuities.
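The boundary-artifact problem that motivates these goals can be made concrete with a minimal sketch of the naive hard-mask blend that prior pipelines use; the function name and the toy constant "restorations" are illustrative assumptions, not the actual pipelines' code:

```python
import numpy as np

def naive_region_blend(face_sr, body_sr, face_mask):
    """Hard-mask compositing of separately restored regions.

    face_sr, body_sr: (H, W, 3) float arrays from two independent models.
    face_mask: (H, W) binary mask, 1 inside the head region.
    Because the two models were optimized with different objectives, their
    output statistics (color, sharpness) disagree at the mask boundary,
    producing exactly the seam artifacts HTB-SR is designed to avoid.
    """
    m = face_mask[..., None].astype(np.float32)
    return m * face_sr + (1.0 - m) * body_sr

# Toy example: two constant "restorations" make the hard seam explicit.
face = np.full((4, 4, 3), 0.9, dtype=np.float32)
body = np.full((4, 4, 3), 0.1, dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[:2, :] = 1.0  # top half is the head region
out = naive_region_blend(face, body, mask)
```

The abrupt 0.9 → 0.1 step at the mask edge is the domain gap that unified or alpha-blended approaches smooth out.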
2. Model Architectures and Region-Aware Representations
(a) Single-Model Diffusion Approaches
“HeadsUp! High-Fidelity Portrait Image Super-Resolution” (Li et al., 10 Oct 2025) advances a unified backbone for HTB-SR. The model is a one-step latent diffusion network starting from Stable Diffusion v2.1, incorporating LoRA adapters into both the VAE encoder and the noise-prediction UNet. Special conditioning occurs via:
- A downsampled binary face mask identifying the head region.
- An optional reference image to guide identity restoration for the head.
The UNet's first convolution is widened from 4 to 9 input channels, so that the latent of the low-quality input, the latent of the reference image, and the downsampled face mask (4 + 4 + 1 channels) can be concatenated as conditioning.
A single-step DDIM update then jointly restores all regions, and decoding the updated latent yields the restored image, with region-awareness reinforced by the mask and reference conditioning.
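The channel arithmetic behind the widened first convolution can be sketched as follows; variable names and the use of random latents are illustrative assumptions (SD-style latents carry 4 channels each):

```python
import numpy as np

# Shapes follow the text: 4 (LQ latent) + 4 (reference latent)
# + 1 (downsampled face mask) = 9 input channels for the UNet.
H = W = 32
z_lq  = np.random.randn(4, H, W).astype(np.float32)  # encoded low-quality input
z_ref = np.random.randn(4, H, W).astype(np.float32)  # encoded reference (optional)
mask  = np.zeros((1, H, W), dtype=np.float32)        # downsampled binary face mask
mask[:, :16, :] = 1.0                                # head occupies the top half

# Channel-wise concatenation produces the 9-channel conditioning tensor.
unet_input = np.concatenate([z_lq, z_ref, mask], axis=0)
```

When no reference image is available, the reference latent slot can be zeroed while keeping the same 9-channel layout.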
(b) Parallel Branch and Fusion Mechanisms
The HTB-SR module in “Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis” (Ye et al., 2024) employs three parallel branches:
- Head Super-Resolution: 4 upsampling blocks (nearest-neighbor, Conv2D, LeakyReLU), upscaling the low-res head.
- Torso Warping: Torso appearance encoder with 3 Conv2D layers; motion derived from 68 keypoints (KP) sampled from a fitted 3DMM. Dense Motion Estimator (DME) and Deformation-Based Decoder (DBD) yield translated torso features.
- Background Inpainting: K-nearest neighbor fill for missing pixels in the background segment, followed by a 3-layer convolutional appearance encoder.
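The background fill step can be sketched as a brute-force nearest-known-pixel copy; this is a minimal k=1 illustration, and the paper's actual choice of k and implementation may differ:

```python
import numpy as np

def knn_fill(bg, known):
    """Fill unknown background pixels from the nearest known pixel (k=1).

    bg: (H, W) float image; known: (H, W) boolean validity mask.
    Brute-force nearest-neighbour search, adequate for a small sketch.
    """
    H, W = bg.shape
    ky, kx = np.nonzero(known)          # coordinates of valid pixels
    out = bg.copy()
    for y in range(H):
        for x in range(W):
            if not known[y, x]:
                d2 = (ky - y) ** 2 + (kx - x) ** 2
                j = int(np.argmin(d2))  # index of nearest valid pixel
                out[y, x] = bg[ky[j], kx[j]]
    return out

# Two known corners; the two missing pixels are filled from a neighbour.
bg = np.array([[1.0, 0.0], [0.0, 4.0]])
known = np.array([[True, False], [False, True]])
filled = knn_fill(bg, known)
```

The filled segment is then passed through the convolutional appearance encoder described above.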
Fusion proceeds via learned alpha-blending masks that softly composite the head, torso, and background branch outputs, ensuring sharp, artifact-free seams.
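The soft compositing can be sketched as nested alpha blends; the head-over-torso-over-background ordering and the directly supplied masks are assumptions (in the model the masks come from learned layers):

```python
import numpy as np

def alpha_fuse(head, torso, bg, a_head, a_torso):
    """Soft alpha compositing of the three branch outputs.

    a_head, a_torso: blending masks in [0, 1]. Soft masks avoid the
    hard seams produced by plain concatenation of region outputs.
    """
    fg = a_torso * torso + (1.0 - a_torso) * bg   # torso over background
    return a_head * head + (1.0 - a_head) * fg    # head over the rest

head  = np.full((2, 2), 1.0)
torso = np.full((2, 2), 0.5)
bg    = np.zeros((2, 2))
a_h = np.array([[1.0, 0.5], [0.0, 0.0]])          # head mask fades to 0
a_t = np.ones((2, 2))                              # torso fully opaque
out = alpha_fuse(head, torso, bg, a_h, a_t)
```

Note how the 0.5 entry in `a_h` yields an intermediate value rather than a hard step, which is what eliminates visible seams at the head-torso boundary.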
3. Loss Functions and Supervision Strategies
HeadsUp (Li et al., 10 Oct 2025):
The composite loss (Eq. 2) combines:
- A global image loss (MSE + LPIPS + VSD) over the full image, supervising head, torso, and background jointly.
- A face-region loss, combining region-specific fidelity (MSE + LPIPS), identity (ArcFace embedding cosine similarity, with an optional reference), and a GAN-based adversarial term.
- Regularization: LoRA weight decay and network smoothness terms.
Notably, torso and background supervision is implicit via the global losses—no additional mask or region-specific loss is deployed.
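The structure of this objective — a global term plus a mask-restricted face term — can be sketched with MSE stand-ins; LPIPS, VSD, identity, and adversarial terms are omitted for brevity, and the weight `w_face` is an illustrative assumption:

```python
import numpy as np

def composite_loss(pred, target, face_mask, w_face=1.0):
    """Global MSE plus a face-region MSE, mirroring the two-term
    structure of HeadsUp's objective (simplified: perceptual,
    identity, and adversarial terms are not included here)."""
    global_loss = np.mean((pred - target) ** 2)        # supervises all regions
    m = face_mask.astype(bool)
    face_loss = np.mean((pred[m] - target[m]) ** 2) if m.any() else 0.0
    return global_loss + w_face * face_loss

pred   = np.zeros((4, 4))
target = np.ones((4, 4))
mask   = np.zeros((4, 4))
mask[:2, :] = 1                                        # head occupies the top half
loss = composite_loss(pred, target, mask)
```

The key design point survives the simplification: the torso and background receive gradient only through the global term, while the head is supervised twice.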
Real3D-Portrait (Ye et al., 2024):
The HTB-SR branch is trained on the full video frame:
- Pixel-wise loss.
- Perceptual loss uses activations from both VGG-19 and VGGFace.
- Dual adversarial discriminators (coarse and fine) following EG3D for realism and multi-view consistency.
4. Datasets, Training Protocols, and Evaluation Metrics
HeadsUp (Li et al., 10 Oct 2025):
- Dataset: PortraitSR-4K, 30,000 in-the-wild portrait images.
- Face Mask Generation: FaceLib detector, canonical alignment via affine warp.
- Training: 27,000 portraits, yielding 163,000 triplets (degraded input, high-quality target, reference image if available).
- Testing: 3,000 portraits, 190 with valid reference.
Metrics:
- Full-image (HTB total): PSNR, SSIM, LPIPS, DISTS, FID, NIQE, MUSIQ, MANIQA.
- Identity-score (mean ArcFace cosine over aligned heads).
- User-study win-rate on identity (WR_id) and face naturalness (WR_N).
- Implicit torso/background quality is captured via the global full-image metrics.
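The identity score used above reduces to a mean cosine similarity over embedding pairs, which can be sketched directly; the embedder is ArcFace in the paper, but any fixed face embedder fits this sketch:

```python
import numpy as np

def identity_score(emb_restored, emb_reference):
    """Mean cosine similarity between aligned face embeddings.

    emb_restored, emb_reference: (N, D) arrays of per-image embeddings
    from a fixed face recognition network.
    """
    a = emb_restored / np.linalg.norm(emb_restored, axis=1, keepdims=True)
    b = emb_reference / np.linalg.norm(emb_reference, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

# Toy embeddings: one identical pair (cos = 1), one orthogonal pair (cos = 0).
e1 = np.array([[1.0, 0.0], [0.0, 1.0]])
e2 = np.array([[1.0, 0.0], [1.0, 0.0]])
score = identity_score(e1, e2)  # (1.0 + 0.0) / 2 = 0.5
```

The same computation underlies CSIM in the Real3D-Portrait evaluation (Section 4).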
Real3D-Portrait (Ye et al., 2024):
- Dataset: CelebV-HQ (35,666 video clips).
- Preprocessing: Face parsing for masks; 3DMM fitting for keypoints.
- Training: All prior modules (I2P, motion adapter, volume renderer) frozen.
- Evaluation: Same-identity video-driven setting.
Metrics:
- CSIM (ArcFace cosine similarity), FID, AED (average expression distance), APD (average pose distance).
5. Effects of Architecture and Ablation Studies
HTB-SR module contribution in Real3D-Portrait is demonstrated through ablation (Table 4) (Ye et al., 2024):
| Method | CSIM | FID | AED | APD |
|---|---|---|---|---|
| Full HTB-SR | 0.758 | 42.37 | 0.138 | 0.022 |
| w/o Background Inpainting | 0.744 | 43.95 | — | — |
| w/ Concatenation (no alpha) | 0.737 | 46.38 | — | — |
| w/ Unsupervised KP | 0.746 | 44.86 | — | — |
This demonstrates that alpha-blending, explicit background inpainting, and predefined keypoints for the torso are crucial for optimal region-aware super-resolution and artifact-free synthesis.
HeadsUp ablations (Li et al., 10 Oct 2025) show that region-aware face losses are essential for balancing fidelity and identity:
| Loss Module | PSNR | ID-score | NIQE | FID |
|---|---|---|---|---|
| No face loss | 25.07 | 0.26 | — | — |
| + face fidelity term | 25.65 | 0.30 | — | — |
| + face identity term | 25.85 | 0.43 | 5.72 | — |
| + Both face losses | 25.74 | 0.36 | — | 99.55 |
| + Reference | — | 0.37 | — | — |
6. Region-Specific Qualitative Outcomes and Control
HTB-SR methods produce:
- Natural face synthesis and identity preservation: HeadsUp achieves seamless restoration of facial regions without visible seams or boundary artifacts, outperforming prior blending pipelines (Li et al., 10 Oct 2025).
- Realistic torso motion and background control: Real3D-Portrait supports switchable backgrounds at inference by swapping the background input, and yields convincing head-torso transitions even under extreme poses (Ye et al., 2024).
- Boundary sharpness: Alpha-blending fusion eliminates hollow-hair and transition blur prevalent with naive concatenation (Ye et al., 2024).
- Failure modes: Isolated face-only loss (identity or fidelity) leads to over-smooth outputs or reduced identity scores, necessitating composite region-aware objectives (Li et al., 10 Oct 2025).
7. Significance, Applications, and Limitations
HTB-SR is foundational for:
- High-fidelity social media portrait restoration at multi-region scale.
- One-shot realistic talking-portrait video synthesis with controllable backgrounds, central in avatar generation (Ye et al., 2024).
- End-to-end architectures that mitigate artifacts and domain shift intrinsic to region-blending approaches.
A plausible implication is that further architectural integration of explicit region cues (mask, reference, keypoints) with unified training pipelines is crucial for seamless multi-region super-resolution. Direct region losses are essential for perceptually sensitive head areas, whereas torso and background may rely on image-level objectives, yet fusion mechanisms (such as alpha-blending) remain indispensable for artifact-free outputs.
HTB-SR does not currently report standalone torso/background benchmarks, as full-image metrics suffice to assess their restoration quality implicitly (Li et al., 10 Oct 2025). This suggests future work may include explicit regionwise breakdowns or augmentations for more granular evaluation.
References
- "HeadsUp! High-Fidelity Portrait Image Super-Resolution" (Li et al., 10 Oct 2025)
- "Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis" (Ye et al., 2024)