HTB-SR: Region-Aware Portrait Restoration

Updated 30 January 2026
  • HTB-SR is a region-aware super-resolution approach that restores human head, torso, and background together to eliminate boundary artifacts and preserve identity.
  • It integrates unified diffusion models with LoRA adapters, mask conditioning, and parallel branch architectures for specialized enhancement of each region.
  • Extensive experiments using metrics like PSNR, SSIM, and FID demonstrate improved fidelity, realistic transitions, and seamless fusion across diverse portrait elements.

Head-Torso-Background Super-Resolution (HTB-SR) refers to region-aware methods for restoring and upscaling portrait images or video frames containing a human head, an adjoining torso, and a natural background. Unlike generic image super-resolution (ISR) and isolated facial super-resolution, HTB-SR addresses holistic tasks where fidelity, sharpness, and identity preservation must be balanced across these semantically distinct regions—often under region-specific constraints such as seamless fusion and dataset-specific supervision.

1. Technical Problem Formulation and Unique Challenges

The HTB-SR paradigm generalizes portrait super-resolution by treating the head, torso, and background as distinct but interrelated regions within the restoration process. Traditional approaches typically deploy separate models for face and non-face areas, followed by ad-hoc blending. These pipelines incur boundary artifacts at the face–torso seam due to divergent optimization objectives and domain gaps between regions, an effect clearly observed in prior “blending” pipelines (Li et al., 10 Oct 2025).

HTB-SR aims to resolve:

  • Seamless synthesis around highly perceptually sensitive head regions.
  • Restoration of natural torso proportions and movement (especially for talking-portrait videos).
  • Background inpainting and replacement without visible bleed-through or discontinuities.

2. Model Architectures and Region-Aware Representations

(a) Single-Model Diffusion Approaches

“HeadsUp! High-Fidelity Portrait Image Super-Resolution” (Li et al., 10 Oct 2025) advances a unified backbone for HTB-SR. The model is a one-step latent diffusion network starting from Stable Diffusion v2.1, incorporating LoRA adapters into both the VAE encoder and the noise-prediction UNet. Special conditioning occurs via:

  • Downsampled binary face mask $M$, identifying the head area.
  • (Optional) reference image $x_r$ to guide identity restoration for the head.

The input configuration widens the UNet's first convolution from 4 to 9 channels, concatenating the latent representations of $x_L$ (the low-quality input) and $x_r$ with the downsampled mask $M$.
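
As a sketch, the 9-channel input can be assembled by channel-wise concatenation (array shapes and the helper name are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical sketch of the widened UNet input: the 4-channel latents of the
# low-quality image (z_L) and the reference (z_r) are concatenated with the
# 1-channel downsampled face mask, giving 4 + 4 + 1 = 9 input channels.
def build_unet_input(z_L, z_r, mask_down):
    return np.concatenate([z_L, z_r, mask_down], axis=0)

z_L = np.zeros((4, 64, 64))       # latent of the low-quality input
z_r = np.zeros((4, 64, 64))       # latent of the reference image
mask = np.zeros((1, 64, 64))      # downsampled binary face mask
x_in = build_unet_input(z_L, z_r, mask)   # shape (9, 64, 64)
```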

A single-step DDIM update jointly restores all regions:

$$\hat{z}_H = \frac{z_L - \beta \, \tilde{\epsilon}_\theta(z_L, z_r, M^{\downarrow})}{\alpha}$$

Decoding yields the restored image $\hat{x}_H = D(\hat{z}_H)$, with region-awareness reinforced by mask and reference conditioning.
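
The single-step update can be sketched in a few lines, replacing the UNet's noise prediction with a placeholder array (the helper name and the alpha/beta values are illustrative, not the paper's):

```python
import numpy as np

# One-step DDIM-style latent update. eps_pred stands in for the UNet's noise
# prediction conditioned on (z_L, z_r, M); alpha and beta are the scheduler
# coefficients for the chosen timestep.
def one_step_restore(z_L, eps_pred, alpha, beta):
    return (z_L - beta * eps_pred) / alpha

rng = np.random.default_rng(0)
z_L = rng.standard_normal((4, 64, 64))   # latent of the degraded input
eps = rng.standard_normal((4, 64, 64))   # placeholder noise prediction
z_hat = one_step_restore(z_L, eps, alpha=0.9, beta=0.1)
# Decoding z_hat with the VAE decoder D would then yield the restored image.
```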

(b) Parallel Branch and Fusion Mechanisms

The HTB-SR module in “Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis” (Ye et al., 2024) employs three parallel branches:

  • Head Super-Resolution: 4 upsampling blocks (nearest-neighbor, Conv2D, LeakyReLU), upscaling the low-res head.
  • Torso Warping: Torso appearance encoder with 3 Conv2D layers; motion derived from 68 keypoints (KP) sampled from a fitted 3DMM. Dense Motion Estimator (DME) and Deformation-Based Decoder (DBD) yield translated torso features.
  • Background Inpainting: K-nearest neighbor fill for missing pixels in the background segment, followed by a 3-layer convolutional appearance encoder.
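
The interpolation stage of the head-SR upsampling blocks can be illustrated with a minimal nearest-neighbor 2x upsample (the Conv2D and LeakyReLU stages are omitted; the helper name is ours):

```python
import numpy as np

# Nearest-neighbor 2x upsampling, the interpolation stage of each head-SR
# upsampling block in this sketch.
def nn_upsample2x(x):
    # Repeat every pixel twice along height and width.
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

x = np.arange(4.0).reshape(2, 2)
y = nn_upsample2x(x)   # shape (4, 4): each pixel becomes a 2x2 block
```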

Fusion proceeds via learned alpha-blending masks to ensure sharp seams:

$$F_{\text{fuse}}(x,y) = \Bigl(F_{\text{head}} M_{\text{head}} + F_{\text{torso}} (1-M_{\text{head}})\Bigr) M_{\text{person}} + F_{\text{bg}} (1-M_{\text{person}})$$

where $M_{\text{person}} = M_{\text{head}} \lor M_{\text{torso}}$.
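
A minimal sketch of this fusion, assuming hard 0/1 masks in place of the learned soft masks:

```python
import numpy as np

# Sketch of the alpha-blend fusion. The masks here are hard 0/1 arrays for
# illustration; in the paper they are learned soft blending masks.
def fuse(F_head, F_torso, F_bg, M_head, M_torso):
    M_person = np.maximum(M_head, M_torso)   # union of head and torso masks
    person = F_head * M_head + F_torso * (1.0 - M_head)
    return person * M_person + F_bg * (1.0 - M_person)

# Three constant "feature maps" and masks tiling a 1x3 strip:
F_head, F_torso, F_bg = np.full((1, 3), 1.0), np.full((1, 3), 2.0), np.full((1, 3), 3.0)
M_head  = np.array([[1.0, 0.0, 0.0]])
M_torso = np.array([[0.0, 1.0, 0.0]])
out = fuse(F_head, F_torso, F_bg, M_head, M_torso)   # → [[1., 2., 3.]]
```

Each pixel takes its value from exactly one region, so no hollow seams appear between the three branches.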

3. Loss Functions and Supervision Strategies

HeadsUp (Li et al., 10 Oct 2025) optimizes a composite loss (Eq. 2):

$$\theta^* = \arg\min_\theta \mathbb{E}_{(x_L,x_H,x_r)\sim S} \big[ L_I(x_H, G_\theta(x_L)) + L_F(\Omega(x_H), \Omega(G_\theta(x_L)), x_r) + L_{\text{reg}}(G_\theta(x_L)) \big]$$

  • $L_I$: Global image loss (MSE + LPIPS + VSD) over the full image, supervising head, torso, and background jointly.
  • $L_F$: Face-region loss, combining region-specific fidelity (MSE + LPIPS), identity (ArcFace embedding cosine similarity, with optional reference), and a GAN-based adversarial term.
  • $L_{\text{reg}}$: LoRA weight decay and network smoothness regularization.

Notably, torso and background supervision is implicit via the global losses—no additional mask or region-specific loss is deployed.
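
A minimal sketch of how the global and face-region terms compose, with plain MSE standing in for the MSE + LPIPS (+ VSD) terms, $\Omega$ reduced to a crop, and the identity/adversarial parts omitted (crop box and weight are illustrative):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Global loss over the whole frame plus a face-region loss on the head crop.
def composite_loss(x_H, x_hat, face_box, w_face=1.0):
    L_I = mse(x_H, x_hat)          # global loss supervises all regions
    y0, y1, x0, x1 = face_box      # Omega: face-region crop
    L_F = mse(x_H[y0:y1, x0:x1], x_hat[y0:y1, x0:x1])
    return L_I + w_face * L_F

x = np.zeros((8, 8))
loss = composite_loss(x, x, face_box=(2, 6, 2, 6))   # → 0.0
```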

The Real3D-Portrait HTB-SR branch (Ye et al., 2024) is trained on the full $512\times 512$ video frame:

$$\mathcal{L}_{\mathrm{HTB}} = \lambda_1 \|I_{\mathrm{tgt}} - I_{\mathrm{out}}\|_1 + \lambda_2 \|\phi(I_{\mathrm{tgt}}) - \phi(I_{\mathrm{out}})\|_1 + \mathcal{L}_{\mathrm{DualAdv}}(I_{\mathrm{out}})$$

  • Pixel-wise $L_1$ loss.
  • Perceptual loss using activations $\phi(\cdot)$ from both VGG-19 and VGGFace.
  • Dual adversarial discriminators (coarse and fine) following EG3D for realism and multi-view consistency.
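
The first two terms of this loss can be sketched as follows (`phi` is a stand-in for the VGG-19/VGGFace feature extractor; the dual adversarial term is omitted and the lambda values are illustrative, not the paper's):

```python
import numpy as np

# Pixel-wise L1 plus feature-space L1, the non-adversarial part of L_HTB.
def htb_loss(I_tgt, I_out, phi, lam1=1.0, lam2=0.1):
    pixel = np.abs(I_tgt - I_out).mean()
    percep = np.abs(phi(I_tgt) - phi(I_out)).mean()
    return lam1 * pixel + lam2 * percep

I = np.ones((16, 16, 3))
loss = htb_loss(I, I, phi=lambda x: x.mean(axis=-1))   # → 0.0
```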

4. Datasets, Training Protocols, and Evaluation Metrics

HeadsUp (Li et al., 10 Oct 2025):

  • Dataset: PortraitSR-4K, 30,000 in-the-wild portrait images (resolution $\geq 3840\times 2160$), aspect ratio in $[0.6, 1.6]$.
  • Face Mask Generation: FaceLib detector, canonical alignment via affine warp.
  • Training: 27,000 portraits, yielding 163,000 triplets (degraded input, high-quality target, reference image if available).
  • Testing: 3,000 portraits, 190 with valid reference.

Metrics:

  • Full-image (head, torso, and background jointly): PSNR, SSIM, LPIPS, DISTS, FID, NIQE, MUSIQ, MANIQA.
  • Identity score: mean ArcFace cosine similarity $\phi$ over aligned heads.
  • User-study win rates for identity (WR_id) and face naturalness (WR_N).
  • Torso/background quality is captured implicitly via the global metrics under $L_I$.

Real3D-Portrait (Ye et al., 2024):

  • Dataset: CelebV-HQ (35,666 video clips at $512\times 512$).
  • Preprocessing: Face parsing for masks; 3DMM fitting for keypoints.
  • Training: All prior modules (I2P, motion adapter, volume renderer) frozen.
  • Evaluation: Same-identity video-driven setting.

Metrics:

  • CSIM (ArcFace cosine similarity), FID, AED (average expression distance), APD (average pose distance).
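
CSIM reduces to a cosine similarity over embedding vectors; a minimal sketch (the vectors below are arbitrary placeholders, not ArcFace outputs):

```python
import numpy as np

# CSIM: cosine similarity between two face embeddings. The papers use
# ArcFace embeddings; any fixed-length vectors work for illustration.
def csim(e1, e2):
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

e = np.array([0.5, -1.0, 2.0])
same = csim(e, e)   # → 1.0 (identical embeddings)
```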

5. Effects of Architecture and Ablation Studies

The contribution of the HTB-SR module in Real3D-Portrait is demonstrated through an ablation study (Table 4) (Ye et al., 2024):

Method                       | CSIM  | FID   | AED   | APD
Full HTB-SR                  | 0.758 | 42.37 | 0.138 | 0.022
w/o Background Inpainting    | 0.744 | 43.95 | –     | –
w/ Concatenation (no alpha)  | 0.737 | 46.38 | –     | –
w/ Unsupervised KP           | 0.746 | 44.86 | –     | –

This demonstrates that alpha-blending, explicit background inpainting, and predefined keypoints for the torso are crucial for optimal region-aware super-resolution and artifact-free synthesis.

HeadsUp ablations (Li et al., 10 Oct 2025) show that region-aware face losses are essential for balancing fidelity and identity:

Loss Module          | PSNR  | ID-score | NIQE | FID
No face loss         | 25.07 | 0.26     | –    | –
+ $L_{fid}^f$        | 25.65 | 0.30     | –    | –
+ $L_{id}^f$         | 25.85 | 0.43     | 5.72 | –
+ Both face losses   | 25.74 | 0.36     | –    | 99.55
+ Reference $x_r$    | –     | 0.37     | –    | –

6. Region-Specific Qualitative Outcomes and Control

HTB-SR methods produce:

  • Natural face synthesis and identity preservation: HeadsUp achieves seamless restoration of facial regions without visible seams or boundary artifacts, outperforming prior blending pipelines (Li et al., 10 Oct 2025).
  • Realistic torso motion and background control: Real3D-Portrait supports switchable backgrounds at inference by swapping the background input, and yields convincing head-torso transitions even under extreme poses (Ye et al., 2024).
  • Boundary sharpness: Alpha-blending fusion eliminates hollow-hair and transition blur prevalent with naive concatenation (Ye et al., 2024).
  • Failure modes: Isolated face-only loss (identity or fidelity) leads to over-smooth outputs or reduced identity scores, necessitating composite region-aware objectives (Li et al., 10 Oct 2025).

7. Significance, Applications, and Limitations

HTB-SR is foundational for:

  • High-fidelity social media portrait restoration at multi-region scale.
  • One-shot realistic talking-portrait video synthesis with controllable backgrounds, central in avatar generation (Ye et al., 2024).
  • End-to-end architectures that mitigate artifacts and domain shift intrinsic to region-blending approaches.

A plausible implication is that further architectural integration of explicit region cues (mask, reference, keypoints) with unified training pipelines is crucial for seamless multi-region super-resolution. Direct region losses are essential for perceptually sensitive head areas, whereas torso and background may rely on image-level objectives, yet fusion mechanisms (such as alpha-blending) remain indispensable for artifact-free outputs.

HTB-SR does not currently report standalone torso/background benchmarks, as full-image metrics suffice to assess their restoration quality implicitly (Li et al., 10 Oct 2025). This suggests future work may include explicit regionwise breakdowns or augmentations for more granular evaluation.

References

  • Li et al. "HeadsUp! High-Fidelity Portrait Image Super-Resolution." 10 Oct 2025.
  • Ye et al. "Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis." 2024.
