LIA-X: Interpretable Latent Portrait Animator

Published 13 Aug 2025 in cs.CV | (2508.09959v1)

Abstract: We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous 'warp-render' approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable 'edit-warp-render' strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation.

Summary

  • The paper introduces a scalable autoencoder featuring a Sparse Motion Dictionary for interpretable and controlled facial motion transfer.
  • It employs an edit-warp-render strategy that enables precise manipulation of head pose and facial expressions in both self- and cross-reenactment tasks.
  • Comparative evaluations show superior performance over state-of-the-art methods, with scalability to nearly 1B parameters on large-scale datasets.

Introduction

The LIA-X framework introduces a scalable, interpretable approach to portrait animation, enabling fine-grained control over facial dynamics transferred from a driving video to a source portrait. Unlike prior methods that rely on explicit structure representations or dense, entangled latent codes, LIA-X leverages a Sparse Motion Dictionary within an autoencoder architecture to disentangle and control facial motion semantics. This design supports an "edit-warp-render" pipeline, allowing for precise alignment and manipulation of head pose and facial expression prior to animation. The model is demonstrated to scale to approximately 1 billion parameters and is trained on a diverse, large-scale dataset, achieving superior performance in both self-reenactment and cross-reenactment tasks.

Figure 1: Overview of LIA-X architecture, highlighting the encoder, optical-flow generator, rendering network, and the sparsity-constrained motion dictionary.

Methodology

Architecture and Sparse Motion Dictionary

LIA-X builds upon the Latent Image Animator (LIA) autoencoder, comprising an encoder $E$, an optical-flow generator $G_f$, and a rendering network $G_r$. The key innovation is the Sparse Motion Dictionary $D_m$, a set of motion vectors with enforced sparsity via an $L_1$ penalty on the motion coefficients $\mathcal{A}_{r\rightarrow d}$. This constraint encourages the model to reconstruct each driving image using a minimal subset of motion vectors, promoting disentanglement and interpretability of facial dynamics.

The animation process is formalized as linear navigation in latent space:

$$z_{s\rightarrow d} = z_{s\rightarrow r} + \sum_{i=1}^{M} a_i \mathbf{d}_i$$

where $z_{s\rightarrow r}$ is the transformation from source to reference image, and $a_i$ are the sparse coefficients for each motion vector $\mathbf{d}_i$.

Figure 2: Sparsity analysis comparing activations of motion vectors with and without the sparsity constraint, demonstrating selective activation in LIA-X.
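The linear navigation formula above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`navigate`, `l1_penalty`, `active_indices`), the toy 4-dimensional latent code, and the 3-vector dictionary are all assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
from typing import List

Vector = List[float]


def navigate(z_s_to_r: Vector, coeffs: List[float], dictionary: List[Vector]) -> Vector:
    """Compute z_{s->d} = z_{s->r} + sum_i a_i * d_i for a toy dictionary."""
    if len(coeffs) != len(dictionary):
        raise ValueError("one coefficient per motion vector is required")
    z = list(z_s_to_r)
    for a_i, d_i in zip(coeffs, dictionary):
        if a_i == 0.0:
            continue  # an inactive motion vector contributes nothing
        for k, d_ik in enumerate(d_i):
            z[k] += a_i * d_ik
    return z


def l1_penalty(coeffs: List[float]) -> float:
    """L1 norm of the coefficients; penalizing it pushes most a_i toward zero."""
    return sum(abs(a) for a in coeffs)


def active_indices(coeffs: List[float], tol: float = 1e-6) -> List[int]:
    """Indices of motion vectors whose coefficient magnitude exceeds tol."""
    return [i for i, a in enumerate(coeffs) if abs(a) > tol]


# Toy values (assumed): M = 3 motion vectors in a 4-dimensional latent space.
toy_dictionary = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
toy_z_s_to_r = [0.5, 0.5, 0.5, 0.5]
toy_coeffs = [2.0, 0.0, -1.0]  # sparse: the second vector is unused

toy_z_s_to_d = navigate(toy_z_s_to_r, toy_coeffs, toy_dictionary)
```

With these toy values only two of the three motion vectors are active, which is the kind of selective activation the sparsity analysis in Figure 2 describes.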

Edit-Warp-Render Strategy

The interpretable nature of the Sparse Motion Dictionary enables a controllable "edit-warp-render" pipeline. Users can manipulate the source portrait by adjusting specific motion vectors to align pose and expression with the driving frame before applying motion transfer. This is particularly effective in cross-identity scenarios with large initial discrepancies.
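A minimal sketch of the "edit" step followed by motion transfer in latent space is given below. It assumes, purely for illustration, that editing amounts to adding user-chosen offsets along selected dictionary vectors before the driving coefficients are applied; the summary does not specify how LIA-X computes the edited code. The helper names (`add_scaled`, `edit_source`, `drive`) and the 2-dimensional toy dictionary are hypothetical, and the flow-generation and rendering stages are omitted.

```python
from typing import Dict, List

Vector = List[float]


def add_scaled(z: Vector, scale: float, direction: Vector) -> Vector:
    """Return z + scale * direction without mutating the input."""
    return [z_k + scale * d_k for z_k, d_k in zip(z, direction)]


def edit_source(z_source: Vector, edits: Dict[int, float], dictionary: List[Vector]) -> Vector:
    """'Edit' step (assumed form): shift the source code along chosen motion vectors."""
    z = list(z_source)
    for index, amount in edits.items():
        z = add_scaled(z, amount, dictionary[index])
    return z


def drive(z_edited: Vector, driving_coeffs: List[float], dictionary: List[Vector]) -> Vector:
    """Apply driving coefficients to the edited code via linear navigation."""
    z = list(z_edited)
    for a_i, d_i in zip(driving_coeffs, dictionary):
        z = add_scaled(z, a_i, d_i)
    return z


# Toy setup (assumed): index 0 stands in for a pose-like factor,
# index 1 for an expression-like factor.
toy_dict = [[1.0, 0.0], [0.0, 1.0]]
z_src = [0.0, 0.0]

z_aligned = edit_source(z_src, {0: 0.5}, toy_dict)   # align pose before animating
z_final = drive(z_aligned, [0.25, -1.0], toy_dict)   # then apply driving motion
```

In a full pipeline the resulting code would be passed to the optical-flow generator and rendering network; here the point is only that the edit happens before motion transfer, so a large initial pose gap can be reduced first.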

Training and Scalability

LIA-X is trained in a self-supervised manner using a composite loss: $L_1$ reconstruction, VGG-based perceptual, adversarial, and sparsity penalties. The architecture incorporates advanced residual blocks inspired by StyleGAN-T, facilitating scalability to approximately 1B parameters. Training utilizes a mixture of public and internal datasets, totaling 0.5M sequences and 94M frames (roughly 188 frames per sequence on average), with gradient accumulation on 8 A100 GPUs.
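One plausible way to write the composite objective is as a weighted sum of the four terms listed above. The weights $\lambda_{\text{vgg}}$, $\lambda_{\text{adv}}$, and $\lambda_{\text{sp}}$ and the term names are assumptions introduced here for illustration; the summary does not give their values or the exact form of each term.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{rec}}
  + \lambda_{\text{vgg}} \, \mathcal{L}_{\text{vgg}}
  + \lambda_{\text{adv}} \, \mathcal{L}_{\text{adv}}
  + \lambda_{\text{sp}} \, \lVert \mathcal{A}_{r\rightarrow d} \rVert_1
```

Here $\mathcal{L}_{\text{rec}}$ denotes the $L_1$ reconstruction loss between the generated and driving images, and the last term is the sparsity penalty on the motion coefficients $\mathcal{A}_{r\rightarrow d}$.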

Interpretability and Controllability

Empirical analysis reveals that the sparse motion vectors correspond to human-interpretable facial attributes and 3D-aware transformations. Manipulating individual vectors enables control over yaw, pitch, roll, and fine-grained semantic attributes such as mouth, eyes, and eyebrows.

Figure 3: LIA-X enables 3D-aware portrait manipulation (yaw, pitch, roll) by adjusting corresponding motion vectors.

Figure 4: Image editing capabilities of LIA-X, demonstrating control over fine-grained semantic attributes via motion vector manipulation.

The disentanglement achieved by the sparse dictionary allows for compositional editing, supporting complex user-guided image and video manipulations without explicit 3D representations.

Figure 5: 3D-aware video manipulation, showing seamless rotation of head pose in real-world video while preserving identity.

Comparative Evaluation

LIA-X is evaluated against state-of-the-art GAN-based and diffusion-based methods (FOMM, TPS, DaGAN, MCNet, X-Portrait, LivePortrait) on self-reenactment and cross-reenactment tasks. Quantitative metrics (L1, LPIPS, SSIM, PSNR, FID, Identity Similarity, Image Quality) consistently favor LIA-X, with notable improvements in high-resolution ($512\times512$) settings and challenging cross-identity scenarios.

Figure 6: Qualitative comparison on cross-reenactment, illustrating LIA-X's superior ability to handle large pose and expression variations via pre-animation editing.
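Two of the listed reconstruction metrics, L1 and PSNR, are simple enough to sketch directly. The snippet below uses their standard definitions on flattened pixel values in [0, 1]; it is not the paper's evaluation code, and the function names and toy pixel values are assumptions. The learned or model-based metrics (LPIPS, FID, Identity Similarity) require pretrained networks and are not shown.

```python
import math
from typing import Sequence


def mean_abs_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """L1 metric: mean absolute difference between pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "images" (assumed values) that differ in one pixel.
generated = [0.0, 0.5, 1.0, 0.5]
ground_truth = [0.0, 0.5, 0.5, 0.5]

l1_score = mean_abs_error(generated, ground_truth)   # lower is better
psnr_score = psnr(generated, ground_truth)           # higher is better
```

In self-reenactment the ground-truth frame is available, so metrics like these apply directly; in cross-reenactment there is no ground truth, which is why identity and image-quality scores are reported instead.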

Scaling analysis demonstrates that increasing model size from 0.05B to 0.9B parameters yields performance gains, though improvements saturate beyond 0.3B, suggesting dataset size as a limiting factor for further scaling.

Practical and Theoretical Implications

LIA-X's interpretable latent space and controllable animation pipeline have direct applications in entertainment, e-education, digital human creation, and user-guided editing. The framework's scalability and efficient inference (relative to diffusion models) position it as a practical complement to current generative approaches. The sparse dictionary paradigm may inform future research in disentangled representation learning and interpretable generative modeling.

Theoretically, LIA-X demonstrates that sparse coding principles can be effectively integrated into large-scale, self-supervised generative models to achieve both scalability and interpretability. The linear navigation formulation provides a unified framework for motion transfer, editing, and 3D-aware manipulation.

Limitations and Future Directions

Current limitations include fixed resolution support and reliance on convolutional architectures, which may constrain further scaling. Future work should explore dynamic resolution techniques and transformer-based architectures (e.g., DiT) for enhanced scalability and generalization. Expanding the training dataset could further improve model capacity and performance.

Conclusion

LIA-X advances portrait animation by integrating a Sparse Motion Dictionary into a scalable autoencoder, enabling interpretable, controllable, and high-quality motion transfer. The framework outperforms prior methods across benchmarks and supports diverse editing applications. Its design principles (sparse, disentangled latent codes and scalable architectures) are likely to influence future research in interpretable video generation and user-controllable generative models.
