Context-aware Talking Face Video Generation
Abstract: In this paper, we consider a novel and practical setting for talking face video generation. Specifically, we focus on scenarios involving multi-person interactions, where a talking context, such as an audience or surroundings, is present. In these situations, video generation should take the context into account so that the generated content is naturally aligned with the driving audio and spatially coherent with the context. To achieve this, we propose a two-stage, cross-modal controllable video generation pipeline that uses facial landmarks as an explicit and compact control signal to bridge the driving audio, the talking context, and the generated video. Within this pipeline, we devise a 3D video diffusion model that enables efficient control over both spatial conditions (landmarks and context video) and the audio condition, yielding temporally coherent generation. Experimental results verify the advantage of the proposed method over other baselines in terms of audio-video synchronization, video fidelity, and frame consistency.
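The data flow of the two-stage pipeline can be sketched as follows. This is a minimal illustrative skeleton only: all tensor shapes, function names, and the placeholder "models" are assumptions for exposition, not the authors' implementation. Stage 1 maps the driving audio to per-frame facial landmarks (the explicit, compact control signal); stage 2 is a 3D (spatio-temporal) video diffusion model conditioned on the landmarks, the context video, and the audio.

```python
import numpy as np

# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Shapes and function names are illustrative placeholders.
rng = np.random.default_rng(0)
T, N_LMK, H, W = 16, 68, 64, 64  # frames, landmarks per face, frame size

def stage1_audio_to_landmarks(audio_feat):
    """Stage 1: regress per-frame facial landmarks from the driving audio.
    The landmarks serve as the compact cross-modal control signal."""
    # A real model would predict these from the audio features.
    return rng.standard_normal((audio_feat.shape[0], N_LMK, 2))

def stage2_video_diffusion(landmarks, context_video, audio_feat):
    """Stage 2: a 3D (spatio-temporal) video diffusion model conditioned on
    landmarks, context video, and audio. Placeholder output only."""
    assert landmarks.shape[0] == context_video.shape[0] == audio_feat.shape[0]
    return rng.standard_normal((landmarks.shape[0], H, W, 3))

audio_feat = rng.standard_normal((T, 80))          # e.g. per-frame mel features
context_video = rng.standard_normal((T, H, W, 3))  # surrounding scene frames

landmarks = stage1_audio_to_landmarks(audio_feat)
video = stage2_video_diffusion(landmarks, context_video, audio_feat)
print(landmarks.shape, video.shape)  # (16, 68, 2) (16, 64, 64, 3)
```

The intermediate landmark representation is what lets the spatial conditions (context video) and the temporal condition (audio) be handled by a single conditional generator in stage 2.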