Continuously Controllable Facial Expression Editing in Talking Face Videos
Abstract: Recently, audio-driven talking face video generation has attracted considerable attention. However, little research has addressed the emotional editing of such videos with continuously controllable expressions, despite strong industrial demand. The challenge is that speech-related and emotion-related facial expressions are often highly coupled. Moreover, traditional image-to-image translation methods perform poorly in this setting because expressions are coupled with other attributes such as head pose: translating the expression of the character in each frame may simultaneously change the head pose, owing to biases in the training data distribution. In this paper, we propose a high-quality facial expression editing method for talking face videos that allows the user to continuously control the target emotion in the edited video. We present a new perspective on this task as a special case of motion information editing, where a 3DMM captures major facial movements and an associated texture map, modeled by a StyleGAN, captures appearance details. Both representations (3DMM coefficients and the texture map) carry emotional information, can be continuously modified by neural networks, and are easily smoothed by averaging in the coefficient/latent spaces, making our method simple yet effective. We also introduce a mouth shape preservation loss to control the trade-off between lip synchronization and the degree of exaggeration of the edited expression. Extensive experiments and a user study show that our method achieves state-of-the-art performance across various evaluation criteria.
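To make the continuous-control idea concrete, here is a minimal sketch (not the paper's actual implementation; all function names, tensor shapes, and the `mouth_idx` index set are hypothetical): because both the 3DMM coefficient space and the StyleGAN latent space are linear, editing strength reduces to interpolation between a neutral and an emotional code, temporal smoothing reduces to window averaging, and a mouth shape preservation term penalizes deviation of the edited mouth region from the source.

```python
import torch

def blend_codes(neutral, emotional, alpha):
    """Linearly interpolate 3DMM expression coefficients or StyleGAN
    latents; alpha in [0, 1] gives continuous control of emotion strength."""
    return (1.0 - alpha) * neutral + alpha * emotional

def temporal_smooth(codes, k=5):
    """Smooth per-frame codes of shape (T, D) with a sliding-window mean,
    which is straightforward because both code spaces are linear."""
    pad = k // 2
    padded = torch.cat([codes[:1].repeat(pad, 1), codes,
                        codes[-1:].repeat(pad, 1)], dim=0)
    # unfold produces (T, D, k); averaging the window axis restores (T, D)
    return padded.unfold(0, k, 1).mean(dim=-1)

def mouth_preservation_loss(edited_lmk, source_lmk, mouth_idx):
    """Penalize deviation of the edited mouth landmarks (B, N, 2) from the
    source so lip sync survives the expression edit."""
    return torch.mean((edited_lmk[:, mouth_idx] - source_lmk[:, mouth_idx]) ** 2)
```

In such a setup, raising the weight on the mouth preservation term keeps the edited mouth close to the source (better lip synchronization), while lowering it lets the target emotion reshape the mouth more freely (a more exaggerated expression), matching the trade-off described above.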