We present a 2D landmark-driven 3D facial animation framework that requires no 3D facial dataset for training. Our method decomposes a 3D facial avatar into geometry and texture components. Given 2D landmarks as input, our model learns to estimate FLAME parameters and to translate the target texture into different facial expressions. Experimental results show that our method achieves remarkable results. By taking 2D landmarks as input, our method can potentially be deployed in scenarios where capturing a complete RGB facial image is difficult (e.g., when the face is occluded by a VR head-mounted display).
CCS Concepts
• Computing methodologies → Animation
Keywords
facial animation, 3D avatar, morphable model
ACM Reference Format
Pu Ching, Hung-Kuo Chu, and Min-Chun Hu. 2023. SOFA: Style-based One-shot 3D Facial Animation Driven by 2D landmarks. In International Conference on Multimedia Retrieval (ICMR '23), June 12--15, 2023, Thessaloniki, Greece. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3591106.3592291
The controlling landmark map $L'$ can be obtained from the source facial image $I'$ (or from an additional near-infrared image that captures the partially occluded face) with an off-the-shelf landmark predictor $E_L$.
Given the landmark map $L'$, the facial parameters can be predicted by the facial geometry regressor $E_R$.
Meanwhile, following the one-shot setting described earlier, given a complete facial image $I_0$ of the user, a pre-trained avatar estimator $E_T$ is used to estimate the user's initial facial texture $T_0$, and the landmark predictor $E_L$ is applied to obtain the user's initial landmark map $L_0$.
We propose a style-based texture translator $S_T$ that deforms the initial facial texture $T_0$ into the target texture $T'$ according to the residual style information $\Delta S$ computed between the given landmark maps $L_0$ and $L'$.
Finally, for each source frame $I'$, the avatar generator $D_A$ combines the facial parameters with the texture $T'$ to produce the final avatar $Y$.
2.1 Geometry Regressor ($E_R$ in the figure)
Directly synthesizing an entire vertex-based 3D face model from a single-view image is a highly complex problem.
Inspired by prior work, we adopt FLAME [8] as the morphable model, which takes three kinds of parameters, pose $\theta$, expression $\psi$, and shape $\beta$, to generate a 3D facial mesh. Compared with complex geometric modeling of the entire face, a morphable model such as FLAME offers a representation with fewer degrees of freedom, which allows us to design a lightweight geometry regressor $E_R$ that estimates the FLAME parameters and generates the avatar in real time.
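To make the low-degree-of-freedom idea concrete, the sketch below shows a toy FLAME-style linear morphable model: a mean template plus linear shape and expression offsets. The dimensions and random bases here are illustrative stand-ins, not the real FLAME model (which also applies pose $\theta$ via skinning, omitted here).

```python
import numpy as np

# Toy FLAME-style linear morphable model (NOT the real FLAME bases):
# vertices = mean template + shape basis @ beta + expression basis @ psi.
# Pose theta would be applied afterwards by linear blend skinning (omitted).
V = 5          # number of vertices (the real FLAME mesh has 5023)
N_SHAPE = 4    # number of shape coefficients |beta| (illustrative)
N_EXPR = 3     # number of expression coefficients |psi| (illustrative)

rng = np.random.default_rng(0)
template = rng.standard_normal((V, 3))               # mean face
shape_basis = rng.standard_normal((V, 3, N_SHAPE))   # shape blendshapes
expr_basis = rng.standard_normal((V, 3, N_EXPR))     # expression blendshapes

def morph(beta, psi):
    """Return mesh vertices for the given shape/expression coefficients."""
    return template + shape_basis @ beta + expr_basis @ psi

# Zero coefficients recover the mean face.
assert np.allclose(morph(np.zeros(N_SHAPE), np.zeros(N_EXPR)), template)
```

The whole face is thus controlled by a few dozen coefficients instead of thousands of free vertices, which is what makes a lightweight regressor feasible.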
The geometry regressor $E_R$ estimates only the pose parameters $\theta'$ and the expression parameters $\psi'$.
The shape parameters $\beta'$ are estimated by the avatar estimator $E_T$ from the complete facial image $I_0$.
In Section 3.3, we show that the method performs better when the geometry regressor $E_R$ does not regress the shape parameters.
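A minimal sketch of such a lightweight regressor is a small MLP that maps a flattened landmark map to $(\theta', \psi')$; the layer widths, landmark count, and parameter dimensions below are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

# Illustrative lightweight geometry regressor E_R: a tiny MLP from flattened
# 2D landmarks to pose theta' and expression psi'. All sizes are assumptions.
N_LANDMARKS = 68   # assumed number of 2D landmarks
N_POSE = 6         # assumed |theta'|
N_EXPR = 50        # assumed |psi'|
HIDDEN = 128

rng = np.random.default_rng(1)
W1 = rng.standard_normal((N_LANDMARKS * 2, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, N_POSE + N_EXPR)) * 0.01
b2 = np.zeros(N_POSE + N_EXPR)

def regress(landmarks):
    """landmarks: (N_LANDMARKS, 2) -> (theta', psi'). Note that the shape
    parameters beta' are NOT regressed here; they come from E_T instead."""
    h = np.maximum(landmarks.reshape(-1) @ W1 + b1, 0.0)  # ReLU hidden layer
    out = h @ W2 + b2
    return out[:N_POSE], out[N_POSE:]

theta, psi = regress(rng.standard_normal((N_LANDMARKS, 2)))
assert theta.shape == (N_POSE,) and psi.shape == (N_EXPR,)
```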
2.2 Style-based Texture Translator ($S_T$ in the figure)
The style-based texture translator $S_T$ takes a style code $\Delta S$, the residual information of the landmark maps, to estimate the animated texture map $T'$.
We extract information from the 2D landmark map $L'$ with a mapping network $M$; the output $S'$ contains information about the subject's identity and the source expression.
Similarly, the mapping network $M$ is applied to the 2D landmark map $L_0$ to extract $S_0$, which contains information about the subject's identity and the neutral expression.
To reduce the dependence on subject identity and retain only the expression information, we take the residual of $S'$ and $S_0$ as the style code:

$$\Delta S = S' - S_0 \tag{1}$$
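The intuition behind Eq. (1) can be illustrated with a toy linear mapping network: any component shared by both landmark maps (the identity part) cancels in the residual. The real $M$ is a learned nonlinear network, so this exact cancellation is an idealization.

```python
import numpy as np

# Toy illustration of Eq. (1): for a LINEAR stand-in mapping network M,
# Delta_S = M(L') - M(L0) exactly cancels the identity component shared by
# both landmark maps, leaving only the expression component.
rng = np.random.default_rng(2)
D_IN, D_STYLE = 136, 16                  # assumed dimensions
A = rng.standard_normal((D_STYLE, D_IN))
M = lambda L: A @ L                      # stand-in for the mapping network

identity = rng.standard_normal(D_IN)     # subject-specific component
expression = rng.standard_normal(D_IN)   # source-expression component

L0 = identity                    # neutral landmark map of the subject
L_src = identity + expression    # source landmark map with expression
delta_S = M(L_src) - M(L0)
assert np.allclose(delta_S, M(expression))  # identity component cancels
```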
As shown in Figure 2(b), the texture translator $S_T$ consists of $N$ encoding blocks $\{E_i\}_{i=1}^{N}$ and $N$ style-based stacked warping blocks $\{D_i\}_{i=1}^{N}$, with skip connections similar to a U-Net architecture.
Conditioned on $\Delta S$, each style-based stacked warping block $D_i$ takes the output feature of the previous block $D_{i+1}$ and that of $E_i$ as input.
More specifically, each warping block $D_i$ is a StyleGAN generator with modulated convolution layers, formulated as:

$$f_{D_i} = \mathrm{Upsample}\big(\mathrm{convm}(D_i(f_{D_{i+1}}, f_{E_i}),\ \Delta S)\big) \tag{2}$$
Note that $f_{D_0}$ is the final animated texture map $T'$.
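A simplified sketch of one warping block in Eq. (2) follows: the skip feature is fused with the previous decoder feature, a style-modulated convolution (StyleGAN-style per-input-channel scaling derived from $\Delta S$, followed by demodulation) is applied, and the result is upsampled. The $1{\times}1$ kernel and all shapes are simplifying assumptions, not the paper's exact layers.

```python
import numpy as np

# Simplified warping block D_i from Eq. (2), with a 1x1 modulated conv.
rng = np.random.default_rng(3)
C_IN, C_OUT, H, W, D_STYLE = 8, 8, 4, 4, 16   # illustrative sizes

affine = rng.standard_normal((C_IN, D_STYLE)) * 0.1  # Delta_S -> channel scales
weight = rng.standard_normal((C_OUT, C_IN)) * 0.1    # 1x1 conv kernel

def modulated_conv1x1(x, delta_s):
    """StyleGAN-style modulated convolution: scale the kernel per input
    channel by a style-derived factor, then demodulate (renormalize)."""
    s = affine @ delta_s + 1.0                           # modulation scales
    w = weight * s[None, :]                              # modulate
    w /= np.sqrt((w ** 2).sum(1, keepdims=True) + 1e-8)  # demodulate
    return np.einsum('oc,chw->ohw', w, x)                # 1x1 convolution

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (stand-in for Upsample in Eq. (2))."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

f_prev = rng.standard_normal((C_IN // 2, H, W))   # f_{D_{i+1}}
f_skip = rng.standard_normal((C_IN // 2, H, W))   # f_{E_i} (skip connection)
delta_s = rng.standard_normal(D_STYLE)

f_out = upsample2x(modulated_conv1x1(np.concatenate([f_prev, f_skip]), delta_s))
assert f_out.shape == (C_OUT, 2 * H, 2 * W)
```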
In Section 3.2, we verify that applying PixelShuffle [14] as the upsampling operation improves fine-grained generation quality compared with deconvolution layers. By providing the style code at different receptive fields, the texture translator $S_T$ can generate a global representation with the specified style.
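PixelShuffle [14] with upscale factor $r$ rearranges a $(C r^2, H, W)$ feature map into $(C, Hr, Wr)$, trading channels for spatial resolution rather than learning a transposed convolution. A minimal NumPy sketch (following the standard channel-to-space ordering):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r):
    output[c, h*r + i, w*r + j] = x[c*r*r + i*r + j, h, w]."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(16, dtype=float).reshape(4, 2, 2)  # C*r*r = 4, H = W = 2, r = 2
y = pixel_shuffle(x, 2)
assert y.shape == (1, 4, 4)
```

Because the operation is a fixed rearrangement of already-computed features, it avoids the checkerboard artifacts that strided deconvolution can introduce.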
References
[1] Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques.
[2] Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, Michael Zollhoefer, Shun-Suke Saito, Stephen Lombardi, Shih-En Wei, Danielle Belko, Shoou-I Yu, et al. 2022. Authentic volumetric avatars from a phone scan. ACM Transactions on Graphics (TOG) (2022).
[3] Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. 2021. Learning an animatable detailed 3D face model from in-the-wild images. ACM Transactions on Graphics (ToG) (2021).
[4] Kuangxiao Gu, Yuqian Zhou, and Thomas Huang. 2020. Flnet: Landmark-driven fetching and learning network for faithful talking facial animation synthesis. In Proceedings of the AAAI conference on artificial intelligence.
[5] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. CVPR (2017).
[6] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision. Springer.
[7] Reinhard Knothe, Brian Amberg, Sami Romdhani, Volker Blanz, and Thomas Vetter. 2011. Morphable Models of Faces. In Handbook of Face Recognition. Springer.
[8] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. (2017).
[9] Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep appearance models for face rendering. ACM Transactions on Graphics (ToG) (2018).
[10] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. 2019. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019).
[11] Safa C Medin, Bernhard Egger, Anoop Cherian, Ye Wang, Joshua B Tenenbaum, Xiaoming Liu, and Tim K Marks. 2022. MOST-GAN: 3D morphable StyleGAN for disentangled face image manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence.
[12] Moustafa Meshry, Saksham Suri, Larry S Davis, and Abhinav Shrivastava. 2021. Learned Spatial Representations for Few-shot Talking-Head Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[13] Alexander Richard, Colin Lea, Shugao Ma, Jurgen Gall, Fernando De la Torre, and Yaser Sheikh. 2021. Audio-and gaze-driven facial animation of codec avatars. In Proceedings of the IEEE/CVF winter conference on applications of computer vision.
[14] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition.
[15] Jiale Tao, Biao Wang, Borun Xu, Tiezheng Ge, Yuning Jiang, Wen Li, and Lixin Duan. 2022. Structure-Aware Motion Transfer with Deformable Anchor Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[16] Shih-En Wei, Jason Saragih, Tomas Simon, Adam W Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh. 2019. VR facial animation via multiview image translation. ACM Transactions on Graphics (TOG) (2019).
[17] Zili Yi, Qiang Tang, Vishnu Sanjay Ramiya Srinivasan, and Zhan Xu. 2020. Animating through warping: An efficient method for high-quality facial expression animation. In Proceedings of the 28th ACM international conference on multimedia.
[18] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. 2019. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF international conference on computer vision.
[19] Ruiqi Zhao, Tianyi Wu, and Guodong Guo. 2021. Sparse to dense motion transfer for face image animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.