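We propose a 2D landmark-driven 3D facial animation framework that can be trained without a 3D facial dataset. Our method decomposes the 3D facial avatar into a geometry part and a texture part. Given 2D landmarks as input, our model learns to estimate the parameters of FLAME and to translate the target texture into different facial expressions. Experimental results show that our method achieves remarkable results. Because it takes 2D landmarks as input, our method has the potential to be deployed in scenarios where a full RGB face image is hard to obtain (for example, when the face is occluded by a VR head-mounted display).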
CCS Concepts
• Computing methodologies → Animation
Keywords
Facial animation, 3D avatar, morphable model
ACM Reference Format
Pu Ching, Hung-Kuo Chu, and Min-Chun Hu. 2023. SOFA: Style-based One-shot 3D Facial Animation Driven by 2D landmarks. In International Conference on Multimedia Retrieval (ICMR '23), June 12--15, 2023, Thessaloniki, Greece. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3591106.3592291
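The controlling landmark map $L'$ can be obtained with an off-the-shelf landmark predictor $E_L$ from the source face image $I'$ (or from additional near-infrared images that capture the partially occluded face).
Given the landmark map $L'$, the facial parameters are predicted by the facial geometry regressor $E_R$.
Meanwhile, following the one-shot setting described earlier, given a full face image of the user $I_0$, the pre-trained avatar estimator $E_T$ estimates the user's initial facial texture $T_0$, and the landmark predictor $E_L$ is applied to obtain the user's initial landmark map $L_0$.
We propose a style-based texture translator $S_T$ that warps the initial facial texture $T_0$ into the target texture $T'$ according to the given landmark maps $L_0$ and $L'$; it does so using the residual information $\Delta S$ computed between the two landmark maps.
Finally, for each source frame $I'$, the avatar generator $D_A$ combines the facial parameters with the texture $T'$ to produce the final avatar $Y$.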
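The data flow described above can be summarized in a minimal Python sketch. The function names `one_shot_setup` and `animate_frame`, the call signatures of the components, and every array shape in the placeholder stubs are assumptions made for illustration; the real components are trained networks whose interfaces are not specified here.

```python
import numpy as np

def one_shot_setup(I0, E_L, E_T):
    """Run once per user on the full face image I0."""
    L0 = E_L(I0)           # initial landmark map L_0
    T0, beta = E_T(I0)     # initial texture T_0 and shape parameters
    return L0, T0, beta

def animate_frame(I_src, L0, T0, beta, E_L, E_R, S_T, D_A):
    """Run for every source frame I'."""
    L_drv = E_L(I_src)              # controlling landmark map L'
    theta, psi = E_R(L_drv)         # pose and expression parameters
    T_drv = S_T(T0, L0, L_drv)      # translated texture T'
    return D_A(theta, psi, beta, T_drv)  # final avatar Y

# Placeholder stubs (hypothetical shapes) so that the sketch executes.
def E_L(img):
    return np.zeros((68, 2)) + img.mean()

def E_T(img):
    return np.zeros((8, 8, 3)), np.zeros(10)

def E_R(landmarks):
    return np.zeros(6), np.zeros(50)

def S_T(T0, L0, L_drv):
    return T0 + (L_drv - L0).mean()

def D_A(theta, psi, beta, texture):
    return {"pose": theta, "expression": psi, "shape": beta, "texture": texture}

I0 = np.zeros((16, 16, 3))      # stand-in for the full user face image
I_src = np.ones((16, 16, 3))    # stand-in for one source frame
L0, T0, beta = one_shot_setup(I0, E_L, E_T)
Y = animate_frame(I_src, L0, T0, beta, E_L, E_R, S_T, D_A)
```

The split into a one-time setup and a per-frame step mirrors the one-shot setting: only the landmark predictor, the geometry regressor, the texture translator, and the avatar generator run on each frame.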
2.1 Geometry Regressor
Geometry Regressor ($E_R$ in the figure)
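Directly synthesizing an entire vertex-based 3D face model from a single-view image is a very complex problem.
Inspired by prior work, we adopt FLAME [8] as the morphable model. It requires three kinds of parameters to generate a 3D face mesh:
• pose $\theta$
• expression $\psi$
• shape $\beta$
Compared with complex geometric modeling of the whole face, a morphable model such as FLAME has the advantage of a representation with few degrees of freedom, which lets us design a lightweight geometry regressor $E_R$ that estimates the FLAME parameters and generates the avatar in real time.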
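The geometry regressor $E_R$ estimates only the pose parameters $\theta'$ and the expression parameters $\psi'$.
The shape parameters $\beta'$ are estimated by the avatar estimator $E_T$ from the full user face image $I_0$.
In Section 3.3, we show that the geometry regressor $E_R$ performs better when the shape parameters are excluded from the regression.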
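To make the idea of a lightweight regressor concrete, the following sketch maps landmarks to pose and expression parameters with a two-layer perceptron. It is an illustration under assumptions, not the architecture used in this work: the landmarks are taken to be 68 (x, y) coordinates rather than an image-like map, the parameter sizes (6 pose, 50 expression) and the hidden width are chosen arbitrarily, and the weights are random rather than trained.

```python
import numpy as np

# Illustrative sizes only; none of these values come from the paper.
N_LANDMARKS, POSE_DIM, EXPR_DIM, HIDDEN = 68, 6, 50, 128

def init_regressor(rng):
    """Create random (untrained) weights for a two-layer perceptron."""
    out_dim = POSE_DIM + EXPR_DIM
    return {
        "W1": rng.normal(0.0, 0.01, (N_LANDMARKS * 2, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "W2": rng.normal(0.0, 0.01, (HIDDEN, out_dim)),
        "b2": np.zeros(out_dim),
    }

def geometry_regressor(params, landmarks):
    """Regress pose and expression only; shape is not predicted here."""
    x = landmarks.reshape(-1)                              # flatten (68, 2) to (136,)
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])   # ReLU hidden layer
    out = h @ params["W2"] + params["b2"]
    return out[:POSE_DIM], out[POSE_DIM:]                  # (theta', psi')

rng = np.random.default_rng(0)
params = init_regressor(rng)
landmarks = rng.uniform(0.0, 1.0, (N_LANDMARKS, 2))
theta, psi = geometry_regressor(params, landmarks)
```

Leaving the shape parameters out of the output vector reflects the design described above, where shape comes from the avatar estimator applied to the full face image.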
2.2 Style-based Texture Translator
Style-based Texture Translator ($S_T$ in the figure)
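The style-based texture translator $S_T$ takes a style code $\Delta S$, the residual information of the landmark maps, and uses it to estimate the animated texture map $T'$.
We extract information from the 2D landmark map $L'$ with a mapping network $M$; the output $S'$ encodes the subject identity and the source expression.
Similarly, the mapping network $M$ is applied to the 2D landmark map $L_0$ to extract $S_0$, which encodes the subject identity and the neutral expression.
To reduce the dependence on subject identity and retain only expression information, we take the residual between $S'$ and $S_0$ as the style code: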
$$\Delta S = S' - S_0 \tag{1}$$
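Equation (1) can be sketched in a few lines. The mapping network below is a hypothetical two-layer perceptron with random weights, and the landmark and style dimensions are assumptions; the only property taken from the text is that the same network $M$ is applied to both landmark maps and the two outputs are subtracted.

```python
import numpy as np

def make_mapping_network(rng, in_dim, style_dim, hidden=64):
    """Build a toy mapping network M with fixed random weights."""
    W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
    W2 = rng.normal(0.0, 0.1, (hidden, style_dim))

    def M(landmarks):
        h = np.tanh(landmarks.reshape(-1) @ W1)
        return h @ W2

    return M

def style_code(M, L_drv, L0):
    """Residual style code: Delta S = M(L') - M(L_0)."""
    return M(L_drv) - M(L0)

rng = np.random.default_rng(0)
M = make_mapping_network(rng, in_dim=68 * 2, style_dim=32)
L0 = rng.uniform(0.0, 1.0, (68, 2))             # neutral-expression landmarks
L_drv = L0 + rng.normal(0.0, 0.02, L0.shape)    # landmarks of a source frame
delta_s = style_code(M, L_drv, L0)
```

Because the same network processes both inputs, the style code is exactly zero when the driving landmarks equal the neutral ones, so whatever the two codes share is removed by the subtraction.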
As shown in Figure 2(b), the texture translator $S_T$ consists of $N$ encoding blocks $\{E_i\}_{i=1}^{N}$ and $N$ style-based stacked warping blocks $\{D_i\}_{i=1}^{N}$, with skip connections similar to the U-Net architecture.
Conditioned on $\Delta S$, each style-based stacked warping block $D_i$ takes the output features of the preceding block $D_{i+1}$ and of the encoding block $E_i$ as input.
More specifically, each warping block $D_i$ is a StyleGAN generator with a modulated convolution layer, formulated as:
$$f_{D_i} = \mathrm{Upsample}\big(\mathrm{convm}\big(D_i(f_{D_{i+1}}, f_{E_i}),\ \Delta S\big)\big) \tag{2}$$
Note that $f_{D_0}$ is the final animated texture map $T'$.
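One step of Equation (2) can be illustrated as follows. Several choices here are assumptions rather than details from the text: the fusion $D_i(\cdot,\cdot)$ is replaced by channel concatenation, the modulated convolution uses a 1x1 kernel with StyleGAN2-style weight modulation and demodulation, the style code is turned into per-channel scales by a hypothetical affine matrix `A`, and nearest-neighbor repetition stands in for the upsampling operator.

```python
import numpy as np

def modulated_conv1x1(x, weight, style, eps=1e-8):
    """x: (C_in, H, W); weight: (C_out, C_in); style: (C_in,) channel scales."""
    w = weight * style[None, :]                                  # modulate
    w = w / np.sqrt((w ** 2).sum(axis=1, keepdims=True) + eps)   # demodulate
    return np.einsum("oc,chw->ohw", w, x)                        # 1x1 convolution

def upsample_nearest(x, r=2):
    """Repeat every pixel r times along height and width."""
    return x.repeat(r, axis=1).repeat(r, axis=2)

def warp_block(f_prev, f_skip, delta_s, params):
    """Toy version of f_{D_i} = Upsample(convm(D_i(f_{D_{i+1}}, f_{E_i}), Delta S))."""
    fused = np.concatenate([f_prev, f_skip], axis=0)   # stand-in for D_i(., .)
    style = params["A"] @ delta_s + 1.0                # Delta S to channel scales
    out = modulated_conv1x1(fused, params["W"], style)
    return upsample_nearest(out)

rng = np.random.default_rng(0)
C, H, W, STYLE_DIM, C_OUT = 4, 3, 3, 8, 5              # illustrative sizes
params = {
    "A": rng.normal(0.0, 0.1, (2 * C, STYLE_DIM)),
    "W": rng.normal(0.0, 1.0, (C_OUT, 2 * C)),
}
f_prev = rng.normal(size=(C, H, W))     # features from the preceding warping block
f_skip = rng.normal(size=(C, H, W))     # skip features from the encoding block
delta_s = rng.normal(size=STYLE_DIM)
f_out = warp_block(f_prev, f_skip, delta_s, params)    # shape (C_OUT, 2H, 2W)
```

Stacking $N$ such blocks, each fed the same $\Delta S$, injects the style code at several spatial resolutions.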
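In Section 3.2, we verify that using PixelShuffle [14] as the upsampling operation improves fine-grained generation quality compared with a deconvolution layer. By providing the style code at different receptive fields, the texture translator $S_T$ can generate a global representation with the specified style.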
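For readers unfamiliar with PixelShuffle, the rearrangement it performs can be written directly in NumPy. This sketch covers only the channel-to-space step for a single feature map of shape (C·r², H, W); in a sub-pixel upsampling layer a convolution would normally come first to produce the C·r² channels. The function name and the toy input are illustrative.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, row offset, col offset)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, row offset, w, col offset)
    return x.reshape(c, h * r, w * r)

# Four channels at one pixel become a single 2x2 patch.
x = np.arange(4).reshape(4, 1, 1)
y = pixel_shuffle(x, 2)
print(y.tolist())  # [[[0, 1], [2, 3]]]
```

Each output pixel is copied from exactly one input value, so the operation itself has no learned parameters; all learning happens in the layers that produce its input.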
1 Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques.
2 Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, Michael Zollhoefer, Shunsuke Saito, Stephen Lombardi, Shih-En Wei, Danielle Belko, Shoou-I Yu, et al. 2022. Authentic volumetric avatars from a phone scan. ACM Transactions on Graphics (TOG) (2022).
3 Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. 2021. Learning an animatable detailed 3D face model from in-the-wild images. ACM Transactions on Graphics (TOG) (2021).
4 Kuangxiao Gu, Yuqian Zhou, and Thomas Huang. 2020. FLNet: Landmark-driven fetching and learning network for faithful talking facial animation synthesis. In Proceedings of the AAAI conference on artificial intelligence.
5 Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. CVPR (2017).
6 Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision. Springer.
7 Reinhard Knothe, Brian Amberg, Sami Romdhani, Volker Blanz, and Thomas Vetter. 2011. Morphable Models of Faces. In Handbook of Face Recognition. Springer.
8 Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (TOG) (2017).
9 Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG) (2018).
10 Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. 2019. MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019).
11 Safa C Medin, Bernhard Egger, Anoop Cherian, Ye Wang, Joshua B Tenenbaum, Xiaoming Liu, and Tim K Marks. 2022. MOST-GAN: 3D morphable StyleGAN for disentangled face image manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence.
12 Moustafa Meshry, Saksham Suri, Larry S Davis, and Abhinav Shrivastava. 2021. Learned Spatial Representations for Few-shot Talking-Head Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
13 Alexander Richard, Colin Lea, Shugao Ma, Jurgen Gall, Fernando De la Torre, and Yaser Sheikh. 2021. Audio- and gaze-driven facial animation of codec avatars. In Proceedings of the IEEE/CVF winter conference on applications of computer vision.
14 Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition.
15 Jiale Tao, Biao Wang, Borun Xu, Tiezheng Ge, Yuning Jiang, Wen Li, and Lixin Duan. 2022. Structure-Aware Motion Transfer with Deformable Anchor Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
16 Shih-En Wei, Jason Saragih, Tomas Simon, Adam W Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh. 2019. VR facial animation via multiview image translation. ACM Transactions on Graphics (TOG) (2019).
17 Zili Yi, Qiang Tang, Vishnu Sanjay Ramiya Srinivasan, and Zhan Xu. 2020. Animating through warping: An efficient method for high-quality facial expression animation. In Proceedings of the 28th ACM international conference on multimedia.
18 Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. 2019. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF international conference on computer vision.
19 Ruiqi Zhao, Tianyi Wu, and Guodong Guo. 2021. Sparse to dense motion transfer for face image animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.