Semi-Siamese Training for Shallow Face Learning

Authors: Hang Du, Hailin Shi, Yuchi Liu, Jun Wang, Zhen Lei, Dan Zeng
Affiliation: NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Contact: {duhang, dzeng}@shu.edu.cn, {shihailin, wangjun492, tmei}@jd.com
Project page: github.com/JDAI-CV/Fac...

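Abstract

Most existing public face datasets, such as MS-Celeb-1M and VGGFace2, provide abundant information for training in both breadth (a large number of IDs) and depth (a sufficient number of samples per ID). In many real-world face recognition scenarios, however, the training dataset is limited in depth, i.e. only two face images are available for each ID. We define this situation as Shallow Face Learning and find it problematic for existing training methods. Unlike deep face data, shallow face data lacks intra-class diversity. This can lead to a collapse of the feature dimensions, so that the learned network easily degenerates and over-fits in the collapsed dimensions. In this paper, we aim to address the problem by introducing a novel training method named Semi-Siamese Training (SST). A pair of Semi-Siamese networks constitutes the forward propagation structure, and the training loss is computed with an updating gallery queue, which allows effective optimization on shallow training data. Our method has no extra dependencies, so it can be flexibly integrated with existing loss functions and network architectures. Extensive experiments on various face recognition benchmarks show that the proposed method significantly improves training, not only in shallow face learning but also on conventional deep face data.

Keywords: face recognition, shallow face learning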

1 Introduction

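Face recognition (FR) has made remarkable progress over the past few years and is now widely deployed. This can be attributed to three factors: convolutional neural networks (CNNs) [26, 15, 31, 16], loss functions [29, 28, 37, 23, 44, 36], and large-scale training datasets [40, 12, 18, 1]. In recent years, the commonly used public training datasets, such as CASIA-WebFace [40], MS-Celeb-1M [12] and VGGFace2 [1], have provided abundant information not only in breadth (a large number of IDs) but also in depth (dozens of face images per ID). In this paper, we call this type of dataset deep face data. Unfortunately, such deep face data is not available in many real-world scenarios. Training usually runs into the problem of "shallow face data", in which only two face images are available for each ID (typically a registration photo and a spot photo, the so-called "gallery" and "probe"). Such data lacks intra-class diversity, which prevents the network from being optimized effectively and leads to a collapse of the feature dimensions. In this situation, we find that existing training methods suffer from either model degeneration or over-fitting.

Dimensional collapse: the learned embeddings all lie in a subspace. For example, a model asked to learn a 3-dimensional embedding may cut corners and learn one that is nominally 3-dimensional but only uses 2 dimensions, so the third is wasted. Even when the collapse is not total, it makes embeddings harder to tell apart: an extra usable dimension makes the space sparser, so vectors are less crowded, a downstream classifier can separate different embeddings more easily, and generalization improves.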
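To make the note on dimensional collapse concrete, here is a minimal sketch that counts how many coordinates of a set of embeddings actually vary. The data, the function names and the variance threshold are all made up for illustration, and the check only detects axis-aligned collapse; a collapse onto a rotated subspace would need a rank or singular-value analysis instead.

```python
import random

def per_dimension_variance(embeddings):
    """Return the variance of each coordinate across a list of equal-length vectors."""
    n = len(embeddings)
    dims = len(embeddings[0])
    variances = []
    for d in range(dims):
        column = [vec[d] for vec in embeddings]
        mean = sum(column) / n
        variances.append(sum((v - mean) ** 2 for v in column) / n)
    return variances

def count_active_dimensions(embeddings, tol=1e-8):
    """Count coordinates whose variance exceeds tol (a crude proxy for the used subspace)."""
    return sum(1 for v in per_dimension_variance(embeddings) if v > tol)

rng = random.Random(0)
# Hypothetical 3-D embeddings whose third coordinate never varies: one collapsed dimension.
collapsed = [[rng.gauss(0, 1), rng.gauss(0, 1), 0.0] for _ in range(100)]
# Hypothetical 3-D embeddings that use all three coordinates.
healthy = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(100)]

print(count_active_dimensions(collapsed))  # third coordinate has zero variance
print(count_active_dimensions(healthy))
```

The collapsed set only spreads over two coordinates, so any two embeddings can differ in at most two directions, which is the crowding effect described above.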

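In this paper, we regard training on shallow face data as a specific task, termed Shallow Face Learning (SFL). SFL resembles an existing problem in face recognition, low-shot learning (LSL), but the two differ in two significant ways. First, LSL performs closed-set recognition [11, 38, 3, 34], whereas SFL includes open-set recognition, in which the test IDs are excluded from the training IDs. Second, LSL requires pretraining in a source domain (with deep data) before finetuning on the target domain [47, 3, 41]. Pretraining, however, is not always a good choice for practical face recognition development, for the following reasons:

(1) Once pretraining is finished, the network architecture is fixed, so it is inconvenient to change the architecture during finetuning.
(2) Deploying a new architecture requires restarting from pretraining, and pretraining is usually time-consuming.
(3) There is a domain gap between the pretraining data and the finetuning data, so finetuning still suffers from the shallow data problem.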

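Therefore, SFL argues for training directly on shallow face data from scratch.

In short, the goal of Shallow Face Learning is effective training from scratch on shallow face data for open-set face recognition. We review the current methods and study how they are affected by the shallow data problem. In recent years, most mainstream deep face recognition methods [21, 33, 32, 7, 20] have been developed from classification learning based on softmax or its variants. They are built on a fully connected (FC) layer, the softmax function and the cross-entropy loss. The weights of the FC layer can be regarded as prototypes that represent the center of each class. The learning objective is to maximize the predicted probability on the ground-truth class. This approach has strong learning and discriminative ability on deep data. However, because shallow data leads to an extreme lack of intra-class information, we find, as shown in Section 3.1, that this kind of training results in either model degeneration or over-fitting.

The other major routine in face recognition is embedding learning [6, 13, 28, 24, 27], which can learn face representations without a classification layer. For example, the contrastive loss [28] and the triplet loss [24] compute Euclidean distances between samples and optimize the model over the sample relations. In general, embedding learning performs better than classification learning when the data becomes shallow. The likely reason is that embedding learning compares features between samples instead of classifying them into specific classes through a layer with a large number of parameters.

However, the performance and efficiency of embedding learning depend on the number of sample pairs matched within a batch, which is limited by GPU memory and by the hard-sample mining strategy. In this paper, we want to exploit the advantage of embedding learning to make classification learning succeed on shallow data. If model degeneration and over-fitting are resolved, training will benefit greatly from the capability and efficiency of classification learning. A straightforward solution is simply to combine the two routines: use sample features as prototypes to initialize the FC weights and then run classification learning with them. A similar modification has been proposed in earlier work [47]. Specifically, for each ID of the shallow data, one photo serves as the initial prototype and the other as the training sample. Yet this prototype initialization brings only limited improvement when training on shallow data (e.g. DP-softmax in Section 4.3). To explain this result, we hypothesize that the prototype becomes too similar to its intra-class training sample, which makes the gradients extremely small and hinders optimization.

To address this issue, we propose to improve the training method from the perspective of enlarging intra-class diversity. Take the contrastive loss or the triplet loss as an example: the features are extracted by the backbone. The backbone can be regarded as a pair (or a triplet) of Siamese networks, since the parameters are fully shared between the networks. We find that the key to solving the problem is to force the backbone to be Semi-Siamese, meaning that the two networks have close (but not identical) parameters. One network extracts features from the gallery as prototypes, and the other extracts features from the probe as training samples. For every ID in training, the intra-class diversity between the features is guaranteed by the difference between the networks.

There are many ways to constrain the difference between the two networks. For example, one can add a network constraint between the parameters during stochastic gradient descent (SGD) updating, or update one network with SGD and the other with a moving average (like the momentum approach proposed in [14]). We conduct extensive experiments and find that all of these approaches are effective for shallow face learning. Furthermore, combining the Semi-Siamese backbone with an updating queue of feature-based prototypes (i.e. the gallery queue) yields a significant improvement in shallow face learning. We name this training scheme Semi-Siamese Training; it can be integrated with any existing loss function and network architecture. As shown in Section 4.3, whichever loss function is used, the proposed method brings a large improvement in shallow face learning.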

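In addition, we conduct two extra experiments to demonstrate the advantages of SST in a wider scope.

(1) Although SST is proposed for the shallow data problem, experiments on conventional deep data show that SST still achieves leading performance.
(2) Another experiment, which verifies the effectiveness of SST in a practical scenario using a pretrain-finetune setting, also shows that SST outperforms conventional training.

In summary, this paper makes the following contributions:

(1) We formally describe a critical problem in face recognition, Shallow Face Learning, which severely affects face recognition training. The problem exists in many real-world scenarios but has been overlooked before.
(2) We study the Shallow Face Learning problem through thorough experiments and find that the lack of intra-class diversity hinders optimization and leads to a collapse of the feature space. In this situation, the model suffers from degeneration and over-fitting during training.
(3) We propose the Semi-Siamese Training (SST) method to address the issues in Shallow Face Learning. SST can be flexibly combined with existing loss functions and network architectures.
(4) We conduct comprehensive experiments that show the significant improvement SST brings to Shallow Face Learning. Additional experiments show that SST also has advantages on conventional deep data and in the pretrain-finetune task.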

2.1 Deep Face Recognition

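There are two major schemes in deep face recognition. On the one hand, the classification-based methods are developed from the softmax loss and its variants.

SphereFace [21] introduces an angular margin to enlarge the gap between classes.
CosFace [33] and AM-softmax [32] propose an additive margin on the positive logit.
ArcFace [7] applies an additive angular margin inside the cosine and gives a clearer geometric interpretation.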
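The three margin types listed above differ only in how they modify the positive (ground-truth) logit. The sketch below shows the commonly quoted forms as functions of the angle θ between a feature and its class weight; the default margin values are illustrative assumptions, and the SphereFace form is the simplified `cos(m·θ)` that is only valid for θ in [0, π/m], not the full piecewise function.

```python
import math

def sphereface_logit(theta, m=4):
    """Multiplicative angular margin: cos(m * theta), valid for theta in [0, pi/m]."""
    return math.cos(m * theta)

def cosface_logit(theta, m=0.35):
    """Additive margin on the cosine (CosFace / AM-softmax): cos(theta) - m."""
    return math.cos(theta) - m

def arcface_logit(theta, m=0.5):
    """Additive angular margin inside the cosine (ArcFace): cos(theta + m)."""
    return math.cos(theta + m)

theta = math.pi / 6  # a 30-degree angle between feature and class weight
plain = math.cos(theta)
for name, fn in [("SphereFace", sphereface_logit),
                 ("CosFace/AM-softmax", cosface_logit),
                 ("ArcFace", arcface_logit)]:
    # Every margin lowers the positive logit, so the sample must sit closer
    # to its class weight to reach the same predicted probability.
    print(f"{name}: {fn(theta):.4f} (plain cosine {plain:.4f})")
```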

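On the other hand, there are the feature-embedding methods.

The contrastive loss [6, 13, 28] and the triplet loss [24] compute pairwise Euclidean distances within sample pairs or triplets and optimize the model over the relations between samples.
The N-pairs loss [27] optimizes positive and negative pairs through a local softmax formulation within each mini-batch.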
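As a reminder of what the pair-based and triplet-based objectives look like, here is a minimal sketch of their common textbook forms on plain Python lists. Constant factors are omitted and the margin values are arbitrary illustrative choices, so this should not be read as the exact formulation used in [6, 13, 28] or [24].

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, same_id, margin=1.0):
    """Pull same-ID pairs together; push different-ID pairs at least `margin` apart."""
    d = euclidean(a, b)
    if same_id:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Require the anchor-negative distance to exceed the anchor-positive one by a margin."""
    d_pos = euclidean(anchor, positive) ** 2
    d_neg = euclidean(anchor, negative) ** 2
    return max(0.0, d_pos - d_neg + margin)

print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_id=True))   # squared distance of the pair
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_id=False))  # already farther than the margin
print(triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0]))         # negative is far enough away
```

Both losses only see the samples that happen to be paired inside a batch, which is why their efficiency depends on batch size and on the mining strategy.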

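Besides these two schemes, Zhu et al. [47] propose a classification-verification-classification training strategy and the DP-softmax loss to progressively improve performance on the ID versus Spot face recognition task.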

2.2 Low-shot Face Recognition

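Low-shot learning (LSL) in face recognition aims at closed-set identification with only a few face samples per ID.

Choe et al. [5] use data augmentation and generative methods to enlarge the training dataset.
Cheng et al. [3] propose an enforced softmax that combines optimal dropout, selective attenuation, L2 normalization and model-level optimization.
Wu et al. [38] develop hybrid classifiers using a CNN and a nearest-neighbor model.
Guo et al. [11] propose to align the norms of the weight vectors of the one-shot classes and the normal classes.
Yin et al. [41] augment the feature space of the low-shot classes by transferring principal components from the normal classes to the low-shot classes.

The methods above focus on the MS-Celeb-1M low-shot learning benchmark [11], which has relatively sufficient samples for each ID in the base set and only one sample for each ID in the novel set. The goal is to recognize faces from both the base set and the novel set.

As discussed in the previous section, however, Shallow Face Learning (SFL) and LSL differ in two respects.

First, LSL methods target closed-set classification. For example, in the MS-Celeb-1M low-shot learning benchmark, the test IDs are included in the training set. SFL, in contrast, includes open-set recognition, where the test samples belong to unseen classes.
Second, whereas LSL usually relies on transfer learning from a source dataset (pretraining) to the target low-shot dataset (finetuning), Shallow Face Learning argues for training from scratch on the target shallow dataset.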

2.3 Self-supervised Learning

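Recent self-supervised methods [8, 39, 48, 14] have made exciting progress in visual representation learning. Exemplar CNN [8] first introduces the concept of surrogate classes and adopts a parametric paradigm during training and testing. Memory Bank [39] formulates instance-level discrimination as a metric learning problem, in which the similarity between instances is computed from the features in a non-parametric way. MoCo [14] proposes a dynamic dictionary with a queue and a momentum-updated encoder, which builds a large and consistent dictionary on the fly and facilitates contrastive unsupervised learning. These methods treat each training sample as an instance-level class. Although they apply data augmentation to each sample, the instance-level classes still lack intra-class diversity, which is similar to the Shallow Face Learning problem. Inspired by the effectiveness of the self-supervised methods, we use similar techniques to address the issues in Shallow Face Learning, such as the moving-average updating of the Semi-Siamese backbone and the prototype queue for the supervised loss.

There are differences between those methods and ours, however. For example, the gallery queue of SST (Semi-Siamese Training) is built from samples of the gallery set rather than from sample augmentation, and SST targets Shallow Face Learning, which is a specific task within supervised learning. From the perspective of learning against the lack of intra-class diversity, our method generalizes the advantages of the self-supervised scheme to a supervised scheme on shallow data.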

3 The Proposed Approach

3.1 Shallow Face Learning Problem

Shallow Face Learning is a practical problem in real-world face recognition scenarios. For example, in identity verification applications, the face data usually contains one registration photo (gallery) and one spot photo (probe) for each ID. The number of IDs may be large, but the shallow depth leads to an extreme lack of intra-class information. Here we study how the current classification-based methods are affected by this problem and what consequences shallow data brings.

Most of the current mainstream methods are developed from softmax or its variants, which consist of a fully connected (FC) layer, the softmax function and the cross-entropy loss. The output of the FC layer is the inner product $w_j^T x_i$ of the $i$-th sample feature $x_i$ and the $j$-th class weight $w_j$. When the feature and the weight are normalized by their L2 norms, the inner product equals the cosine similarity, $w_j^T x_i = \cos(\theta_{i,j})$. For simplicity, we take the conventional softmax as an example; its loss function (omitting the bias term) can be expressed as:
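The displayed formula did not survive in this copy. Given the symbols defined in the surrounding text (batch size $N$, number of classes $n$, scale $s$, ground-truth label $y$, L2-normalized $w_j$ and $x_i$), the loss presumably has the standard scaled-softmax form, and $P_y$ is the term inside its logarithm. The following is a reconstruction under that assumption, not a quotation of the paper's Eq. (1):

```latex
L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\, w_y^T x_i}}{\sum_{j=1}^{n} e^{s\, w_j^T x_i}},
\qquad
P_y = \frac{e^{s\, w_y^T x_i}}{\sum_{j=1}^{n} e^{s\, w_j^T x_i}}
```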

where $N$ is the batch size, $n$ is the number of classes, $s$ is the scaling parameter, and $y$ is the ground-truth label of the $i$-th sample. The learning objective is to maximize the intra-class pair similarity $w_y^T x_i$ and to minimize the inter-class pair similarity $w_j^T x_i$, so as to obtain intra-class compact and inter-class separated features. The term inside the logarithm is the predicted probability on the ground-truth class, $P_y$.

From this we can further obtain:

This equation implies that the optimal solution of the prototype $w_y$ satisfies two conditions.

where $n_y$ is the number of samples in the class. Condition (i) means that, ideally, the optimal prototype $w_y$ is the class center, equal to the mean of the features in the class. Condition (ii), meanwhile, makes it possible for the prototype $w_y$ to collapse to zero in many dimensions. When $n_y$ is large enough (deep data), the $x_i$ are highly diverse, which keeps the prototype $w_y = \frac{1}{n_y}\sum_{i=1}^{n_y} x_i$ away from collapse. In shallow data ($n_y = 2$), by contrast, the prototype $w_y$ is determined by only two samples in the class, the gallery $x_g$ and the probe $x_p$.

As a result, the three vectors $w_y$, $x_g$ and $x_p$ quickly become very close ($w_y \approx x_g \approx x_p$), and this class reaches a very small loss value. Since the network is trained batch by batch with SGD, in each iteration the network is well fitted on a few classes and badly fitted on the others, so the total loss value oscillates and training is harmed (dotted curves in Figure 5a). Moreover, because all classes gradually lose their intra-class diversity in the feature space, i.e. $x_g \approx x_p$, condition (ii) pushes the prototype $w_y$ toward zero in most dimensions, and the prototypes can no longer span a discriminative feature space.
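The claim that a class with $w_y \approx x_p$ contributes almost nothing to training can be checked numerically. The sketch below assumes the standard scaled-softmax probability for $P_y$ and uses the fact that the gradient of $-\log P_y$ with respect to the ground-truth logit is $P_y - 1$. The scale of 30, the 1,000 competing classes and their cosine of 0 are hypothetical values chosen only for illustration.

```python
import math

def predicted_probability(cos_target, cos_others, scale):
    """Softmax probability of the ground-truth class computed from cosine logits."""
    target = math.exp(scale * cos_target)
    others = sum(math.exp(scale * c) for c in cos_others)
    return target / (target + others)

def loss_and_logit_gradient(cos_target, cos_others, scale=30.0):
    """Return (-log P_y, dL/d(target logit)); for cross-entropy the latter is P_y - 1."""
    p_y = predicted_probability(cos_target, cos_others, scale)
    return -math.log(p_y), p_y - 1.0

# Hypothetical setting: 1,000 other prototypes, all orthogonal to the sample (cosine 0).
others = [0.0] * 1000
for cos_target in (0.2, 0.5, 0.9, 0.99):
    loss, grad = loss_and_logit_gradient(cos_target, others)
    print(f"cos={cos_target:.2f}  loss={loss:.3e}  grad={grad:.3e}")
```

As the cosine between the sample and its own prototype approaches 1, both the loss and the gradient shrink toward zero, so such a class stops driving the optimization while poorly fitted classes still produce large losses.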

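To explore the consequences of the shallow data problem, we run experiments on both deep data and shallow data with the softmax, A-softmax [21], AM-softmax [32] and Arc-softmax [7] loss functions. The deep data is MS1M-v1c [30] (a cleaned version of MS-Celeb-1M [12]). The shallow data is a subset of MS1M-v1c, built by randomly selecting two face images per ID from the deep data. Table 1 shows not only the test accuracy on LFW [17] but also the accuracy on the training data. We find that when the training data changes from deep to shallow, softmax and A-softmax perform poorly in both training and testing, whereas AM-softmax and Arc-softmax score high in training but low in testing. We therefore argue that softmax and A-softmax suffer from model degeneration, while AM-softmax and Arc-softmax suffer from over-fitting. To support this argument further, we inspect the value of every entry in the prototypes $w_y$ and compute the distribution with a Parzen window. The distribution is shown in Figure 2a, where the horizontal axis is the entry value and the vertical axis is the density. Most prototype entries are close to zero, which means the feature space collapses in most dimensions. In such a reduced-dimensional space, the model can easily degenerate or over-fit.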

3.2 Semi-Siamese Training

The analysis above shows that when the data becomes shallow, the current methods suffer from model degeneration and over-fitting, and that the root cause is the collapse of the feature space. There are two directions for tackling this problem:

(1) update $w_y$ and $x_i$ correctly;
(2) keep the entries of $w_y$ away from zero.

In the first direction, the main issue is that the network is prevented from being optimized effectively. Recall condition (i) of Eq. (2) in shallow face learning, where each ID has only two face images, denoted $I_g$ (gallery) and $I_p$ (probe), with features $x_g = \phi(I_g)$ and $x_p = \phi(I_p)$, where $\phi$ is the Siamese backbone. By condition (i), $w_y = \frac{1}{2}(x_g + x_p)$. Because intra-class diversity is lacking, the gallery and probe features are usually very close, so $w_y = \frac{1}{2}(x_g + x_p) \approx x_g \approx x_p$. As studied in the previous subsection, this makes the loss value oscillate and prevents effective optimization of the network. The basic idea for solving this is to keep $x_g$ at a certain distance from $x_p$, i.e. $x_g = x_p + \epsilon$, $\forall \epsilon > 0$. To maintain the distance between $x_g$ and $x_p$, we propose to make the Siamese network $\phi$ Semi-Siamese. Specifically, the gallery-set network $\phi_g$ takes the gallery as input, and the probe-set network $\phi_p$ takes the probe. $\phi_g$ and $\phi_p$ share the same architecture but have non-identical parameters, i.e. $\phi_g = \phi_p + \epsilon'$, so that their features are not attracted to each other: $\phi_g(I_g) = \phi_p(I_p) + \epsilon$. There are several ways to implement the Semi-Siamese networks. For example, one can add a network constraint $\|\phi_g - \phi_p\| < \epsilon'$ to the training loss, such as $L_{total} = L + \lambda \cdot \|\phi_g - \phi_p\|$, where the non-negative parameter $\lambda$ balances the network constraint within the training loss. Another option, as suggested by MoCo [14], is to update the gallery-set network in a momentum way.
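The update formula is not shown in this copy. The `moving_average` function in the code at the end of this post applies the rule below (with its `alpha` argument in the role of $m$), so the missing equation presumably reads as follows; treat it as a reconstruction rather than a quotation:

```latex
\phi_g \leftarrow m \cdot \phi_g + (1 - m) \cdot \phi_p
```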

where $m$ is the weight of the moving average, and the probe-set network $\phi_p$ is updated by SGD with respect to the training loss. Both $\lambda$ and $m$ are instances of $\epsilon'$ that keep $\phi_g$ and $\phi_p$ similar. We compare the different implementations of the Semi-Siamese networks in the experiments and find that the moving-average style improves performance markedly. Because intra-class diversity is maintained, the training loss decreases steadily without oscillation (solid curves in Figure 5a).
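A toy simulation can illustrate how the moving-average update keeps the two parameter sets close but never identical. Everything in this sketch is a stand-in: the "networks" are flat lists of eight numbers, the gradients are random noise rather than gradients of a real loss, and the values `m=0.9` and `lr=0.05` are arbitrary.

```python
import random

def sgd_step(params, grads, lr):
    """Plain SGD update of the probe-set parameters."""
    return [p - lr * g for p, g in zip(params, grads)]

def momentum_update(gallery, probe, m):
    """gallery <- m * gallery + (1 - m) * probe, element-wise."""
    return [m * g + (1.0 - m) * p for g, p in zip(gallery, probe)]

def parameter_gap(a, b):
    """Euclidean distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

rng = random.Random(0)
probe = [rng.uniform(-1, 1) for _ in range(8)]
gallery = list(probe)                      # both networks start from the same parameters
gaps = []
for step in range(50):
    fake_grads = [rng.gauss(0, 1) for _ in probe]   # stand-in for real loss gradients
    probe = sgd_step(probe, fake_grads, lr=0.05)    # probe-set network follows SGD
    gallery = momentum_update(gallery, probe, m=0.9)  # gallery-set network lags behind
    gaps.append(parameter_gap(gallery, probe))

print(f"smallest gap {min(gaps):.4f}, largest gap {max(gaps):.4f}")
```

With `m = 0` the gallery parameters would simply copy the probe parameters (a fully Siamese pair), and with `m = 1` they would never move; intermediate values give the "close but not identical" behaviour that the text asks for.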

In the second direction, a straightforward idea is to add a prototype constraint to the training loss that enlarges the entries of the prototypes, such as $L + \beta(\alpha - \|w_y\|)$, where $\alpha$ and $\beta$ are parameters. However, we find that this technique enlarges the entries in most dimensions indiscriminately (green distribution in Figure 2b) and degrades performance (Table 2). Instead of manipulating $w_y$, we propose to replace $w_y$ with the gallery feature $x_g$ as the prototype. The prototype then depends entirely on the output of the backbone, which avoids the problem of the entries of the parameter $w_y$ going to zero. The red distribution in Figure 2b shows that the feature-based prototypes avoid the collapse and retain more discriminative components than the prototype constraint does. Removing $w_y$ also reduces the over-fitting risk that comes with the heavy FC parameters. The whole prototype set is kept up to date by maintaining a gallery queue. Some self-supervised methods [39, 14] have studied this technique and its further advantages, such as better generalization when encountering unseen test IDs.
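The idea of using queued gallery features as prototypes can be sketched in a few lines. This is a simplified illustration on plain lists, not the `Prototype` module imported by the training code at the end of this post (its source is not shown here); the class name, the queue size of 4 and the scale of 30 are hypothetical.

```python
import math
from collections import deque

def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

class GalleryQueue:
    """Fixed-size FIFO of gallery features that act as class prototypes."""

    def __init__(self, max_size):
        # A deque with maxlen drops the oldest entries automatically when full.
        self.features = deque(maxlen=max_size)

    def enqueue(self, gallery_features):
        """Append the newest gallery features, evicting the oldest if needed."""
        for feat in gallery_features:
            self.features.append(l2_normalize(feat))

    def logits(self, probe_feature, scale=30.0):
        """Scaled cosine similarity between one probe feature and every queued prototype."""
        probe = l2_normalize(probe_feature)
        return [scale * sum(p * q for p, q in zip(probe, proto))
                for proto in self.features]

queue = GalleryQueue(max_size=4)
queue.enqueue([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])   # toy gallery features of three IDs
scores = queue.logits([0.9, 0.1])                       # probe feature of the first ID
print(scores.index(max(scores)))                        # position of the best-matching prototype
```

Because the prototypes are backbone outputs rather than free FC weights, there is no per-class parameter that can drift to zero, and the logits can be fed to any softmax-style loss in place of the FC output.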

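In summary, our Semi-Siamese Training method is developed along both directions to address the Shallow Face Learning problem. The forward-propagation backbone consists of a pair of Semi-Siamese networks, which encode the gallery and probe features respectively. The training loss is computed with an updating gallery queue, so the network is optimized effectively on shallow data. This training scheme can be integrated with any existing loss function (classification loss or embedding loss) and any network architecture (as shown in Figure 3).

To recap: SST (Semi-Siamese Training) is a training method for face recognition models on shallow data. The model being trained is a pair of Semi-Siamese networks, a main model and an auxiliary model. In every iteration the input is two face images of the same ID (a registration photo and a spot photo). The auxiliary model extracts face features from the registration photos and maintains them in a dynamic feature queue that is updated as training proceeds. The loss is computed from the features the main model extracts from the spot photos and the dynamic feature queue. After the loss is obtained, the main model is updated by stochastic gradient descent, and the auxiliary model is updated as a moving average of its current state and the main model. When training is finished, the main model is used for face recognition testing.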

4 Experiments

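This section is organized as follows. Section 4.1 introduces the datasets and experimental settings. Section 4.2 contains the ablation study of SST. Section 4.3 shows the significant improvement SST brings to shallow face learning with various loss functions. Section 4.4 shows the convergence of SST with various backbones. Section 4.5 shows that SST also achieves leading performance on deep face data. Section 4.6 shows that SST outperforms conventional training in the pretrain-finetune task as well.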

4.1 Datasets and Experimental Settings

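Training data. For reproducibility, we use public datasets for training. To construct the shallow data, we randomly select two images per ID from the MS1M-v1c [30] dataset. The shallow data therefore contains 72,778 IDs and 145,556 images. For the deep data, we use the full MS1M-v1c dataset, which has 44 images per ID on average. In addition, we use a real-world surveillance face recognition benchmark, QMUL-SurvFace [4], for the pretrain-finetune experiment.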
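The construction of the shallow data can be sketched as follows. The dictionary layout, the file names and the choice to skip IDs with fewer than two images are assumptions made for this illustration; the text only states that two images per ID are selected at random.

```python
import random

def build_shallow_subset(images_by_id, rng):
    """Keep exactly two randomly chosen images for every ID that has at least two."""
    shallow = {}
    for identity, images in images_by_id.items():
        if len(images) >= 2:
            shallow[identity] = rng.sample(images, 2)   # two distinct images
    return shallow

# Toy stand-in for a deep dataset: ID -> list of image keys (hypothetical names).
deep = {f"id_{i}": [f"id_{i}/img_{k}.jpg" for k in range(5 + i)] for i in range(10)}
shallow = build_shallow_subset(deep, random.Random(0))

num_ids = len(shallow)
num_images = sum(len(v) for v in shallow.values())
print(num_ids, num_images)   # the image count is always twice the ID count
```

The same arithmetic matches the reported figures: 72,778 IDs with two images each give 145,556 images.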

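Test data. For a thorough evaluation, we adopt the LFW [17], BLUFR [19], AgeDB-30 [22], CFP-FP [25], CALFW [46], CPLFW [45], MegaFace [18] and QMUL-SurvFace [4] datasets. AgeDB-30 and CALFW focus on face verification with large age gaps. CFP-FP and CPLFW target face verification under pose variation. BLUFR is dedicated to evaluation at low false accept rates (FAR), and we report the verification rate at the lowest FAR (1e-5) on BLUFR. MegaFace evaluates large-scale face recognition with millions of distractors. The QMUL-SurvFace test set targets real-world surveillance face recognition and has a large domain gap from the benchmarks above.

Preprocessing. All face images are detected by FaceBoxes [42]. The faces are then aligned and cropped to 144×144 RGB images using five facial landmarks [10].

CNN architecture. To balance performance and time cost, we use MobileFaceNet [2] as the backbone in the ablation study and in the experiments with various loss functions. For the deep data and pretrain-finetune experiments, we use Attention-56 [31]. The output is a 512-dimensional feature. We also use additional backbones, including VGG-16 [26], SE-ResNet-18 [16], ResNet-50 and ResNet-101 [15], to demonstrate the convergence of SST with different architectures.

Training and evaluation. Training uses four NVIDIA Tesla P40 GPUs. The batch size is 256 and the learning rate starts at 0.05. In the shallow data experiments, the learning rate is divided by 10 at 36k and 54k iterations, and training ends at 64k iterations. For the deep data, the learning rate is divided by 10 at 72k, 96k and 112k iterations, and training ends at 120k iterations. For the pretrain-finetune experiment, the learning rate starts at 0.001, is divided by 10 at 6k and 9k iterations, and training ends at 10k iterations. The size of the gallery queue depends on the number of classes in the training dataset, so we set it empirically to 16,384 (for shallow and deep data) and 2,560 (for QMUL-SurvFace). In the evaluation stage, we extract the last-layer output of the probe-set network as the face representation. Cosine similarity is used as the similarity metric. For a strict and precise evaluation, all overlapping IDs between the training and test datasets are removed according to the list in [35].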
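The three learning-rate schedules described above are ordinary step schedules and can be written down compactly. The helper below is a sketch: the function and dictionary names are invented for this example, and it assumes the drop takes effect exactly at the stated iteration, which the text does not specify.

```python
def step_lr(iteration, base_lr, milestones, gamma=0.1):
    """Multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * (gamma ** passed)

# Settings taken from the text; the final iteration is where training stops.
SCHEDULES = {
    "shallow":  {"base_lr": 0.05,  "milestones": (36_000, 54_000),           "end": 64_000},
    "deep":     {"base_lr": 0.05,  "milestones": (72_000, 96_000, 112_000),  "end": 120_000},
    "finetune": {"base_lr": 0.001, "milestones": (6_000, 9_000),             "end": 10_000},
}

for name, cfg in SCHEDULES.items():
    first = step_lr(0, cfg["base_lr"], cfg["milestones"])
    last = step_lr(cfg["end"] - 1, cfg["base_lr"], cfg["milestones"])
    print(f"{name}: starts at {first:g}, ends at {last:g}")
```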

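Loss functions. SST can be flexibly integrated with existing training loss functions. We take both classification and embedding learning loss functions as baselines and compare them with their SST-integrated versions. The classification loss functions include A-softmax [21], AM-softmax [32], Arc-softmax [7], AdaCos [43], MV-softmax [36], DP-softmax [47] and Center loss [37]. The embedding learning methods include Contrastive [28], Triplet [24] and N-pairs [27].

Implementation of SST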

4.2 Ablation Study

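We analyze each technique in SST and compare it with the alternatives mentioned earlier, namely the network constraint ($\|\phi_g - \phi_p\| < \epsilon'$) and the prototype constraint ($\beta(\alpha - \|w_y\|)$). Table 2 compares their performance with four basic loss functions (softmax, A-softmax, AM-softmax and Arc-softmax). In this table, "Org." denotes plain training, "A" the prototype constraint, "B" the network constraint, "C" the gallery queue, "D" the combination of "B" and "C", and "SST" the final scheme of Semi-Siamese Training, which consists of the moving-average-updated Semi-Siamese networks and the training scheme with the gallery queue.

The following conclusions can be drawn from Table 2:

(1) The naive prototype constraint "A" leads to a decrease in most entries, which means that enlarging $w_y$ indiscriminately in every dimension does not help shallow face learning.
(2) The network constraint "B" and the gallery queue "C" each bring a steady increase, and their combination "D" improves the results further.
(3) Finally, SST, which uses moving-average updating together with the gallery queue, achieves the best results on all entries.

The comparison shows that SST handles the issues in shallow face learning well and brings a significant improvement in test accuracy.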

4.3 SST with Various Loss Functions

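First, we train the network on shallow data with various loss functions and test it on BLUFR at FAR=1e-5 (blue bars in Figure 4). The loss functions cover both classification and embedding losses: softmax, A-softmax, AM-softmax, Arc-softmax, AdaCos, MV-softmax, DP-softmax, Center loss, Contrastive, Triplet and N-pairs.

Then we train the same network with the same loss functions on shallow data under the SST scheme. As shown in Figure 4, SST can be flexibly integrated with every loss function and brings a significant improvement in shallow face learning (orange bars). In addition, we apply hard example mining when training with MV-softmax and with the embedding losses. The results show that SST also works well together with hard example mining.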

4.4 SST with Various Network Architectures

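To demonstrate stable convergence during training, we train different CNN architectures with SST, including MobileFaceNet, VGG-16, SE-ResNet-18, Attention-56, ResNet-50 and ResNet-101. As shown in Figure 5b, the loss curves of conventional training (dotted curves) oscillate, whereas every SST loss curve (solid curves) decreases steadily, which indicates that each network converges when trained with SST. The numbers in the legend of Figure 5b are the test results of each network on BLUFR. With conventional training, the test accuracy drops as the network gets deeper, which shows that a larger model size aggravates model degeneration and over-fitting. With SST, in contrast, the test accuracy increases as the network becomes heavier, which indicates that SST contributes more and more with more complicated architectures.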

4.5 SST on Deep Data Learning

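The previous experiments show that SST handles the issues in shallow face learning well and brings a significant improvement in test accuracy. To explore the advantages of SST in a wider scope, we apply the SST scheme to the deep data (the full version of MS1M-v1c) and compare it with conventional training. Table 3 shows the performance on LFW, AgeDB-30, CFP-FP, CALFW, CPLFW, BLUFR and MegaFace. SST achieves leading accuracy on most of the test sets and competitive results on CALFW and BLUFR. SST (softmax) improves accuracy by at least one percentage point on AgeDB-30, CFP-FP, CALFW and CPLFW, which contain hard cases with large pose variation or large age gaps. Notably, SST removes the large number of FC parameters that conventional training needs for computing the classification loss. More results on deep data can be found in the supplementary material.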

4.6 Pretrain and Finetune

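In real-world face recognition, there is a large domain gap between the public training datasets and the captured face images. The public training datasets, such as MS-Celeb-1M and VGGFace2, consist of well-posed face images collected from the internet, whereas real applications are usually very different. The typical way to cope with this is to pretrain the network on a public training dataset and then finetune it on real-world face data. Although SST is designed for training on shallow data from scratch, we are still interested in using SST to deal with the challenges of the finetuning task. In this subsection we therefore conduct an extra experiment: we first pretrain with softmax on MS1M-v1c and then finetune on QMUL-SurvFace shallow data, with and without SST. The shallow data is built by randomly selecting two samples for each ID from QMUL-SurvFace. The network is then finetuned on the QMUL-SurvFace shallow data with or without SST, and evaluated on the QMUL-SurvFace test set. Table 4 shows that, for both classification learning and embedding learning, SST significantly improves performance in both verification and identification compared with conventional training.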

5 Conclusions

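In this paper, we study for the first time a critical problem in real-world face recognition, Shallow Face Learning, which has been overlooked before. We analyze the problems that existing training methods encounter in Shallow Face Learning. The core issue is that training difficulty and feature space collapse lead to model degeneration and over-fitting. We then propose a novel training method, Semi-Siamese Training (SST), to address the challenges of Shallow Face Learning. Specifically, SST employs Semi-Siamese networks and builds a gallery queue from gallery features to overcome these issues. SST can be flexibly integrated with existing training loss functions and network architectures. Experiments on shallow data show that SST significantly improves on conventional training. Additional experiments further explore the advantages of SST in a wider scope, such as deep data training and pretrain-finetune development.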

6 Supplementary Material

6.1 Additional Experiments and Analysis

1. SST on Deep Data Learning

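First, we give more details about applying SST to deep data learning. In each iteration of deep data training, we randomly sample a batch of IDs and, for each ID, randomly sample two images. Either image serves as the gallery and the other as the probe, so every image has a 50% chance of playing the gallery or the probe role. In addition, as a supplement to the experiment in Section 4.5 of the main text, we evaluate SST with the DP-softmax [47], Contrastive [28], Triplet [24] and N-pairs [27] loss functions. The evaluation uses seven test benchmarks: LFW [17], BLUFR [19], AgeDB [22], CFP [25], CALFW [46], CPLFW [45] and MegaFace [18]. The results show that all loss functions perform better on the various benchmarks once SST is used. Moreover, the original embedding loss functions (Contrastive, Triplet and N-pairs) perform poorly in the strict FAR range (e.g. BLUFR and MegaFace), and they improve markedly on these benchmarks after being integrated with SST.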
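The sampling procedure described in this paragraph can be sketched as below. The function name, the dictionary layout and the toy data are hypothetical; the `train_sample` function in the code at the end of this post follows the same idea of drawing two images per ID with `random.sample`.

```python
import random

def sample_sst_batch(images_by_id, batch_size, rng):
    """Pick batch_size IDs, then two distinct images per ID with random gallery/probe roles."""
    batch = []
    for identity in rng.sample(sorted(images_by_id), batch_size):
        # rng.sample returns the two images in random order, so each image of the
        # chosen pair is equally likely to end up as the gallery.
        first, second = rng.sample(images_by_id[identity], 2)
        batch.append({"id": identity, "gallery": first, "probe": second})
    return batch

# Toy deep dataset: ID -> image keys (hypothetical names).
deep = {f"id_{i}": [f"id_{i}/img_{k}.jpg" for k in range(6)] for i in range(20)}
rng = random.Random(0)
for item in sample_sst_batch(deep, batch_size=4, rng=rng):
    print(item["id"], item["gallery"], item["probe"])
```

Because the roles are reassigned in every iteration, an ID with many images contributes a different gallery/probe pair each time it is drawn, which is how deep data is fed through the same two-image interface as shallow data.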

2. Ablation Study

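In the ablation study of the main text (Section 4.2), we can see that any combination of the gallery queue with a Semi-Siamese backbone (whether through the network constraint or through momentum) gives the most significant boost for every training loss function. In addition, AM-softmax gains more from SST than Arc-softmax does. We hypothesize that the angular margin of Arc-softmax provides stronger supervision than that of AM-softmax, and that such strong supervision distorts the feature space to some extent, because the margin penalty acts on feature-feature pairs rather than feature-FC pairs (especially when training from scratch).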

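3. Pretrain and Finetune

In the pretrain-finetune experiment of the main text (Section 4.6), the improvement differs between the softmax-based methods (softmax, AM-softmax, Arc-softmax) and the pair/triplet-based methods (Contrastive, Triplet, N-pairs). We believe the original softmax-based methods reach sub-optimal results in this experiment because of the large number of parameters in the classification FC layer. After integration with SST, the FC layer is replaced by an updating feature queue, which greatly eases the optimization problem. The pair/triplet methods, meanwhile, already use features rather than an FC layer in their original versions. Therefore, in the finetuning stage, SST benefits the softmax-based methods more than the pair/triplet methods.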

References

[1] Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: Vggface2: A dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). pp. 67--74 (2018)
[2] Chen, S., Liu, Y., Gao, X., Han, Z.: Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In: Chinese Conference on Biometric Recognition. pp. 428--438 (2018)
[3] Cheng, Y., Zhao, J., Wang, Z., Xu, Y., Jayashree, K., Shen, S., Feng, J.: Know you at one glance: A compact vector representation for low-shot learning. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 1924--1932 (2017)
[4] Cheng, Z., Zhu, X., Gong, S.: Surveillance face recognition challenge. arXiv preprint arXiv:1804.09691 (2018)
[5] Choe, J., Park, S., Kim, K., Hyun Park, J., Kim, D., Shim, H.: Face generation for low-shot learning using generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 1940--1948 (2017)
[6] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. vol. 1, pp. 539--546 (2005)
[7] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4690--4699 (2019)
[8] Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 766--774 (2014)
[9] Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 594--611 (2006)
[10] Feng, Z.H., Kittler, J., Awais, M., Huber, P., Wu, X.J.: Wing loss for robust facial landmark localisation with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2235--2245 (2018)
[11] Guo, Y., Zhang, L.: One-shot face recognition by promoting underrepresented classes. arXiv preprint arXiv:1707.05574 (2017)
[12] Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In: European Conference on Computer Vision. pp. 87--102 (2016)
[13] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. vol. 2, pp. 1735--1742 (2006)
[14] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
[15] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770--778 (2016)
[16] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132--7141 (2018)
[17] Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments (2008)
[18] Kemelmacher-Shlizerman, I., Seitz, S.M., Miller, D., Brossard, E.: The megaface benchmark: 1 million faces for recognition at scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4873--4882 (2016)
[19] Liao, S., Lei, Z., Yi, D., Li, S.Z.: A benchmark study of large-scale unconstrained face recognition. In: IEEE International Joint Conference on Biometrics. pp. 1--8 (2014)
[20] Liu, H., Zhu, X., Lei, Z., Li, S.Z.: Adaptiveface: Adaptive margin and sampling for face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11947--11956 (2019)
[21] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 212--220 (2017)
[22] Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., Zafeiriou, S.: Agedb: the first manually collected, in-the-wild age database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 51--59 (2017)
[23] Ranjan, R., Castillo, C.D., Chellappa, R.: L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507 (2017)
[24] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 815--823 (2015)
[25] Sengupta, S., Chen, J.C., Castillo, C., Patel, V.M., Chellappa, R., Jacobs, D.W.: Frontal to profile face verification in the wild. In: 2016 IEEE Winter Conference on Applications of Computer Vision. pp. 1--9 (2016)
[26] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2015)
[27] Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: Advances in Neural Information Processing Systems. pp. 1857--1865 (2016)
[28] Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in Neural Information Processing Systems. pp. 1988--1996 (2014)
[29] Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1701--1708 (2014)
[30] trillionpairs.org: Ms-Celeb-1M-v1c. trillionpairs.deepglint.com/overview
[31] Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3156--3164 (2017)
[32] Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification. IEEE Signal Processing Letters 25(7), 926--930 (2018)
[33] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5265--5274 (2018)
[34] Wang, L., Li, Y., Wang, S.: Feature learning for one-shot face recognition. In: 2018 25th IEEE International Conference on Image Processing. pp. 2386--2390 (2018)
[35] Wang, X., Wang, S., Wang, J., Shi, H., Mei, T.: Co-mining: Deep face recognition with noisy labels. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9358--9367 (2019)
[36] Wang, X., Zhang, S., Wang, S., Fu, T., Shi, H., Mei, T.: Mis-classified vector guided softmax loss for face recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)
[37] Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision. pp. 499--515 (2016)
[38] Wu, Y., Liu, H., Fu, Y.: Low-shot face recognition with hybrid classifiers. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 1933--1939 (2017)
[39] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via nonparametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3733--3742 (2018)
[40] Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. arXiv preprint arXiv:1411.7923 (2014)
[41] Yin, X., Yu, X., Sohn, K., Liu, X., Chandraker, M.: Feature transfer learning for face recognition with under-represented data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5704--5713 (2019)
[42] Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., Li, S.Z.: Faceboxes: A cpu real-time face detector with high accuracy. In: 2017 IEEE International Joint Conference on Biometrics. pp. 1--9 (2017)
[43] Zhang, X., Zhao, R., Qiao, Y., Wang, X., Li, H.: Adacos: Adaptively scaling cosine logits for effectively learning deep face representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10823--10832 (2019)
[44] Zhao, K., Xu, J., Cheng, M.M.: Regularface: Deep face recognition via exclusive regularization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1136--1144 (2019)
[45] Zheng, T., Deng, W.: Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Tech. Rep pp. 18--01 (2018)
[46] Zheng, T., Deng, W., Hu, J.: Cross-age lfw: A database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197 (2017)
[47] Zhu, X., Liu, H., Lei, Z., Shi, H., Yang, F., Yi, D., Qi, G., Li, S.Z.: Large-scale bisample learning on id versus spot face recognition. International Journal of Computer Vision vol. 127, pp. 684--700 (2019)
[48] Zhuang, C., Zhai, A.L., Yamins, D.: Local aggregation for unsupervised learning of visual embeddings. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6002--6012 (2019)

Code Implementation


from networks import MobileFaceNet
from tensorboardX import SummaryWriter
from prototype import Prototype
from datetime import datetime
from torch.utils.data import Dataset, DataLoader
from torch import optim


import os
import argparse
import numpy as np
import torch
import random
import logging as logger
logger.basicConfig(level=logger.INFO, format='%(levelname)s %(asctime)s %(filename)s: %(lineno)d] %(message)s',
                   datefmt='%Y-%m-%d %H:%M:%S')



def train_BN(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.train()

def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

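def moving_average(probe, gallery, alpha):
    # Momentum update of the gallery-set network from the probe-set network:
    # gallery <- alpha * gallery + (1 - alpha) * probe.
    # With alpha = 0 this simply copies the probe parameters into the gallery network.
    for param_probe, param_gallery in zip(probe.parameters(), gallery.parameters()):
        param_gallery.data = alpha * param_gallery.data + (1 - alpha) * param_probe.detach().data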

        
####### shuffleBN for batch, the same as MoCo https://arxiv.org/abs/1911.05722 #######
def shuffle_BN(batch_size):
    shuffle_ids = torch.randperm(batch_size).long().cuda()
    reshuffle_ids = torch.zeros(batch_size).long().cuda()
    reshuffle_ids.index_copy_(0, shuffle_ids, torch.arange(batch_size).long().cuda())
    return shuffle_ids, reshuffle_ids
  
  
def trainlist_to_dict(source_file):
    trainfile_dict = {}
    with open(source_file, 'r') as infile:
        for line in infile:
            l = line.rstrip().lstrip()
            if len(l) > 0:
                lmdb_key, label = l.split(' ')
                label = int(label)
                if label not in trainfile_dict:
                    trainfile_dict[label] = {'lmdb_key':[],'num_images':0}
                trainfile_dict[label]['lmdb_key'].append(lmdb_key)
                trainfile_dict[label]['num_images'] += 1
    return trainfile_dict


def train_sample(train_dict, class_num, queue_size, last_id_list=False):
    # Build a real list: random.shuffle below needs a mutable sequence, not a range object.
    all_id = list(range(class_num))
    # Make sure there is no overlap ids between queue and curr batch.
    if last_id_list:
        last_tail_id= last_id_list[-queue_size:]
        non_overlap_id = list(set(all_id) - set(last_tail_id))
        assert len(non_overlap_id) >= queue_size
        curr_head_id = random.sample(non_overlap_id, queue_size)
        curr_remain_id = list(set(all_id) - set(curr_head_id))
        random.shuffle(curr_remain_id)
        curr_head_id.extend(curr_remain_id) 
        curr_id_list = curr_head_id
    else:
        random.shuffle(all_id)
        curr_id_list = all_id
    
    # For each ID, two images are randomly sampled
    curr_train_list =[]
    for index in curr_id_list:
        lmdb_key_list =  train_dict[index]['lmdb_key']
        if int(train_dict[index]['num_images']) >= 2:
            training_samples = random.sample(lmdb_key_list, 2)
            line = training_samples[0] + ' ' + training_samples[1]
        else:
            line = lmdb_key_list[0] + ' '+ lmdb_key_list[0]
        curr_train_list.append(line+ ' '+ str(index) +'\n')
    return curr_train_list,curr_id_list


def train_one_epoch(data_loader, probe_net, gallery_net, prototype, optimizer, 
    criterion, cur_epoch, conf):
    db_size = len(data_loader)
    check_point_size = (db_size // 2)
    batch_idx = 0
    initial_lr = get_lr(optimizer)

    probe_net.train()
    gallery_net.eval().apply(train_BN)

    for batch_idx, (images, _ ) in enumerate(data_loader):
        batch_size = images.size(0)
        global_batch_idx = (cur_epoch - 1) * db_size + batch_idx

        # the label of current batch in prototype queue
        # (torch.arange gives a 1-D int64 tensor directly, so no squeeze is needed)
        label = (torch.arange(batch_size) + global_batch_idx * batch_size) % conf.queue_size
        label = label.cuda()
        images = images.cuda()
        x1, x2 = torch.split(images, [3, 3], dim=1)  

        # set inputs as probe or gallery 
        shuffle_ids, reshuffle_ids = shuffle_BN(batch_size)
        x1_probe = probe_net(x1)
        with torch.no_grad():
            x2 = x2[shuffle_ids]
            x2_gallery = gallery_net(x2)[reshuffle_ids]
            x2 = x2[reshuffle_ids]
     
        shuffle_ids, reshuffle_ids = shuffle_BN(batch_size)
        x2_probe = probe_net(x2)
        with torch.no_grad():
            x1 = x1[shuffle_ids]
            x1_gallery = gallery_net(x1)[reshuffle_ids]
            x1 = x1[reshuffle_ids]
            
        output1, output2  = prototype(x1_probe,x2_gallery,x2_probe,x1_gallery,label)
        loss = criterion(output1, label) + criterion(output2, label)
        
        # update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        moving_average(probe_net, gallery_net, conf.alpha)

        if batch_idx % conf.print_freq == 0:
            loss_val = loss.item()
            lr = get_lr(optimizer)
            logger.info('epoch %d, iter %d, lr %f, loss %f'  % (cur_epoch, batch_idx, lr, loss_val))
            conf.writer.add_scalar('Train_loss', loss_val, global_batch_idx)
            conf.writer.add_scalar('Train_lr', lr, global_batch_idx)

    if cur_epoch % conf.save_freq == 0 :
        saved_name = ('{}_epoch_{}.pt'.format(conf.model_type,cur_epoch))
        torch.save(probe_net.state_dict(), os.path.join(conf.saved_dir, saved_name))
        logger.info('save checkpoint %s to disk...' % saved_name)

def train_sst(conf):
    probe_net = MobileFaceNet(conf.feat_dim)
    gallery_net = MobileFaceNet(conf.feat_dim) 
        
    moving_average(probe_net, gallery_net, 0)
    prototype = Prototype(conf.feat_dim, conf.queue_size, conf.scale,conf.margin, conf.loss_type).cuda()     
    criterion = torch.nn.CrossEntropyLoss().cuda()
    optimizer = optim.SGD(probe_net.parameters(), lr=conf.lr, momentum=conf.momentum, weight_decay=5e-4)
    lr_schedule = optim.lr_scheduler.MultiStepLR(optimizer, milestones=conf.lr_decay_epochs, gamma=0.1)
    probe_net = torch.nn.DataParallel(probe_net).cuda()
    gallery_net = torch.nn.DataParallel(gallery_net).cuda()
    train_dict = trainlist_to_dict(conf.source_file)

    for epoch in range(1, conf.epochs + 1):
        if epoch == 1:
            curr_train_list, curr_id_list = train_sample(train_dict, conf.class_num, conf.queue_size)
        else:
            curr_train_list, curr_id_list = train_sample(train_dict, conf.class_num, conf.queue_size, curr_id_list)
        data_loader = DataLoader(Dataset(conf.source_lmdb, curr_train_list, conf.key),
                                 conf.batch_size, shuffle = False, num_workers=4, drop_last = True)
        train_one_epoch(data_loader, probe_net, gallery_net, prototype, optimizer, 
            criterion, epoch, conf)
        lr_schedule.step()



if __name__ == '__main__':
    conf = argparse.ArgumentParser(description='train arcface on face database.')
    conf.add_argument('--key', type=int, default=None, help='you must give a key before training.')
    conf.add_argument("--train_db_dir", type=str, default='/export/home/data', help="directory containing the training database.")
    conf.add_argument("--train_db_name", type=str, default='deepglint_unoverlap_part40', help="name of the training database.")
    conf.add_argument("--train_file_dir", type=str, default='/export/home/data/deepglint_unoverlap_part40', help="input train file dir.")
    conf.add_argument("--train_file_name", type=str, default='deepglint_train_list.txt', help="input train file name.")
    conf.add_argument("--output_model_dir", type=str, default='./snapshot', help=" save model paths")
    conf.add_argument('--model_type',type=str, default='mobilefacenet',choices=['mobilefacenet'], help='choose model_type')    
    conf.add_argument('--feat_dim', type=int, default=512, help='feature dimension.')
    conf.add_argument('--queue_size', type=int, default=16384, help='size of prototype queue')
    conf.add_argument('--class_num', type=int, default=72778, help='number of categories')
    conf.add_argument('--loss_type', type=str, default='softmax',choices=['softmax','am_softmax','arc_softmax'], help="loss type, can be softmax, am or arc")
    conf.add_argument('--margin', type=float, default=0.0, help='loss margin ')
    conf.add_argument('--scale', type=float, default=30.0, help='scaling parameter ')
    conf.add_argument('--lr', type=float, default=0.05, help='initial learning rate.')
    conf.add_argument('--epochs', type=int, default=100, help='training epochs')
    conf.add_argument('--lr_decay_epochs', type=str, default='48,72,90', help='comma separated epochs at which the learning rate is decayed')
    conf.add_argument('--momentum', type=float, default=0.9, help='momentum')
    conf.add_argument('--alpha', type=float, default=0.999, help='weight of moving_average')
    conf.add_argument('--batch_size', type=int, default=128, help='batch size over all gpus.')
    conf.add_argument('--print_freq', type=int, default=100, help='frequency of displaying current training state.')
    conf.add_argument('--save_freq', type=int, default=1, help='frequency of saving current training state.')
    args = conf.parse_args()
    args.lr_decay_epochs = [int(p) for p in args.lr_decay_epochs.split(',')]
    args.source_file = os.path.join(args.train_file_dir, args.train_file_name)
    args.source_lmdb = os.path.join(args.train_db_dir, args.train_db_name)

    subdir =datetime.strftime(datetime.now(),'%Y%m%d_%H%M%S')
    loss_type=args.loss_type
    args.saved_dir = os.path.join(args.output_model_dir,loss_type,subdir)
    if not os.path.exists(args.saved_dir):
        os.makedirs(args.saved_dir)
    writer = SummaryWriter(log_dir=args.saved_dir)
    args.writer = writer
    logger.info('Start optimization.')
    logger.info(args)
    train_sst(args)
    logger.info('Optimization done!')

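The train_sst function sets up and runs SST training for a MobileFaceNet model. It takes a configuration object conf as its only argument. A step-by-step walkthrough follows: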

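Network initialization:
- probe_net and gallery_net are two separate MobileFaceNet instances with the same architecture; both extract features.
- moving_average(probe_net, gallery_net, 0) is called with a weight of 0. Its definition is not shown in this section, but given how it is used later with conf.alpha, this call most likely copies the probe_net weights into gallery_net so that both networks start from identical parameters.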
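Other components:
- prototype is an instance of the Prototype class, built from the feature dimension, queue size, scale, margin, and loss type; it is moved to the GPU.
- criterion is the cross-entropy loss, also moved to the GPU.
- optimizer is an SGD optimizer over the parameters of probe_net only, configured with the learning rate, momentum, and a weight decay of 5e-4.
- lr_schedule is a MultiStepLR scheduler that multiplies the learning rate by 0.1 at each epoch listed in conf.lr_decay_epochs.
- torch.nn.DataParallel wraps probe_net and gallery_net for multi-GPU data parallelism, and .cuda() moves them to the GPU.
Loading the training data:
- trainlist_to_dict reads the training list at conf.source_file and converts it into a dictionary that maps each label to its LMDB keys and image count.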
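Training loop:
- The loop runs from epoch 1 through conf.epochs.
- At the start of each epoch, train_sample draws a new training list and ID order. From the second epoch on, it also receives the previous ID order so that the first queue_size IDs of the new epoch do not overlap with the last queue_size IDs of the previous one.
- A DataLoader is built over this list with shuffle=False, which preserves the sampled order, and drop_last=True.
- train_one_epoch, defined above, runs one epoch of training.
- After each epoch, lr_schedule.step() updates the learning rate.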
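
The ID-ordering part of train_sample can be isolated as a small sketch. The function name next_epoch_order and the rng parameter are illustrative and do not appear in the original script. A plausible reason for the constraint, inferred from the code comment rather than stated in the source, is that the queue still holds entries from the end of the previous epoch when a new epoch starts, so the first batches should not contain those same IDs.

```python
import random

def next_epoch_order(class_num, queue_size, last_order=None, rng=random):
    """Return a permutation of the IDs 0..class_num-1.

    If last_order is given, the first queue_size IDs of the result are
    disjoint from the last queue_size IDs of last_order.
    """
    all_ids = list(range(class_num))
    if not last_order:
        rng.shuffle(all_ids)
        return all_ids

    tail = set(last_order[-queue_size:])
    candidates = [i for i in all_ids if i not in tail]
    if len(candidates) < queue_size:
        raise ValueError("not enough non-overlapping IDs to fill the queue")

    head = rng.sample(candidates, queue_size)   # safe IDs go first
    head_set = set(head)
    rest = [i for i in all_ids if i not in head_set]
    rng.shuffle(rest)                           # everything else follows in random order
    return head + rest

rng = random.Random(0)
first = next_epoch_order(10, 3, rng=rng)
second = next_epoch_order(10, 3, first, rng=rng)
print(first[-3:], second[:3])  # the two slices share no ID
```

With the defaults in this script (class_num 72778, queue_size 16384) there are far more than queue_size candidates, so the check never fails.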

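Overall, this code implements a prototype-based training framework with two networks of identical architecture. Only probe_net is updated by the optimizer through the loss; gallery_net receives no gradients and is instead updated from probe_net by moving_average after every step.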
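
The definition of moving_average is not shown in this section. Its two call sites (weight 0 at initialization, conf.alpha = 0.999 during training) are consistent with an exponential moving average of the form gallery = alpha * gallery + (1 - alpha) * probe. The sketch below assumes that form and uses plain Python lists in place of network parameters to stay framework-free; ema_update is a hypothetical name, not the repository's function.

```python
def ema_update(probe_params, gallery_params, alpha):
    """In place: gallery <- alpha * gallery + (1 - alpha) * probe (assumed form)."""
    for i, (p, g) in enumerate(zip(probe_params, gallery_params)):
        gallery_params[i] = alpha * g + (1.0 - alpha) * p

probe = [1.0, -2.0, 0.5]
gallery = [0.0, 0.0, 0.0]

# alpha = 0 discards the old gallery values: gallery becomes a copy of probe.
ema_update(probe, gallery, 0.0)
print(gallery)

# alpha close to 1 lets the gallery follow the probe only slowly.
probe = [2.0, -2.0, 0.5]
ema_update(probe, gallery, 0.999)
print(gallery[0])  # slightly above 1.0
```

Under this assumption, the large alpha keeps gallery_net a slowly changing copy of probe_net, which is why the two networks are similar but not identical during training.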

train_one_epoch runs a single epoch of this training procedure. A detailed walkthrough follows:

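Function definition:
- train_one_epoch takes the data loader (data_loader), the probe network (probe_net), the gallery network (gallery_net), the prototype module (prototype), the optimizer (optimizer), the loss function (criterion), the current epoch number (cur_epoch), and the configuration (conf).
Initialization:
- db_size = len(data_loader) is the number of batches per epoch, not the number of samples.
- check_point_size = (db_size // 2) is half of that; it is not used in the code shown.
- batch_idx = 0 initializes the batch index; the for loop overwrites it.
- initial_lr = get_lr(optimizer) reads the learning rate at the start of the epoch; it is also unused in the code shown.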
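Network and BN modes:
- probe_net.train() puts the probe network in training mode.
- gallery_net.eval().apply(train_BN) puts the gallery network in evaluation mode and then applies train_BN to every submodule. train_BN is not shown in this section; judging by its name, it presumably switches the Batch Normalization layers back to training mode.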
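Iterating over the data:
- A for loop iterates over all batches in the data loader.
- batch_size = images.size(0) is the number of samples in the current batch.
- global_batch_idx = (cur_epoch - 1) * db_size + batch_idx is the batch index counted from the start of training.
Input processing:
- x1, x2 = torch.split(images, [3, 3], dim=1) splits the 6-channel input along the channel dimension into the two 3-channel face images of each ID.
- label = ... computes, for each sample in the batch, its slot index in the prototype queue; the indices advance by batch_size every iteration and wrap around modulo conf.queue_size.
- Each image plays both roles: x1_probe = probe_net(x1) is paired with x2_gallery = gallery_net(x2), and x2_probe = probe_net(x2) with x1_gallery = gallery_net(x1). The gallery forward passes run under torch.no_grad(); their inputs are permuted with shuffle_ids before the forward pass, and the outputs are restored to the original order with reshuffle_ids.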
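Outputs and loss:
- prototype takes the four feature batches and the labels and returns two outputs, output1 and output2. The loss is the sum of the cross-entropy of each output against label.
Backpropagation and parameter update:
- optimizer.zero_grad() clears the gradients, loss.backward() backpropagates, and optimizer.step() updates probe_net. moving_average(probe_net, gallery_net, conf.alpha) then updates gallery_net from probe_net.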
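Logging and checkpointing:
- Every conf.print_freq batches, the loss and learning rate are logged and also written to TensorBoard.
- At the end of the epoch, if cur_epoch is a multiple of conf.save_freq, the state_dict of probe_net is saved to disk; gallery_net is not saved.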
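
The label computation in train_one_epoch is easier to follow on small numbers. The sketch below reproduces the same arithmetic with plain Python lists instead of tensors; queue_slots is an illustrative name that does not appear in the script.

```python
def queue_slots(global_batch_idx, batch_size, queue_size):
    """Slot index in the prototype queue for each sample of a batch."""
    start = global_batch_idx * batch_size
    return [(start + i) % queue_size for i in range(batch_size)]

# A queue of 8 slots filled by batches of 4: every second batch wraps around.
for step in range(3):
    print(step, queue_slots(step, batch_size=4, queue_size=8))
# 0 [0, 1, 2, 3]
# 1 [4, 5, 6, 7]
# 2 [0, 1, 2, 3]
```

With the script's defaults (batch_size 128, queue_size 16384), the queue spans 128 consecutive batches before a slot is reused. Because these slot indices are passed to CrossEntropyLoss as targets, each sample is classified against the queue entry at its own slot.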