1 导引
我们在上一篇博客《学习理论:预测器-拒绝器多分类弃权学习》中介绍了弃权学习的基本概念和方法,其中包括了下列针对多分类问题的单阶段预测器-拒绝器弃权损失\(L_{\text{abst}}\):
\L_{\\text{abst}}(h, r, x, y) = \\underbrace{\\mathbb{I}_{\\text{h}(x) \\neq y}\\mathbb{I}_{r(x) \> 0}}_{\\text{不弃权}} + \\underbrace{c(x) \\mathbb{I}_{r(x)\\leqslant 0}}_{\\text{弃权}} \\
其中\((x, y)\in \mathcal{X}\times \mathcal{Y}\)(标签\(\mathcal{Y} = \{1, \cdots, n\}\)(\(n\geqslant 2\))),\((h, r)\in \mathcal{H}\times\mathcal{R}\)为预测器-拒绝器对(\(\mathcal{H}\)和\(\mathcal{R}\)为两个从\(\mathcal{X}\)到\(\mathbb{R}\)的函数构成的函数族),\(\text{h}(x) = \text{arg max}_{y\in \mathcal{Y}} {h(x)}_y\)直接输出实例\(x\)的预测标签。为了简化讨论,在后文中我们假设\(c\in (0, 1)\)为一个常量花费函数。
设\(\mathcal{l}\)为在标签\(\mathcal{Y}\)上定义的0-1多分类损失的代理损失,则我们可以在此基础上进一步定义弃权代理损失\(L\):
\L(h, r, x, y) = \\mathcal{l}(h, x, y)\\phi(-\\alpha r(x)) + \\psi(c) \\phi(\\beta r(x)) \\
其中\(\psi\)是非递减函数,\(\phi\)是非递增辅助函数(做为\(z \mapsto \mathbb{I}_{z \leqslant 0}\)的上界),\(\alpha\)、\(\beta\)为正常量。下面,为了简便起见,我们主要对\(\phi(z) = \exp(-z)\)进行分析,尽管相似的分析也可以应用于其它函数\(\phi\)。
在上一篇博客中,我们还提到了单阶段代理损失满足的\((\mathcal{H}, \mathcal{R})\)-一致性界:
定理 1 单阶段代理损失的\((\mathcal{H}, \mathcal{R})\) - 一致性界 假设\(\mathcal{H}\)是对称与完备的。则对\(\alpha=\beta\),\(\mathcal{l} = \mathcal{l}{\text{mae}}\),或者\(\mathcal{l} = \mathcal{l}{\rho}\)与\(\psi(z) = z\),或者\(\mathcal{l} = \mathcal{l}_{\rho - \text{hinge}}\)与\(\psi(z) = z\),有下列\((\mathcal{H}, \mathcal{R})\) - 一致性界对\(h\in \mathcal{H}, r\in \mathcal{R}\)和任意分布成立:
\R_{L_{\\text{abst}}}(h, r) - R_{L_{\\text{abst}}}\^{\*}(\\mathcal{H}, \\mathcal{R}) + M_{L_{\\text{abst}}}(\\mathcal{H}, \\mathcal{R}) \\leqslant \\Gamma(R_L(h, r) - R_{L}\^{\*}(\\mathcal{H}, \\mathcal{R}) + M_{L}(\\mathcal{H}, \\mathcal{R})) \\
其中对\(\mathcal{l} = \mathcal{l}{\text{mae}}\)取\(\Gamma (z) = \max\{2n\sqrt{z}, nz\}\);对\(\mathcal{l}=\mathcal{l}{\rho}\)取\(\Gamma (z) = \max\{2\sqrt{z}, z\}\);对\(\mathcal{l} = \mathcal{l}_{\rho - \text{hinge}}\)取\(\Gamma (z) = \max\{2\sqrt{nz}, z\}\)。
不过,在上一篇博客中,我们并没有展示单阶段代理损失的\((\mathcal{H}, \mathcal{R})\)-一致性界的详细证明过程,在这片文章里我们来看该如何对该定理进行证明(正好我导师也让我仔细看看这几篇论文中相关的分析部分,并希望我掌握单阶段方法的证明技术)。
2 一些分析的预备概念
我们假设带标签样本\(S=((x_1, y_1), \cdots, (x_m, y_m))\)独立同分布地采自\(p(x, y)\)。则对于目标损失\(L_{\text{abst}}\)和代理损失\(L\)而言,可分别定义\(L_{\text{abst}}\)-期望弃权损失\(R_{L}(h, r)\)(也即目标损失函数的泛化误差)和\(L\)-期望弃权代理损失\(R_{L}(h, r)\)(也即代理损失函数的泛化误差)如下:
\R_{L_{\\text{abst}}}(h, r) = \\mathbb{E}_{p(x, y)}\\left\[L_{\\text{abst}}(h, r, x, y)\\right, \quad R_{L}(h, r) = \mathbb{E}_{p(x, y)}\leftL(h, r, x, y)\\right \]
设\(R_{{L}^{*}{\text{abst}}}(\mathcal{H}, \mathcal{R}) = \inf{h\in \mathcal{H}, r\in \mathcal{R}}R_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R})\)和\(R_{L}^{*}(\mathcal{H}, \mathcal{R}) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}}R_{L}(\mathcal{H}, \mathcal{R})\)分别为\(R_{L_{\text{abst}}}\)和\(R_L\)在\(\mathcal{H}\times \mathcal{R}\)上的下确界。
为了进一步简化后续的分析,我们根据概率的乘法规则将\(R_L(h, r)\)写为:
\R_{L}(h, r) = \\mathbb{E}_{p(x, y)}\\left\[L(h, r, x, y)\\right = \mathbb{E}{p(x)}\underbrace{\left\\mathbb{E}_{p(y\\mid x)}\\left\[L(h, r, x, y)\\right\right]}{\text{conditional risk }C_L} \]
我们称其中内层的条件期望项为代理损失\(L\)的条件风险(conditional risk) (也称为代理损失\(L\)的pointwise风险),由于在其计算过程中\(y\)取期望取掉了,因此该项只和\(h\)、\(r\)、\(x\)相关,因此我们将其记为\(C_L(h, r, x)\):
\C_L(h, r, x) = \\mathbb{E}_{p(y\\mid x)}\\left\[L(h, r, x, y)\\right = \sum_{y\in \mathcal{Y}}p(y\mid x)L(h, r, x, y) \]
我们用\(C^*L(\mathcal{H}, \mathcal{R}, x) = \inf{h\in \mathcal{H}, r\in \mathcal{R}} C_L(h, r, x)\)来表示假设类最优(best-in-class) 的\(L\)的条件风险。同理,我们用\(C_{L_{\text{abst}}}\)来表示目标损失\(L_{\text{abst}}\)的条件风险,并用\(C^*{L{\text{abst}}}\)来表示假设类最优的\(L_{\text{abst}}\)的条件风险。
根据\(R_{L}^*(h, r)\)和\(C^*_L(\mathcal{H}, \mathcal{R}, x)\),我们可以表示出最小化能力差距(minimizability gap):
\M_L(\\mathcal{H}, \\mathcal{R}) = R_{L}\^\*(\\mathcal{H}, \\mathcal{R}) - \\mathbb{E}_{p(x)}\\left\[C_L\^\*(\\mathcal{H}, \\mathcal{R}, x)\\right \]
\(M_{L_{\text{abst}}}\)的表示同理。
于是,我们可以对要证明的\((\mathcal{H}, \mathcal{R})\)-一致性界进行改写:
\R_{L_{\\text{abst}}}(h, r) - R_{L_{\\text{abst}}}\^{\*}(\\mathcal{H}, \\mathcal{R}) + M_{L_{\\text{abst}}}(\\mathcal{H}, \\mathcal{R}) \\leqslant \\Gamma(R_L(h, r) - R_{L}\^{\*}(\\mathcal{H}, \\mathcal{R}) + M_{L}(\\mathcal{H}, \\mathcal{R}))\\\\ \\Rightarrow R_{L_{\\text{abst}}}(h, r) - \\mathbb{E}_{p(x)}\\left\[C_{L_{\\text{abst}}}\^\*(\\mathcal{H}, \\mathcal{R}, x)\\right \leqslant \Gamma\left(R_{L}(h, r) - \mathbb{E}_{p(x)}\leftC_{L}\^\*(\\mathcal{H}, \\mathcal{R}, x)\\right\right) \]
其中\(R_{L_{\text{abst}}}(h, r)\)和\(R_L(h, r)\)分别为\(\mathbb{E}{p(x)}C{L_{\text{abst}}}(h, r, x)\)和\(\mathbb{E}{p(x)}C{L}(h, r, x)\),于是上述不等式即为
\\\mathbb{E}_{p(x)}\\underbrace{\\left\[C_{L_{\\text{abst}}}(h, r, x) - C_{L_{\\text{abst}}}\^\*(\\mathcal{H}, \\mathcal{R}, x)\\right}{\Delta C{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)} \leqslant \Gamma\left(\mathbb{E}{p(x)}\underbrace{\leftC_{L}(h, r, x) - C_{L}\^\*(\\mathcal{H}, \\mathcal{R}, x)\\right}{\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)}\right) \]
我们将上述不等式两边的被取期望的项简记为\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)和\(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\),其中\(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\)被称为校准差距(calibration gap) 。由于按定义\(\Gamma(\cdot)\)是凹函数,由Jensen不等式有:
\\\mathbb{E}_{p(x)}\\left\[\\Gamma\\left(\\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x)\\right)\\right \leqslant \Gamma\left(\mathbb{E}_{p(x)}\left\\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x)\\right\right) \]
于是,若我们能证明下述不等式,则原不等式得证:
\\\Delta C_{L_\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x) \\leqslant \\Gamma \\left(\\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x)\\right) \\
我们后面将会看到,\((\mathcal{H}, \mathcal{R})\)-一致性界的证明过程中重要的一步即是证明\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)能被\(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)界定。
3 \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)的表示
我们先来看\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = C_{L_{\text{abst}}}(h, r, x) - C^*{L{\text{abst}}}(\mathcal{H}, \mathcal{R}, x)\)如何表示。根据定义,我们有:
\\\begin{aligned} C_{L_{\\text{abst}}}(h, r, x) \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x)L_{\\text{abst}}(h, r, x, y) \\\\ \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\mathbb{I}_{\\text{h}(x) \\neq y}\\mathbb{I}_{r(x) \> 0} + c(x) \\mathbb{I}_{r(x)\\leqslant 0} \\end{aligned} \\
由于是关于\(y\)的条件期望,上式最后一行中只需要对\(\mathbb{I}{\text{h}(x) \neq y}\)进行加权求和即可。为了进一步对\(C{L_{\text{abst}}}(h, r, x)\)进行表示,我们需要对\(r(x)\)的正负情况进行分类讨论:
- \(r(x) > 0\):此时\(C_{L_{\text{abst}}}(h, r, x) = \sum_{y\in \mathcal{Y}}p(y\mid x) \mathbb{I}_{\text{h}(x) \neq y} = 1 - p(\text{h}(x)\mid x)\)。
- \(r(x) \leqslant 0\):此时\(C_{L_{\text{abst}}}(h, r, x) = c\)。
接下来我们来看\(C^*{L{\text{abst}}}\)如何表示。我们假设拒绝函数集\(\mathcal{R}\)是完备的(也即对任意\(x\in \mathcal{X}, \{r(x): r\in \mathcal{R}\} = \mathbb{R}\)),那么\(\mathcal{R}\)也是弃权正规的(也即使得对任意\(x\in \mathcal{X}\),存在\(r_1, r_2\in \mathcal{R}\)满足\(r_1(x) > 0\)与\(r_2(x) \leqslant 0\))。于是我们有
\\\begin{aligned} C\^\*_{L_{\\text{abst}}}(\\mathcal{H}, \\mathcal{R}, x) \&= \\inf_{h\\in \\mathcal{H}, r\\in \\mathcal{R}}C_{L_{\\text{abst}}}(h, r, x)\\\\ \& = \\min \\left\\{\\min_{h\\in \\mathcal{H}}\\left(1 - p\\left( \\text{h}(x)\\mid x\\right)\\right), c\\right\\}\\\\ \& = 1 - \\max\\left\\{\\max_{h\\in \\mathcal{H}}p\\left(\\text{h}(x)\\mid x\\right), 1 - c\\right\\} \\end{aligned} \\
我们假设\(\mathcal{H}\)是对称的且完备的(具体定义参见博客《学习理论:预测器-拒绝器多分类弃权学习》),则我们有\(\{\text{h}(x): h\in \mathcal{H}\} = \mathcal{Y}\),于是
\C\^\*_{L_{\\text{abst}}}(\\mathcal{H}, \\mathcal{R}, x) = 1 - \\max\\left\\{\\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right), 1 - c\\right\\} \\
为了进一步对\(C^*{L{\text{abst}}}(\mathcal{H}, \mathcal{y}, x)\)进行表示,我们需要对\(\max_{y\in \mathcal{Y}}p(y\mid x)\)和\((1 - c)\)的大小比较情况进行分类讨论:
- \(\max_{y\in \mathcal{Y}}p(y\mid x) > 1 - c\):此时\(C^*{L{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = 1 - \max_{y\in \mathcal{Y}}p(y\mid x)\)。
- \(\max_{y\in \mathcal{Y}}p(y\mid x) \leqslant 1 - c\):此时\(C^*{L{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = c\)。
于是,我们有:
\\\begin{aligned} \\Delta C_{L_\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x) \&= C_{L_{\\text{abst}}}(h, r, x) - C\^\*_{L_{\\text{abst}}}(\\mathcal{H}, \\mathcal{R}, x) \\\\ \& = \\left\\{\\begin{aligned} \&\\max_{y\\in \\mathcal{Y}}p(y\\mid x) - p(\\text{h}(x)\\mid x)\\quad \&\\text{if } \\max_{y\\in \\mathcal{Y}} p(y\\mid x) \> (1 - c),r(x) \> 0 \\\\ \&1 - c - p(\\text{h}(x)\\mid x) \\quad \&\\text{if } \\max_{y\\in \\mathcal{Y}} p(y\\mid x) \\leqslant (1 - c),r(x) \> 0 \\\\ \&0 \\quad \&\\text{if } \\max_{y\\in \\mathcal{Y}} p(y\\mid x) \\leqslant (1 - c),r(x) \\leqslant 0 \\\\ \&\\max_{y\\in \\mathcal{Y}}p(y\\mid x) - 1 + c \\quad \&\\text{if } \\max_{y\\in \\mathcal{Y}} p(y\\mid x) \> (1 - c),r(x) \\leqslant 0 \\\\ \\end{aligned}\\right. \\end{aligned} \\
4 \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\)的表示
4.1 分类讨论的准备
接下来我们来看\(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) = C_L(h, r, x) - C^*_L(\mathcal{H}, \mathcal{R}, x)\)如何表示。根据定义,若\(\alpha = \beta\),\(\phi(z) = \exp(-z)\),我们有:
\\\begin{aligned} C_L(h, r, x) \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x)L(h, r, x, y) \\\\ \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\mathcal{l}(h, x, y)e\^{\\alpha r(x)} + \\psi(c) e\^{-\\alpha r(x)} \\end{aligned} \\
由于是关于\(y\)的条件期望,上式最后一行中只需要对\(\mathcal{l}(h, x, y)\)进行加权求和即可。在后文中我们将会针对下列三种不同的\(\mathcal{l}\)函数以及\(\psi(z)\)的选择情况来分别对\(C_L(h, r, x)\)进行讨论:
- \(\mathcal{l} = \mathcal{l}_{\text{mae}}\),\(\psi(z) = z\);
- \(\mathcal{l} = \mathcal{l}_{\rho}\),\(\psi(z) = z\);
- \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\),\(\psi(z) = nz\)。
注 这三种不同\(\mathcal{l}\)的定义参见博客《学习理论:预测器-拒绝器多分类弃权学习》),我在这里把它们的定义贴一下:
- 平均绝对误差损失:\(\mathcal{l}{\text{mae}}(h, x, y) = 1 - \frac{e^{{h(x)}y}}{\sum{y^{\prime}\in \mathcal{Y}}e^{{h(x)}{y^{\prime}}}}\);
- 约束\(\rho\)-合页损失:\(\mathcal{l}{\rho-\text{hinge}}(h, x, y) = \sum{y^{\prime}\neq y}\phi_{\rho-\text{hinge}}(-{h(x)}{y^{\prime}}), \rho > 0\),其中\(\phi{\rho-\text{hinge}}(z) = \max\{0, 1 - \frac{z}{\rho}\}\)为\(\rho\)-合页损失,且约束条件\(\sum_{y\in \mathcal{Y}}{h(x)}_y=0\)。
- \(\rho\)-间隔损失:\(\mathcal{l}{\rho}(h, x, y) = \phi{\rho}({\rho_h (x, y)})\),其中\(\rho_{h}(x, y) = h(x)y - \max{y^{\prime} \neq y}h(x){y^{\prime}}\)是置信度间隔,\(\phi{\rho}(z) = \min\{\max\{0, 1 - \frac{z}{\rho}\}, 1\}, \rho > 0\)为\(\rho\)-间隔损失。
4.2 \(\mathcal{l} = \mathcal{l}_{\text{mae}}\),\(\psi(z) = z\)
在这种情况下\(C_L(h, r, x)\)可以表示为:
\\\begin{aligned} C_L(h, r, x) \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\underbrace{\\left(1 - \\frac{e\^{{h(x)}_y}}{\\sum_{y\^{\\prime}\\in \\mathcal{Y}}e\^{{h(x)}_{y\^{\\prime}}}}\\right)}_{\\mathcal{l}_{\\text{mae}}}e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} \\\\ \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} \\end{aligned} \\
其中\(s_h(x, y) = \frac{e^{{h(x)}y}}{\sum{y^{\prime}\in \mathcal{Y}}e^{{h(x)}_{y^{\prime}}}}\)。
于是
\\\begin{aligned} C_L\^\*(\\mathcal{H}, \\mathcal{R}, x) \&= \\inf_{h\\in \\mathcal{H}, r\\in\\mathcal{R}} \\left\\{\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)}\\right\\} \\\\ \&= \\inf_{r\\in\\mathcal{R}} \\left\\{\\inf_{h\\in \\mathcal{H}}\\left\\{\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)\\right\\}e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)}\\right\\} \\end{aligned} \\
由于假设了\(\mathcal{H}\)是对称的与完备的,我们有
\\\begin{aligned} \&\\inf_{h\\in \\mathcal{H}}\\left\\{\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y\\right))\\right\\} \\\\ \&= 1 - \\sup_{h\\in \\mathcal{H}}\\sum_{y\\in \\mathcal{Y}}p(y\\mid x)s_h(x, y) \\\\ \&= 1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\quad \\left(s_h(x, y)\\in (0, 1)\\right) \\end{aligned} \\
注 实际上,对任意\(h\in \mathcal{H}\),有:
\\\begin{aligned} \&\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right) - \\left(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right)\\right) \\\\ \&= \\max_{y\\in \\mathcal{Y}} p(y\\mid x) - \\sum_{y\\in \\mathcal{Y}}p(y\\mid x)s_h(x, y) \\\\ \&= \\max_{y\\in \\mathcal{Y}} p(y\\mid x) - \\left(p\\left(\\text{h}(x)\\mid x\\right)s_h\\left(x, \\text{h}(x)\\right) + \\sum_{y\\neq \\text{h}(x)}p(y\\mid x)s_h(x, y)\\right) \\\\ \&\\geqslant \\max_{y\\in \\mathcal{Y}} p(y\\mid x) - \\left(p\\left(\\text{h}(x)\\mid x\\right)s_h\\left(x, \\text{h}(x)\\right) + \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\left(1 - s_h\\left(x, \\text{h}(x)\\right)\\right)\\right) \\\\ \&= s_h\\left(x, \\text{h}(x)\\right)\\left(\\max_{y\\in \\mathcal{Y}}p(y\\mid x) - p\\left(\\text{h}(x)\\mid x\\right)\\right) \\\\ \&\\geqslant \\frac{1}{n} \\left(\\max_{y\\in \\mathcal{Y}}p(y\\mid x) - p\\left(\\text{h}(x)\\mid x\\right)\\right) \\end{aligned} \\
这个结论我们会在后面的证明中多次用到。该结论的一个推论是如果分类器\(h^*\)为贝叶斯最优分类器(也即\(p(\text{h}^*(x)\mid x) = \max_{y\in \mathcal{Y}} p(y\mid x)\)),则\(\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right) \geqslant 0\),可直观地将其理解为\(\mathbb{E}_{p(y\mid x)}\left\\mathcal{l}_{\\text{mae}}\\right\)更可能接近其下确界。
于是
\C_L\^\*(\\mathcal{H}, \\mathcal{R}, x) = \\inf_{r\\in\\mathcal{R}} \\left\\{\\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)}\\right\\} \\
记上式中需要求极值的部分为泛函\(F(r)\),则其泛函导数为
\\\frac{\\delta F}{\\delta r(x)} = \\alpha \\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)e\^{\\alpha r(x)} - c\\alpha e\^{-\\alpha r(x)} \\
令\(\frac{\delta F}{\delta p(x)} = 0\)(对\(\forall x\in \mathcal{X}\)),解得\(r^*(x) = -\frac{1}{2\alpha}\log \left(\frac{1 - \max_{y\in \mathcal{Y}}p(y\mid x)}{c}\right)\)。将其代入\(F(r)\)可得:
\C_L\^\*(\\mathcal{H}, \\mathcal{R}, x) = 2\\sqrt{c(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x))} \\
于是
\\\begin{aligned} \\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \&= C_L(h, r, x) - C\^\*_L(\\mathcal{H}, \\mathcal{R}, x) \\\\ \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - 2\\sqrt{c(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x))} \\end{aligned} \\
为了构建\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)和\(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)的不等式关系,接下来我们将会采用第3节中类似的做法,针对\(\max_{y\in \mathcal{Y}} p(y\mid x)\)与\(1 - c\)的大小比较情况与\(r(x)\)的正负情况来对\(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\)进行分类讨论:
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\),\(r(x) > 0\):
此时
\\\begin{aligned} \\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - 2\\sqrt{c\\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)} \\\\ \& \\geqslant \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - \\left(c + \\underbrace{\\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)}_{\
(其中\(\text{AM-GM inequality}\)为算术-几何平均值不等式)
取\(\Gamma (z) = nz\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)得证。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\),\(r(x) > 0\):
\ \\begin{aligned} \\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - 2\\sqrt{c\\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)} \\\\ \& \\geqslant \\underbrace{\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h\\left(x, y\\right)\\right)}_{\\geqslant c}e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - 2\\sqrt{c\\left(\\sum_{y\\in \\mathcal{Y}}p(y\\mid x)\\left(1 - s_h(x, y)\\right)\\right)} \\\\ \& \\geqslant \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right) + c - 2\\sqrt{c\\left(\\sum_{y\\in \\mathcal{Y}}p(y\\mid x)\\left(1 - s_h(x, y)\\right)\\right)} \\\\ \&= \\left(\\sqrt{\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)} - \\sqrt{c}\\right)\^2 \\\\ \&= \\left(\\frac{\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right) - c}{\\sqrt{\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)} + \\sqrt{c}}\\right)\^2 \\\\ \&\\geqslant \\left(\\frac{\\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right) - \\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right) + \\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x) - c\\right)}{2}\\right)\^2 \\\\ \&\\geqslant \\left(\\frac{\\frac{1}{n} \\left(\\max_{y\\in \\mathcal{Y}}p(y\\mid x) - p\\left(\\text{h}(x)\\mid x\\right)\\right) + \\frac{1}{n}\\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x) - c\\right)}{2}\\right)\^2 \\\\ \&= \\frac{1}{4n\^2}\\left(1 - c - p\\left(\\text{h}(x)\\mid x\\right)\\right)\^2 \\\\ \&= \\frac{\\Delta C_{\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x)\^2}{4n\^2} \\end{aligned} \\
取\(\Gamma (z) = 2n\sqrt{z}\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)得证。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\),\(r(x) \leqslant 0\):
由于此时\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\),因此\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma\left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)对任意\(\Gamma \geqslant 0\)成立。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\),\(r(x) \leqslant 0\):
\ \\begin{aligned} \\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\left(1 - s_h(x, y)\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - 2\\sqrt{c\\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)} \\\\ \&\\geqslant \\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)\\underbrace{e\^{\\alpha r(x)}}_{\\leqslant 1} + c \\underbrace{e\^{-\\alpha r(x)}}_{\\geqslant 1} - 2\\sqrt{c\\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)} \\\\ \&\\geqslant 1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x) + c - 2\\sqrt{c\\left(1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)\\right)} \\\\ \&= \\left(\\sqrt{1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)} - \\sqrt{c}\\right)\^2 \\\\ \&= \\left(\\frac{1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x) - c}{\\sqrt{1 - \\max_{y\\in \\mathcal{Y}}p(y\\mid x)} + \\sqrt{c}}\\right)\^2 \\\\ \&\\geqslant \\left(\\frac{\\max_{y\\in \\mathcal{Y}}p(y\\mid x) - 1 + c}{2}\\right)\^2 \\\\ \&= \\frac{\\Delta C_{\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x)\^2}{4} \\end{aligned} \\
取\(\Gamma (z) = 2\sqrt{z}\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)得证。
综上所述,若取\(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2n\sqrt{z}, nz\}\),则恒有\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)。于是\(\mathcal{l} = \mathcal{l}_{\text{mae}}\),\(\psi(z) = z\)时单阶段代理损失的\((\mathcal{H}, \mathcal{R})\)-一致性界得证。
4.3 \(\mathcal{l} = \mathcal{l}_{\rho}\),\(\psi(z) = z\)
在这种情况下\(C_L(h, r, x)\)可以表示为:
\\\begin{aligned} C_L(h, r, x) \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\underbrace{\\min\\left\\{\\max\\left\\{0, 1 - \\frac{\\rho_h(x, y)}{\\rho}\\right\\}, 1\\right\\}}_{\\mathcal{l}_{\\rho}}e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} \\\\ \&= \\left(1 - \\sum_{y\\in \\mathcal{Y}} p(y\\mid x)\\max\\left\\{\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\}, 0\\right\\}\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} \\\\ \&= \\left(1 - \\sum_{y\\in \\mathcal{Y}} p(y\\mid x)\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\}\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} \\end{aligned} \\
其中\(\rho_h(x, y) = h(x)y - \max{y^{\prime}\neq y}h(x)_{y^{\prime}}\)为间隔。
由于假设了\(\mathcal{H}\)是对称的与完备的,我们有
\\\begin{aligned} \&\\inf_{h\\in \\mathcal{H}}\\left\\{1 - \\sum_{y\\in \\mathcal{Y}} p (y\\mid x)\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\}\\right\\} \\\\ \&= 1 - \\sup_{h\\in \\mathcal{H}}\\sum_{y\\in \\mathcal{Y}}p(y\\mid x)\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\} \\\\ \&= 1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right)\\quad (\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\}\\in \[0, 1) \end{aligned} \]
注 实际上,对任意\(h\in \mathcal{H}\),有:
\\\begin{aligned} \&\\left(1 - \\sum_{y\\in \\mathcal{Y}} p (y\\mid x)\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\}\\right) - \\left(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right)\\right) \\\\ \&= \\max_{y\\in \\mathcal{Y}} p(y\\mid x) - \\sum_{y\\in \\mathcal{Y}} p (y\\mid x)\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\} \\\\ \&= \\max_{y\\in \\mathcal{Y}} p(y\\mid x) - \\min \\left\\{1, \\frac{\\rho_h\\left(x, \\text{h}(x)\\right)}{\\rho}\\right\\}p\\left(\\text{h}(x)\\mid x\\right) \\\\ \&\\geqslant \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right) - p\\left(\\text{h}(x)\\mid x\\right) \\end{aligned} \\
和之前\(\mathcal{l}_{\text{mae}}\)的证明类似,这个结论我们会在后面的证明中多次用到。
于是和之前\(\mathcal{l}_{\text{mae}}\)类似,我们有
\C_L\^\*(\\mathcal{H}, \\mathcal{R}, x) = 2\\sqrt{c(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right))} \\
于是
\\\begin{aligned} \\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \&= C_L(h, r, x) - C\^\*_L(\\mathcal{H}, \\mathcal{R}, x) \\\\ \&= \\left(1 - \\sum_{y\\in \\mathcal{Y}} p (y\\mid x)\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\}\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - 2\\sqrt{c(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right))} \\end{aligned} \\
为了构建\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)和\(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)的不等式关系,接下来我们将会采用\(\mathcal{l}{\text{mae}}\)的证明中类似的做法,针对\(\max{y\in \mathcal{Y}} p(y\mid x)\)与\(1 - c\)的大小比较情况与\(r(x)\)的正负情况来对\(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\)进行分类讨论:
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\),\(r(x) > 0\):
此时
\\\begin{aligned} \\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \&= \\left(1 - \\sum_{y\\in \\mathcal{Y}} p (y\\mid x)\\min\\left\\{1, \\frac{\\rho_h(x, y)}{\\rho}\\right\\}\\right)e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - 2\\sqrt{c\\left(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right)\\right)} \\\\ \&\\geqslant \\frac{1}{4}\\left(1 - c - p\\left(\\text{h}\\left(x\\right)\\mid x\\right)\\right)\^2 \\\\ \&= \\frac{\\Delta C_{\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x)\^2}{4} \\end{aligned} \\
(由于证明步骤与\(\mathcal{l}_{\text{mae}}\)类似,这里对证明步骤进行了一些精简,下面同理)
取\(\Gamma_2 (z) = 2\sqrt{z}\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)得证。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\),\(r(x) > 0\):
\\\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \\geqslant \\max_{y\\in \\mathcal{Y}}p(y\\mid x) - p(\\text{h}(x)\\mid x) = \\Delta C_{\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x) \\
取\(\Gamma_1 (z) = z\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)得证。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\),\(r(x) \leqslant 0\):
由于此时\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\),因此\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)对任意\(\Gamma \geqslant 0\)成立。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\),\(r(x) \leqslant 0\):
\\\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \\geqslant \\left(\\frac{\\max_{y\\in \\mathcal{Y}}p(y\\mid x) - 1 + c}{2}\\right)\^2 = \\frac{\\Delta C_{\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x)\^2}{4} \\
取\(\Gamma_3 (z) = 2\sqrt{z}\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)得证。
综上所述,若取\(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2\sqrt{z}, z\}\),则恒有\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)。于是\(\mathcal{l} = \mathcal{l}_{\rho}\),\(\psi(z) = z\)时单阶段代理损失的\((\mathcal{H}, \mathcal{R})\)-一致性界得证。
4.4 \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\),\(\psi(z) = nz\)
在这种情况下\(C_L(h, r, x)\)可以表示为:
\\\begin{aligned} C_L(h, r, x) \&= \\sum_{y\\in \\mathcal{Y}}p(y\\mid x) \\underbrace{\\sum_{y\^{\\prime} \\neq y}\\max\\left\\{0, 1 + \\frac{h(x)_{y\^{\\prime}}}{\\rho}\\right\\}}_{\\mathcal{l}_{\\rho}-\\text{hinge}}e\^{\\alpha r(x)} + nce\^{-\\alpha r(x)} \\\\ \&= \\sum_{y\\in \\mathcal{Y}}\\left(1 - p(y\\mid x)\\right)\\max\\left\\{0, 1 + \\frac{h(x)_y}{\\rho}\\right\\}e\^{\\alpha r(x)} + nce\^{-\\alpha r(x)} \\\\ \\end{aligned} \\
由于假设了\(\mathcal{H}\)是对称的与完备的,我们有
\\\begin{aligned} \&\\inf_{h\\in \\mathcal{H}}\\left\\{\\sum_{y\\in \\mathcal{Y}}\\left(1 - p(y\\mid x)\\right)\\max\\left\\{0, 1 + \\frac{h(x)_y}{\\rho}\\right\\}\\right\\} \\\\ \&= n - \\sup_{h\\in \\mathcal{H}}\\sum_{y\\in \\mathcal{Y}}p(y\\mid x)\\max\\left\\{0, 1 + \\frac{h(x)_y}{\\rho}\\right\\} \\\\ \&= n\\left(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right)\\right) \\end{aligned} \\
注 实际上,若取\(h_{\rho}\)使得\(h_{\rho}(x)y = \left\{\begin{aligned} &h(x)y\quad &\text{if } y\notin \left\{y{\max}, \text{h}(x)\right\} \\ &-\rho \quad &\text{if } y = \text{h}(x) \\ &h\left(x\right){y_{\text{max}}} + h\left(x\right){\text{h}(x)} + \rho \quad &\text{if } y = y{\text{max}} \\ \end{aligned}\right.\)满足约束\(\sum_{y\in \mathcal{Y}}h_{\rho}(y\mid x)=0\),其中\(y_{\max} = \text{arg max}_{y\in \mathcal{Y}}p(y\mid x)\),则对任意\(h\in \mathcal{H}\)有:
\\\begin{aligned} \&\\sum_{y\\in \\mathcal{Y}}\\left(1 - p(y\\mid x)\\right)\\max\\left\\{0, 1 + \\frac{h(x)_y}{\\rho}\\right\\} - n\\left(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right)\\right) \\\\ \&\\geqslant \\sum_{y\\in \\mathcal{Y}}\\left(1 - p(y\\mid x)\\right)\\min\\left\\{n, \\max\\left\\{0, 1 + \\frac{h(x)_y}{\\rho}\\right\\}\\right\\} - n\\left(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right)\\right) \\\\ \&\\geqslant \\sum_{y\\in \\mathcal{Y}}\\left(1 - p(y\\mid x)\\right)\\min\\left\\{n, \\max\\left\\{0, 1 + \\frac{h(x)_y}{\\rho}\\right\\}\\right\\} \\\\ \&\\quad - \\sum_{y\\in \\mathcal{Y}}\\left(1 - p(y\\mid x)\\right)\\min\\left\\{n, \\max\\left\\{0, 1 + \\frac{h_{\\rho}(x)_y}{\\rho}\\right\\}\\right\\} \\\\ \&= \\left(p(y_{\\text{max}}\\mid x) - p(\\text{h}(x)\\mid x)\\right)\\min\\left\\{n, 1 + \\frac{h(x)_{\\text{h}(x)}}{\\rho}\\right\\} \\\\ \&\\geqslant \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right) - p\\left(\\text{h}\\left(x\\right)\\mid x\\right) \\end{aligned} \\
和之前\(\mathcal{l}{mae}\)、\(\mathcal{l}{\rho}\)的证明类似,这个结论我们会在后面的证明中多次用到。
于是和之前\(\mathcal{l}{mae}\)、\(\mathcal{l}{\rho}\)类似,我们有
\C_L\^\*(\\mathcal{H}, \\mathcal{R}, x) = 2\\sqrt{n\^2c(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right))} \\
于是
\\\begin{aligned} \\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \&= C_L(h, r, x) - C\^\*_L(\\mathcal{H}, \\mathcal{R}, x) \\\\ \&= \\sum_{y\\in \\mathcal{Y}}\\left(1 - p(y\\mid x)\\right)\\max\\left\\{0, 1 + \\frac{h(x)_y}{\\rho}\\right\\}e\^{\\alpha r(x)} + c e\^{-\\alpha r(x)} - 2\\sqrt{c(1 - \\max_{y\\in \\mathcal{Y}}p\\left(y\\mid x\\right))} \\end{aligned} \\
为了构建\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)和\(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\)的不等式关系,接下来我们将会采用\(\mathcal{l}{\text{mae}}\)、\(\mathcal{l}{\rho}\)的证明中类似的做法,针对\(\max_{y\in \mathcal{Y}} p(y\mid x)\)与\(1 - c\)的大小比较情况与\(r(x)\)的正负情况来对\(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\)进行分类讨论:
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\),\(r(x) > 0\):
此时
\\\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \\geqslant \\max_{y\\in \\mathcal{Y}}p(y\\mid x) - p(\\text{h}(x)\\mid x) = \\Delta C_{\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x)\^2 \\
取\(\Gamma_1 (z) = z\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)得证。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\),\(r(x) > 0\):
\\\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \\geqslant \\frac{1}{4n}\\left(1 - c - p\\left(\\text{h}\\left(x\\right)\\mid x\\right)\\right)\^2 = \\frac{\\Delta C_{\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x)\^2}{4n} \\
取\(\Gamma_1 (z) = 2\sqrt{nz}\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)得证。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\),\(r(x) \leqslant 0\):
由于此时\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\),因此\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)对任意\(\Gamma \geqslant 0\)成立。
-
\(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\),\(r(x) \leqslant 0\):
\\\Delta C_{L, \\mathcal{H}, \\mathcal{R}}(h, r, x) \\geqslant n\\left(\\frac{\\max_{y\\in \\mathcal{Y}}p(y\\mid x) - 1 + c}{2}\\right)\^2 = \\frac{n\\Delta C_{\\text{abst}, \\mathcal{H}, \\mathcal{R}}(h, r, x)\^2}{4} \\
取\(\Gamma_3 (z) = 2\sqrt{z/n}\),于是\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)得证。
综上所述,若取\(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2\sqrt{nz}, z\}\),则恒有\(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\)。于是\(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\),\(\psi(z) = nz\)时单阶段代理损失的\((\mathcal{H}, \mathcal{R})\)-一致性界得证。
参考
- 1 Mao A, Mohri M, Zhong Y. Predictor-rejector multi-class abstention: Theoretical analysis and algorithmsC//International Conference on Algorithmic Learning Theory. PMLR, 2024: 822-867.
- 2 Cortes C, DeSalvo G, Mohri M. Boosting with abstentionJ. Advances in Neural Information Processing Systems, 2016, 29.
- 3 Ni C, Charoenphakdee N, Honda J, et al. On the calibration of multiclass classification with rejectionJ. Advances in Neural Information Processing Systems, 2019, 32.
- 4 Han Bao: Learning Theory Bridges Loss Functions
- 5 Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machinesJ. Journal of machine learning research, 2001, 2(Dec): 265-292.
- 6 Awasthi P, Mao A, Mohri M, et al. Multi-Class H -Consistency BoundsJ. Advances in neural information processing systems, 2022, 35: 782-795.