【Machine Learning】Generalization Theory

These notes are based on the generalization-theory part of the lecture notes for Tsinghua University's Machine Learning course; they were written essentially as a cheat sheet a day or two before the exam. They cover a lot of material without much detail, and are mainly intended for review and memorization.

No free lunch

  • For any algorithm $A'$, there exists a distribution $D$ over $C\times\{0,1\}$ and a labeling function $f$ that answers $D$ perfectly, i.e. $L_D(f)=0$, such that

    $$E_{S\sim D^m}[L_D(A'(S))]\ge \frac{1}{4}$$

  • Consequently, $\Pr\left[L_D(A'(S))\ge \frac{1}{8}\right]\ge \frac{1}{7}$
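    • This step is not spelled out in the notes; it is the standard reverse-Markov argument, using only that the 0-1 loss lies in $[0,1]$. For $X=L_D(A'(S))$ with $E[X]\ge\frac{1}{4}$:

      $$E[X]\le \frac{1}{8}\Pr\left[X<\frac{1}{8}\right]+1\cdot\Pr\left[X\ge\frac{1}{8}\right]\ \Longrightarrow\ \Pr\left[X\ge\frac{1}{8}\right]\ge \frac{E[X]-\frac{1}{8}}{1-\frac{1}{8}}\ge\frac{\frac{1}{4}-\frac{1}{8}}{\frac{7}{8}}=\frac{1}{7}$$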

  • Proof: take $|C|=2m$; let $f_1,\dots,f_T$ with $T=2^{2m}$ be all possible labelings of $C$, let $D_j$ be uniform over $C$ labeled by $f_j$, and let $S_1,\dots,S_k$ with $k=(2m)^m$ be all possible sequences of $m$ training points (labeled by the current $f_j$); $v_1,\dots,v_p$ with $p\ge m$ denote the points of $C$ missing from $S_i$. Then

    $$\begin{align*} \max_{j}E_{S\sim D_j^m}[L_{D_j}(A'(S))]&=\max_{j}\frac{1}{k}\sum_{i=1}^k L_{D_j}(A'(S_i))\\ &\ge \frac{1}{T}\sum_{j=1}^T\frac{1}{k}\sum_{i=1}^k L_{D_j}(A'(S_i))\\ &= \frac{1}{k}\sum_{i=1}^k\frac{1}{T}\sum_{j=1}^T L_{D_j}(A'(S_i))\\ &\ge \min_{i}\frac{1}{T}\sum_{j=1}^T L_{D_j}(A'(S_i))\\ &\ge \min_{i}\frac{1}{T}\sum_{j=1}^T \frac{1}{2m}\sum_{r=1}^p 1_{A'(S_i)\text{ wrong at }v_r}\\ &\ge \min_{i}\frac{1}{T}\sum_{j=1}^T \frac{1}{2p}\sum_{r=1}^p 1_{A'(S_i)\text{ wrong at }v_r}\\ &\ge \frac{1}{2}\min_{i}\min_{r}\frac{1}{T}\sum_{j=1}^T 1_{A'(S_i)\text{ wrong at }v_r}\\ &\ge \frac{1}{4} \end{align*}$$

    • The last inequality holds because the $T$ labelings can be split into $T/2$ pairs: each pair $f_j,f_{j'}$ differs only at $v_r$, so both induce the same training set, $A'$ makes the same prediction at $v_r$ for both, and that prediction is wrong for exactly one of the pair; hence $\frac{1}{T}\sum_j 1_{A'(S_i)\text{ wrong at }v_r}=\frac{1}{2}$. A toy numerical check of the theorem follows below.
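A minimal sketch, not from the course notes: exhaustively verify $\max_j E_{S\sim D_j^m}[L_{D_j}(A'(S))]\ge\frac{1}{4}$ on a domain of size $2m$ for one concrete algorithm $A'$. The learner `memorize_then_guess`, the value of $m$, and all names are illustrative choices.

```python
import itertools

m = 2
C = list(range(2 * m))                                       # domain of size 2m
labelings = list(itertools.product([0, 1], repeat=len(C)))   # all f: C -> {0,1}, T = 2^(2m)
samples = list(itertools.product(C, repeat=m))               # all m-point training sequences, k = (2m)^m

def memorize_then_guess(train):
    """A': memorize the labels it has seen, predict 0 everywhere else."""
    seen = dict(train)
    return lambda x: seen.get(x, 0)

worst = 0.0
for f in labelings:                # D_j: uniform over C, labeled by f
    avg_loss = 0.0
    for xs in samples:             # E over S ~ D_j^m: every sequence is equally likely
        h = memorize_then_guess([(x, f[x]) for x in xs])
        avg_loss += sum(h(x) != f[x] for x in C) / len(C)    # true error of h under D_j
    worst = max(worst, avg_loss / len(samples))

print(f"max_j E[L_Dj(A'(S))] = {worst:.3f}  (NFL bound: 0.25)")
```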

ERM

  • Under the realizability assumption, the hypothesis returned by ERM over a finite class is good with high probability once there are enough samples

    • Consider the probability of a misleading sample: $L_S(h_S)=L_S(h^*)=0$ but $L_{D,f}(h_S)>\epsilon$. Such an $S$ must lie in the union (apply the union bound) of the misleading sets $\{S: L_S(h)=0\}$ over the bad hypotheses $h\in H_B$; for a bad $h$, each sample point is consistent with $h$ with probability $\le 1-\epsilon$, so the total probability is $\le |H_B|(1-\epsilon)^m\le |H|e^{-\epsilon m}$, giving sample complexity $m\ge \frac{\ln(|H|/\delta)}{\epsilon}$ (see the sketch after this list).
  • PAC learnable: for sample size $m\ge m(\epsilon,\delta)$, w.p. at least $1-\delta$ we can find an $h$ such that $L_{D,f}(h)\le \epsilon$.

    • Agnostic PAC learnable: $L_{D}(h)\le L_{D}(h^*)+\epsilon$
  • VC dimension: the size of the largest set that $H$ shatters; by the fundamental theorem of statistical learning, finite VC dimension is equivalent to (agnostic) PAC learnability
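A minimal numerical sketch of the realizable ERM bound above; the function name and the example numbers are illustrative.

```python
# Realizable ERM over a finite class H: samples sufficient so that
# w.p. >= 1 - delta the ERM hypothesis has true error <= epsilon,
# from |H_B| (1 - eps)^m <= |H| exp(-eps * m) <= delta.
import math

def erm_sample_complexity(H_size: int, eps: float, delta: float) -> int:
    return math.ceil(math.log(H_size / delta) / eps)

# e.g. |H| = 1000, eps = 0.05, delta = 0.01  ->  231 samples
print(erm_sample_complexity(1000, 0.05, 0.01))
```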

Rademacher

  • Generalization: w.p. at least $1-\delta$ over $S\sim D^m$, for every $h\in H$,

    $$L_D(h)-L_S(h)\le 2E_{S'\sim D^m}R(\ell\circ H\circ S')+c\sqrt{\frac{2\ln\frac{2}{\delta}}{m}}$$
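A minimal sketch, not from the notes: Monte Carlo estimate of the empirical Rademacher complexity $R(\ell\circ H\circ S)$ for a concrete finite class, thresholds $h_t(x)=1[x\ge t]$ with 0-1 loss. The grid size, sample size, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
xs = rng.uniform(0, 1, size=m)             # sample points
ys = (xs >= 0.5).astype(float)             # labels from a true threshold at 0.5

thresholds = np.linspace(0, 1, 51)         # finite grid standing in for H
preds = (xs[None, :] >= thresholds[:, None]).astype(float)
losses = (preds != ys[None, :]).astype(float)       # row t: 0-1 losses of h_t, shape (|H|, m)

sigma = rng.choice([-1.0, 1.0], size=(20000, m))    # Rademacher sign vectors
R_hat = (sigma @ losses.T).max(axis=1).mean() / m   # E_sigma sup_h (1/m) <sigma, l_h>
print(f"estimated R(l o H o S) ~ {R_hat:.4f}")
```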

  • Massart lemma: for a finite $A=\{a_1,\dots,a_N\}\subset\mathbb{R}^m$,

    $$R(A)\le \max_{a\in A}\|a-\bar{a}\|\,\frac{\sqrt{2\log N}}{m}$$
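A minimal sketch with toy data (all sizes illustrative): compare a Monte Carlo estimate of $R(A)$ against the Massart bound; the estimate should come out below the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 50, 200
A = rng.uniform(0, 1, size=(N, m))         # N vectors in [0,1]^m

# Monte Carlo estimate of R(A) = E_sigma[ sup_{a in A} (1/m) <sigma, a> ]
sigma = rng.choice([-1.0, 1.0], size=(20000, m))
estimate = (sigma @ A.T).max(axis=1).mean() / m

# Massart bound: max_a ||a - a_bar||_2 * sqrt(2 log N) / m
a_bar = A.mean(axis=0)
bound = np.linalg.norm(A - a_bar, axis=1).max() * np.sqrt(2 * np.log(N)) / m
print(f"MC estimate {estimate:.4f} <= Massart bound {bound:.4f}")
```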

  • Contraction lemma: if $\phi$ is $\rho$-Lipschitz, then

    $$R(\phi\circ A)\le \rho R(A)$$
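A minimal sketch with toy data (sizes illustrative): numerically sanity-check the contraction lemma with the 1-Lipschitz $\phi=\tanh$ applied coordinate-wise, using shared sign draws so the two estimates are directly comparable.

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 30, 150
A = rng.normal(size=(N, m))

sigma = rng.choice([-1.0, 1.0], size=(20000, m))

def R(M):
    # Monte Carlo estimate of the Rademacher complexity with the shared signs above
    return (sigma @ M.T).max(axis=1).mean() / m

# tanh is 1-Lipschitz (rho = 1), so R(tanh o A) <= R(A) should hold
print(f"R(tanh o A) = {R(np.tanh(A)):.4f} <= R(A) = {R(A):.4f}")
```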
