【Machine Learning】Other Stuff

本笔记基于清华大学《机器学习》的课程讲义中有关机器学习的此前未提到的部分,基本为笔者在考试前一两天所作的Cheat Sheet。内容较多,并不详细,主要作为复习和记忆的资料。

Robust Machine Learning

Attack: PGD

  • max ⁡ δ ∈ Δ L o s s ( f θ ( x + δ ) , y ) \max_{\delta\in \Delta}Loss(f_\theta(x+\delta),y) maxδ∈ΔLoss(fθ(x+δ),y): find δ \delta δ that maximizes the loss
  • How to compute δ \delta δ?
    • Projected Gradient Descent: GD then project back to Δ \Delta Δ
    • Fast Gradient Sign Method: For example, Δ = { δ : ∣ δ ∣ ∞ ≤ ϵ } \Delta =\{\delta:|\delta|\infty\le \epsilon\} Δ={δ:∣δ∣∞≤ϵ}. Then as learning rate goes to ∞ \infty ∞, we always go to corner. Then δ = η ⋅ s i g n ( ∇ x L θ ( x + δ t , y ) ) \delta=\eta\cdot sign(\nabla_xL\theta(x+\delta_t,y)) δ=η⋅sign(∇xLθ(x+δt,y)), we only take the sign.
    • PGD: runs FGSM multiple times.

Defend: Adversarial Training

  • Daskin's Theorem:
    ∂ ∂ θ max ⁡ δ L ( f θ ( x + δ , y ) ) = ∂ ∂ θ L ( f θ ( x + δ ∗ , y ) ) \frac{\partial}{\partial \theta }\max_\delta L(f_\theta(x+\delta,y))=\frac{\partial }{\partial \theta}L(f_\theta(x+\delta^*,y)) ∂θ∂δmaxL(fθ(x+δ,y))=∂θ∂L(fθ(x+δ∗,y))

    Only care about the worst attack samples. For a batch of samples, find δ ∗ \delta^* δ∗, then do GD on θ \theta θ based on δ ∗ \delta^* δ∗.

  • Models become more semantically meaningful by adversarial training

Robust Feature Learning

  • For a dog photo x x x, attack it to x ′ x' x′, so that f ( x ′ ) = c a t f(x')=cat f(x′)=cat. This ( x ′ , c a t ) (x',cat) (x′,cat) training set gives model the "non-robust" features.
  • For train image x x x, generate from random initialization x τ x_\tau xτ such that g ( x ) ≈ g ( x τ ) g(x)\approx g(x_\tau) g(x)≈g(xτ). x τ x_\tau xτ is also a good training sample that gives similar robust feature to the model.

False Sense of Security

  • Backward pass differentiable approximation(ignore the gradient that is hard to compute).
  • Take multiple samples to solve the randomness.

Provable Robust Certificates

  • Classification problem: use histogram, pick the largest color nearby.
  • Compare the histogram centered at x x x and x + δ x+\delta x+δ:
    • Worst case: Greedy filling(*)

Hyperparameter

Bayesian Optimization

  • estimate the parameters as Gaussian.
  • explore both the great-variance point and the low point.

Gradient descent

  • Directly use GD to find hyperparameter: memory storage problem
  • SGD with momentum: v t + 1 = γ t v t − ( 1 − γ t ) ∇ w L ( w t , θ , t ) v_{t+1}=\gamma_tv_t-(1-\gamma_t)\nabla _wL(w_t,\theta ,t) vt+1=γtvt−(1−γt)∇wL(wt,θ,t)
    • Store w t + 1 , v t + 1 w_{t+1},v_{t+1} wt+1,vt+1 and ∇ w L ( w t , θ , t ) \nabla _wL(w_t,\theta,t) ∇wL(wt,θ,t), we can recover the whole chain
  • Need: Continuous
  • Work better than grid search

Best Arm identification

  • Successive Halving(SH) Algorithm(*)

    • Per round use B / log ⁡ 2 ( n ) B/\log_2(n) B/log2(n), log ⁡ 2 ( n ) \log_2(n) log2(n) rounds totally.
    • Each round, every survivor use the same budget. Only half of them survive in each round.
  • Assume v 1 > v 2 ≥ . . . ≥ v n , Δ i = v 1 − v i v_1>v_2\ge ...\ge v_n,\Delta_i=v_1-v_i v1>v2≥...≥vn,Δi=v1−vi. With probability 1 − δ 1-\delta 1−δ, the algorithm finds the best arm with B = O ( max ⁡ i > 1 i Δ i 2 log ⁡ n log ⁡ ( log ⁡ n / δ ) ) B=O(\max_{i>1}\frac{i}{\Delta_i^2}\log n\log(\log n/\delta)) B=O(maxi>1Δi2ilognlog(logn/δ)) assuming v i ∈ [ 0 , 1 ] v_i\in [0,1] vi∈[0,1]

    • Proof. Concentration inequaility, probability that 1 3 \frac{1}{3} 31 of 3 4 N r \frac{3}{4}N_r 43Nr smaller ones are greater than best one is bounded, union bound for all round.
  • Hyperparameter tuning

    • B ≥ 2 log ⁡ 2 ( n ) ( n + ∑ i = 2... n γ ˉ − 1 ( v i − v 1 2 ) ) B\ge 2\log_2(n)\left(n+\sum_{i=2...n}\bar{\gamma}^{-1}(\frac{v_i-v_1}{2})\right) B≥2log2(n)(n+∑i=2...nγˉ−1(2vi−v1))
  • Reinforcement learning
  • ProxylessNAS: each architecture has weight α i \alpha_i αi to be chosen. Choose a binary variable, that is only one architecture exists, and the probability is about α i \alpha_i αi. ∂ L ∂ α i ≈ ∂ L ∂ g i ∂ p i ∂ α i \frac{\partial L}{\partial \alpha_i}\approx\frac{\partial L}{\partial g_i}\frac{\partial p_i}{\partial \alpha_i} ∂αi∂L≈∂gi∂L∂αi∂pi, g i = 0 / 1 g_i=0/1 gi=0/1, ∂ L / ∂ g j \partial L/\partial g_j ∂L/∂gj means the influence of architecture j j j to L L L.

Machine Learning Augmented

Differencial Privacy

  • Randomness is essential.

    • If we have a non-trivial deterministic algorithm, we can change data base A A A to B B B one by one until their output is the same. There exists a pair of databases differ by one row, then we know about that row.
  • Database: histogram N ∣ X ∣ N^{|X|} N∣X∣ for discrete set X X X(the categories).

  • M M M is ( ϵ , δ ) (\epsilon,\delta ) (ϵ,δ)-differentially privacy if ∀ x , y ∈ N ∣ X ∣ , ∣ x − y ∣ 1 ≤ 1 , S ⊆ M ( N ∣ X ∣ ) \forall x,y\in N^{|X|},|x-y|_1\le 1,S\subseteq M(N^{|X|}) ∀x,y∈N∣X∣,∣x−y∣1≤1,S⊆M(N∣X∣)
    P ( M ( x ) ∈ S ) ≤ e ϵ P ( M ( y ) ∈ S ) + δ P(M(x)\in S)\le e^{\epsilon}P(M(y)\in S)+\delta P(M(x)∈S)≤eϵP(M(y)∈S)+δ

  • Differencial privacy is immune to post-processing. Proof:

    • For deterministic function: Proved easily by inverse function
    • Random function is convex combination of deterministic function, that is, each deterministic function is chosen with some probability.
  • With ( ϵ , 0 ) (\epsilon,0) (ϵ,0)-differential privacy(DP mechanism), the voting result will not change too much by changing one's vote.
    E a ∼ f ( M ( x ) ) [ u i ( a ) ] = ∑ a ∈ A u i ( a ) Pr ⁡ f ( M ( x ) ) [ a ] ≤ ∑ a ∈ A u i ( a ) exp ⁡ ( ϵ ) Pr ⁡ f ( M ( y ) ) [ a ] = exp ⁡ ( ϵ ) E a ∼ f ( M ( y ) ) [ u i ( a ) ] \begin{align*} E_{a\sim f(M(x))}[u_i(a)]&=\sum_{a\in A} u_i(a)\Pr_{f(M(x))}[a]\\&\le \sum_{a\in A}u_i(a)\exp(\epsilon)\Pr_{f(M(y))}[a]\\&=\exp(\epsilon)E_{a\sim f(M(y))}[u_i(a)] \end{align*} Ea∼f(M(x))[ui(a)]=a∈A∑ui(a)f(M(x))Pr[a]≤a∈A∑ui(a)exp(ϵ)f(M(y))Pr[a]=exp(ϵ)Ea∼f(M(y))[ui(a)]
    f f f is a map to the feature, and we only consider the expected voting utility u i ( a ) u_i(a) ui(a) about the feature a a a.

  • Laplace Mechanism: reach ( ϵ , 0 ) (\epsilon,0) (ϵ,0)-DP by just adding a random gaussian noise to f ( x ) , x ∈ N ∣ X ∣ f(x),x\in N^{|X|} f(x),x∈N∣X∣

    • Assume f : N ∣ X ∣ → R k f:N^{|X|}\to \R^{k} f:N∣X∣→Rk

    • Definition: M L ( x , f , ϵ ) = f ( x ) + ( Y 1 , . . , Y k ) M_L(x,f,\epsilon)=f(x)+(Y_1,..,Y_k) ML(x,f,ϵ)=f(x)+(Y1,..,Yk)

      • Y i Y_i Yi is iid random variable from L a p ( Δ f ϵ ) Lap(\frac{\Delta f }{\epsilon}) Lap(ϵΔf)
      • l 1 l_1 l1 sensitive of f f f is Δ f = max ⁡ x , y , ∣ x − y ∣ 1 ≤ 1 ∣ f ( x ) − f ( y ) ∣ 1 \Delta f=\max_{x,y,|x-y|_1\le 1}|f(x)-f(y)|_1 Δf=maxx,y,∣x−y∣1≤1∣f(x)−f(y)∣1. How sensitive f f f is by changing one entry a little.
      • Probability density: L a p ( b ) = 1 2 b exp ⁡ ( − ∣ x ∣ b ) Lap(b)=\frac{1}{2b}\exp(-\frac{|x|}{b}) Lap(b)=2b1exp(−b∣x∣). Variance σ 2 = 2 b 2 \sigma^2=2b^2 σ2=2b2.
    • Proof. The probability density of M L ( x , f , ϵ ) M_L(x,f,\epsilon) ML(x,f,ϵ) is p x ( z ) = ∏ i = 1 k exp ⁡ ( − ϵ ∣ f ( x ) i − z i ∣ Δ f ) p_x(z)=\prod_{i=1}^k\exp\left(-\frac{\epsilon|f(x)_i-z_i|}{\Delta f}\right) px(z)=∏i=1kexp(−Δfϵ∣f(x)i−zi∣). Easily prove p x ( z ) / p y ( z ) ≤ exp ⁡ ( ϵ ) p_x(z)/p_y(z)\le \exp(\epsilon) px(z)/py(z)≤exp(ϵ)

    • Accuracy: for δ ∈ ( 0 , 1 ] \delta\in(0,1] δ∈(0,1],
      Pr ⁡ [ ∣ f ( x ) − M L ( x , f , ϵ ) ∣ ∞ ≥ ln ⁡ ( k δ ) ⋅ ( Δ f ϵ ) ] ≤ δ \Pr\left[|f(x)-M_L(x,f,\epsilon)|_\infty\ge \ln\left(\frac{k}{\delta}\right)\cdot \left(\frac{\Delta f}{\epsilon}\right)\right]\le \delta Pr[∣f(x)−ML(x,f,ϵ)∣∞≥ln(δk)⋅(ϵΔf)]≤δ

      • Directly prove by Pr ⁡ [ ∣ Y ∣ ≥ t ⋅ b ] = exp ⁡ ( − t ) \Pr[|Y|\ge t\cdot b]=\exp(-t) Pr[∣Y∣≥t⋅b]=exp(−t) for Y ∼ L a p ( b ) Y\sim Lap(b) Y∼Lap(b)

Big data

  • Idea: data distribution rarely changes.
  • Replace B-tree with a model. Know err_min err_max because you only care about the data in your database
    • advantage: less storage, faster lookups, more parallelism, hardware accelators
    • use linear model(fast)
    • Do exponential search since the prediction is good enough
    • Bloom Filter: is the key in my set=>maybe yes/no

Low Rank Apporximation

  • A ∈ R n × m A\in \mathbb{R}^{n\times m} A∈Rn×m, [ A ] k = arg ⁡ min ⁡ r a n k ( A ′ ) ≤ k ∣ A − A ′ ∣ F [A]k=\arg \min{rank(A')\le k} |A-A'|_F [A]k=argminrank(A′)≤k∣A−A′∣F
  • Learn S ∈ R p × m S\in \mathbb{R}^{p\times m} S∈Rp×m and do low rank decomposition for S A SA SA, then do some recover.

Recified flow

  • Given empirical observations about distribution π 0 \pi_0 π0 and π 1 \pi_1 π1. Find a transport map T T T that T ( Z ) ∼ π 1 T(Z)\sim \pi_1 T(Z)∼π1 for Z ∼ π 0 Z\sim \pi_0 Z∼π0.
    • E.g. π 0 \pi_0 π0 is gaussian noise, π 1 \pi_1 π1 are data containing pictures. We want to find a map from noise to picture(diffusion) that has some features.
    • If T ( X 0 ) = X 1 T(X_0)=X_1 T(X0)=X1, then we want v ( t X 0 + ( 1 − t ) X 1 , t ) = d X d t = X 1 − X 0 v(tX_0+(1-t)X_1,t)=\frac{dX}{dt}=X_1-X_0 v(tX0+(1−t)X1,t)=dtdX=X1−X0. The transformation is linear. So that one cluster will map to another clusters.
    • Minimize the loss E [ ∥ ( X 1 − X 0 ) − v ( X t , t ) ∥ 2 ] \mathbb{E}[\|(X_1-X_0)-v(X_t,t)\|^2] E[∥(X1−X0)−v(Xt,t)∥2]
  • Algorithm:
    • randomly connect π 0 , π 1 \pi_0,\pi_1 π0,π1
    • minimize loss of $\theta $ for v θ v_\theta vθ
    • connect π 0 , π 1 \pi_0,\pi_1 π0,π1 by v θ v_\theta vθ again, and repeat minimization.

Rope Attention

  • Embed position to q m , k n q_m,k_n qm,kn for attention by rotation in complex field
    • β \beta β-base system: focus on different accuracy level
  • use inner product of q m , k n q_m,k_n qm,kn to represent their similarity
  • Softmax attention Attention ( Q , K , V ) = softmax ( Q K ⊤ ) V \text{Attention}(Q,K,V)=\text{softmax}(QK^\top)V Attention(Q,K,V)=softmax(QK⊤)V
  • Faster computation
    • linear
    • softmax separately
    • design Q K ⊤ QK^\top QK⊤
相关推荐
jndingxin26 分钟前
OpenCV特征检测(1)检测图像中的线段的类LineSegmentDe()的使用
人工智能·opencv·计算机视觉
@月落36 分钟前
alibaba获得店铺的所有商品 API接口
java·大数据·数据库·人工智能·学习
z千鑫1 小时前
【人工智能】如何利用AI轻松将java,c++等代码转换为Python语言?程序员必读
java·c++·人工智能·gpt·agent·ai编程·ai工具
y_dd1 小时前
【machine learning-12-多元线性回归】
算法·机器学习·线性回归
MinIO官方账号1 小时前
从 HDFS 迁移到 MinIO 企业对象存储
人工智能·分布式·postgresql·架构·开源
小魏冬琅1 小时前
K-means 算法的介绍与应用
算法·机器学习·kmeans
aWty_1 小时前
机器学习--K-Means
人工智能·机器学习·kmeans
草莓屁屁我不吃1 小时前
AI大语言模型的全面解读
人工智能·语言模型·自然语言处理·chatgpt
WPG大大通2 小时前
有奖直播 | onsemi IPM 助力汽车电气革命及电子化时代冷热管理
大数据·人工智能·汽车·方案·电气·大大通·研讨会
百锦再2 小时前
AI对汽车行业的冲击和比亚迪新能源汽车市场占比
人工智能·汽车