【Machine Learning】Other Stuff

These notes are based on the parts of Tsinghua University's Machine Learning lecture handouts not covered earlier; they are essentially a cheat sheet the author wrote one or two days before the exam. The material is broad but not detailed, and is mainly intended for review and memorization.

Robust Machine Learning

Attack: PGD

  • $\max_{\delta\in\Delta}\mathrm{Loss}(f_\theta(x+\delta),y)$: find the $\delta$ that maximizes the loss
  • How to compute $\delta$?
    • Projected Gradient Descent: take a gradient step, then project back onto $\Delta$
    • Fast Gradient Sign Method: for example, $\Delta=\{\delta:\|\delta\|_\infty\le\epsilon\}$. As the learning rate goes to $\infty$, the update always lands on a corner of the ball, so we keep only the sign: $\delta_{t+1}=\eta\cdot\mathrm{sign}(\nabla_x L_\theta(x+\delta_t,y))$
    • PGD: run FGSM multiple times (a minimal sketch follows)
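A minimal PGD sketch in PyTorch, assuming a classifier `model` trained with cross-entropy; `eps`, `alpha`, and `steps` are illustrative values, not the course's:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximately maximize the loss over the L-infinity ball of radius eps."""
    delta = torch.empty_like(x).uniform_(-eps, eps)   # random start inside Delta
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # FGSM step: move along the sign of the gradient ...
        delta = delta.detach() + alpha * grad.sign()
        # ... then project back onto Delta = {|delta|_inf <= eps}.
        delta = delta.clamp(-eps, eps)
    return (x + delta).detach()
```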

Defend: Adversarial Training

  • Danskin's Theorem:
    $\frac{\partial}{\partial\theta}\max_\delta L(f_\theta(x+\delta),y)=\frac{\partial}{\partial\theta}L(f_\theta(x+\delta^*),y)$

    We only care about the worst-case attack samples: for a batch of samples, find $\delta^*$, then do gradient descent on $\theta$ at $\delta^*$ (see the sketch after this list).

  • Adversarial training makes models more semantically meaningful
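A minimal adversarial-training sketch using Danskin's theorem: solve the inner max with the `pgd_attack` sketch above, then take a gradient step on $\theta$ at $\delta^*$. `model`, `optimizer`, and `loader` are assumed to exist:

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer):
    for x, y in loader:
        # Inner maximization: find (an approximation of) delta* for this batch.
        x_adv = pgd_attack(model, x, y)
        # Outer minimization: by Danskin, grad_theta of the max equals
        # grad_theta of the loss evaluated at delta*.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```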

Robust Feature Learning

  • Take a dog photo $x$ and attack it into $x'$ so that $f(x')=\text{cat}$. Training on the pairs $(x',\text{cat})$ gives the model the "non-robust" features.
  • For a training image $x$, start from a random initialization and optimize $x_\tau$ so that $g(x)\approx g(x_\tau)$, where $g$ is a robust feature extractor. Then $x_\tau$ is also a good training sample that gives the model similar robust features (a sketch follows).
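A sketch of this construction, assuming `g` is (the feature layer of) a robustly trained network: start from random noise and do gradient descent on the input until its robust features match those of the real image `x`:

```python
import torch
import torch.nn.functional as F

def match_robust_features(g, x, steps=200, lr=0.1):
    target = g(x).detach()                            # robust features of the real image
    x_tau = torch.rand_like(x).requires_grad_(True)   # random initialization
    opt = torch.optim.SGD([x_tau], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(g(x_tau), target)           # push g(x_tau) toward g(x)
        loss.backward()
        opt.step()
    return x_tau.detach()                             # a "robust-feature" training sample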

False Sense of Security

  • Backward Pass Differentiable Approximation (BPDA): on the backward pass, replace the gradient that is hard to compute with a differentiable approximation.
  • Take multiple samples to average out the randomness (see the sketch below).
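A sketch of the "multiple samples" idea, assuming the defense injects Gaussian noise with scale `sigma`: average the input gradient over several draws of the randomness instead of trusting a single sample:

```python
import torch
import torch.nn.functional as F

def averaged_gradient(model, x, y, n_samples=16, sigma=0.1):
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        x_noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        loss = F.cross_entropy(model(x_noisy), y)
        grad += torch.autograd.grad(loss, x_noisy)[0]
    return grad / n_samples   # use this in place of a single-sample gradient
```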

Provable Robust Certificates

  • Classification: sample noisy points near the input, build a histogram of predicted classes, and pick the most frequent one (a prediction sketch follows this list).
  • Compare the histograms centered at $x$ and at $x+\delta$:
    • Worst case: greedy filling(*)
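A minimal sketch of the histogram step, assuming a single input with a batch dimension and Gaussian noise; turning the top-class frequency into a certified radius (the greedy-filling worst case) is omitted, and `sigma`, `n` are illustrative:

```python
import torch
from collections import Counter

def smoothed_predict(model, x, sigma=0.25, n=1000):
    counts = Counter()
    for _ in range(n):
        noisy = x + sigma * torch.randn_like(x)       # sample a nearby point
        counts[model(noisy).argmax(dim=-1).item()] += 1
    return counts.most_common(1)[0][0]                # pick the largest "color" nearby
```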

Hyperparameter Optimization

Bayesian Optimization

  • Model the objective as a Gaussian process over the hyperparameters.
  • Explore both the high-variance points and the low (predicted-loss) points; a sketch follows.
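A minimal Gaussian-process sketch of this explore/exploit rule in numpy: fit the GP posterior on the points tried so far, then query the candidate minimizing mean minus `beta` times std (low predicted loss or high uncertainty). The RBF kernel, length scale, and `beta` are illustrative choices, and `validation_loss` is a hypothetical objective:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)                        # K^{-1} K_*
    mu = sol.T @ y
    var = np.diag(rbf(Xs, Xs)) - np.sum(Ks * sol, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def bayes_opt(f, candidates, n_iter=20, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = list(rng.choice(candidates, size=2, replace=False))
    y = [f(x) for x in X]
    for _ in range(n_iter):
        mu, sd = gp_posterior(np.array(X), np.array(y), candidates)
        x_next = candidates[np.argmin(mu - beta * sd)]  # low mean OR high variance
        X.append(x_next); y.append(f(x_next))
    return X[int(np.argmin(y))]

# e.g. best_lr = bayes_opt(lambda lr: validation_loss(lr), np.linspace(0, 1, 101))
```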

Gradient descent

  • Directly using GD to find hyperparameters runs into a memory-storage problem: naively, every intermediate iterate must be stored.
  • SGD with momentum: $v_{t+1}=\gamma_t v_t-(1-\gamma_t)\nabla_w L(w_t,\theta,t)$
    • Storing $w_{t+1}$, $v_{t+1}$ and $\nabla_w L(w_t,\theta,t)$, we can recover the whole chain by running the update backwards (see the sketch below)
  • Requires the hyperparameter to be continuous
  • Works better than grid search
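A sketch of the reversibility argument in numpy: given the stored gradient $g_t$, the momentum update can be inverted exactly, so the whole trajectory can be reconstructed from the final $(w, v)$. The weight update $w_{t+1}=w_t+\alpha v_{t+1}$ is an assumed form matching the velocity definition above:

```python
import numpy as np

def forward_step(w, v, g, gamma, alpha):
    v_next = gamma * v - (1 - gamma) * g   # v_{t+1} = gamma*v_t - (1-gamma)*grad
    w_next = w + alpha * v_next            # assumed weight update
    return w_next, v_next

def backward_step(w_next, v_next, g, gamma, alpha):
    # Invert the two assignments above to recover the previous state exactly.
    w = w_next - alpha * v_next
    v = (v_next + (1 - gamma) * g) / gamma
    return w, v

w, v, g = np.ones(3), np.zeros(3), np.full(3, 0.5)
w1, v1 = forward_step(w, v, g, gamma=0.9, alpha=0.1)
assert np.allclose(backward_step(w1, v1, g, 0.9, 0.1), (w, v))
```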

Best Arm Identification

  • Successive Halving (SH) algorithm(*) (a sketch appears at the end of this list)

    • Each round uses budget $B/\log_2(n)$; there are $\log_2(n)$ rounds in total.
    • In each round, every survivor gets the same budget, and only the better half survives.
  • Assume $v_1>v_2\ge\dots\ge v_n$ and $\Delta_i=v_1-v_i$. With probability $1-\delta$, the algorithm finds the best arm with $B=O\left(\max_{i>1}\frac{i}{\Delta_i^2}\log n\log(\log n/\delta)\right)$, assuming $v_i\in[0,1]$

    • Proof idea: by a concentration inequality, the probability that $\frac{1}{3}$ of the $\frac{3}{4}N_r$ worse arms beat the best arm is bounded; union bound over all rounds.
  • Hyperparameter tuning

    • $B\ge 2\log_2(n)\left(n+\sum_{i=2}^{n}\bar{\gamma}^{-1}\left(\frac{v_i-v_1}{2}\right)\right)$
  • Reinforcement learning
  • ProxylessNAS: each candidate architecture has a weight $\alpha_i$. Sample binary gates so that only one architecture is active at a time, with probability $p_i$ determined by the $\alpha_i$. Then $\frac{\partial L}{\partial \alpha_i}\approx\sum_j\frac{\partial L}{\partial g_j}\frac{\partial p_j}{\partial \alpha_i}$, where $g_j\in\{0,1\}$ and $\partial L/\partial g_j$ measures the influence of architecture $j$ on $L$.
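A minimal Successive Halving sketch; `pull(arm, budget)` is an assumed evaluation routine (e.g. train a configuration for `budget` epochs and return its validation score):

```python
import math

def successive_halving(arms, B, pull):
    rounds = max(1, math.ceil(math.log2(len(arms))))   # log2(n) rounds
    survivors = list(arms)
    for _ in range(rounds):
        per_arm = B / rounds / len(survivors)          # round budget, split evenly
        scores = [(pull(arm, per_arm), arm) for arm in survivors]
        scores.sort(key=lambda s: s[0], reverse=True)
        survivors = [arm for _, arm in scores[: max(1, len(survivors) // 2)]]
    return survivors[0]
```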

Machine Learning Augmented

Differential Privacy

  • Randomness is essential.

    • If we had a non-trivial deterministic algorithm, we could change database $A$ into $B$ one row at a time; at some step the output changes, giving a pair of databases that differ in one row but have different outputs, so the output reveals information about that row.
  • Database: a histogram in $\mathbb{N}^{|X|}$ for a discrete set $X$ (the categories).

  • $M$ is $(\epsilon,\delta)$-differentially private if $\forall x,y\in\mathbb{N}^{|X|}$ with $\|x-y\|_1\le 1$ and $\forall S\subseteq \mathrm{Range}(M)$:
    $P(M(x)\in S)\le e^{\epsilon}P(M(y)\in S)+\delta$

  • Differential privacy is immune to post-processing. Proof:

    • For a deterministic function $T$: apply the definition to the preimage $T^{-1}(S)$.
    • A random function is a convex combination of deterministic functions, i.e., each deterministic function is chosen with some probability.
  • With an $(\epsilon,0)$-DP mechanism, the voting result will not change much when one vote changes:
    $E_{a\sim f(M(x))}[u_i(a)]=\sum_{a\in A}u_i(a)\Pr_{f(M(x))}[a]\le\sum_{a\in A}u_i(a)\exp(\epsilon)\Pr_{f(M(y))}[a]=\exp(\epsilon)E_{a\sim f(M(y))}[u_i(a)]$
    Here $f$ maps the mechanism's output to an action, and we only consider the expected voting utility $u_i(a)$ of actions $a$.

  • Laplace Mechanism: achieves $(\epsilon,0)$-DP by adding random Laplace noise to $f(x)$, $x\in\mathbb{N}^{|X|}$ (a code sketch follows this list)

    • Assume $f:\mathbb{N}^{|X|}\to\mathbb{R}^{k}$

    • Definition: $M_L(x,f,\epsilon)=f(x)+(Y_1,\dots,Y_k)$

      • The $Y_i$ are iid random variables drawn from $\mathrm{Lap}(\frac{\Delta f}{\epsilon})$
      • The $l_1$ sensitivity of $f$ is $\Delta f=\max_{x,y,\|x-y\|_1\le 1}\|f(x)-f(y)\|_1$: how much $f$ can change when one entry changes by one.
      • Probability density: $\mathrm{Lap}(b)$ has density $\frac{1}{2b}\exp(-\frac{|x|}{b})$ and variance $\sigma^2=2b^2$.
    • Proof. The probability density of $M_L(x,f,\epsilon)$ is $p_x(z)\propto\prod_{i=1}^k\exp\left(-\frac{\epsilon|f(x)_i-z_i|}{\Delta f}\right)$, and $p_x(z)/p_y(z)\le\exp(\epsilon)$ follows from the triangle inequality.

    • Accuracy: for $\delta\in(0,1]$,
      $\Pr\left[\|f(x)-M_L(x,f,\epsilon)\|_\infty\ge\ln\left(\frac{k}{\delta}\right)\cdot\left(\frac{\Delta f}{\epsilon}\right)\right]\le\delta$

      • Proved directly from $\Pr[|Y|\ge t\cdot b]=\exp(-t)$ for $Y\sim\mathrm{Lap}(b)$
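A minimal Laplace-mechanism sketch in numpy; `f_x` is the query answer $f(x)$ and `delta_f` its $l_1$ sensitivity, both assumed given:

```python
import numpy as np

def laplace_mechanism(f_x, delta_f, epsilon, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    b = delta_f / epsilon   # noise scale Delta_f / epsilon
    return np.asarray(f_x, dtype=float) + rng.laplace(0.0, b, size=np.shape(f_x))

# e.g. a counting query ("how many rows satisfy P?") has sensitivity 1:
# private_count = laplace_mechanism(f_x=42, delta_f=1.0, epsilon=0.5)
```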

Big data

  • Idea: the data distribution rarely changes.
  • Replace the B-tree with a model. We can know err_min and err_max exactly, because we only care about the keys actually in the database (a sketch follows this list).
    • Advantages: less storage, faster lookups, more parallelism, hardware accelerators
    • Use a linear model (fast)
    • Do exponential search around the prediction, since the prediction is good enough
    • Bloom filter: "is the key in my set?" => maybe yes / no
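A learned-index sketch in numpy: fit a linear model from key to position over the sorted keys, record the worst-case errors (possible because the key set is fixed), and search only inside the predicted window. Binary search inside the error window stands in for the exponential search mentioned above:

```python
import bisect
import numpy as np

def build_index(keys):
    keys = np.sort(np.asarray(keys, dtype=float))
    pos = np.arange(len(keys))
    slope, intercept = np.polyfit(keys, pos, 1)   # the "model": a fast linear fit
    err = pos - (slope * keys + intercept)
    err_bounds = (int(np.floor(err.min())), int(np.ceil(err.max())))  # err_min, err_max
    return keys, (slope, intercept), err_bounds

def lookup(keys, model, err_bounds, key):
    slope, intercept = model
    guess = int(round(slope * key + intercept))
    lo = max(0, guess + err_bounds[0])
    hi = min(len(keys), guess + err_bounds[1] + 1)
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)   # search the window only
    return i if i < len(keys) and keys[i] == key else None
```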

Low Rank Approximation

  • For $A\in\mathbb{R}^{n\times m}$, $[A]_k=\arg\min_{\mathrm{rank}(A')\le k}\|A-A'\|_F$
  • Learn a sketching matrix $S\in\mathbb{R}^{p\times n}$ (so that $SA\in\mathbb{R}^{p\times m}$ is small), do the low-rank decomposition on $SA$, then recover an approximation of $A$ (see the sketch below).
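A sketching sketch in numpy, with a random $S$ standing in for the learned one: compress $A$ to the small $p\times m$ matrix $SA$, take its top-$k$ right singular vectors, and project $A$ onto their span to get a rank-$k$ approximation:

```python
import numpy as np

def sketched_low_rank(A, p, k, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n, _ = A.shape
    S = rng.normal(size=(p, n)) / np.sqrt(p)     # random sketch; learned in the lecture's setting
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    V_k = Vt[:k].T                               # top-k right singular vectors of SA
    return (A @ V_k) @ V_k.T                     # rank-k "recovery" of A
```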

Rectified Flow

  • Given empirical observations of distributions $\pi_0$ and $\pi_1$, find a transport map $T$ such that $T(Z)\sim\pi_1$ for $Z\sim\pi_0$.
    • E.g. $\pi_0$ is Gaussian noise and $\pi_1$ is a data distribution over pictures. We want a map from noise to pictures (as in diffusion) with certain properties.
    • If $T(X_0)=X_1$, then with the linear interpolation $X_t=tX_1+(1-t)X_0$ we want $v(X_t,t)=\frac{dX_t}{dt}=X_1-X_0$, so one cluster maps to another along straight paths.
    • Minimize the loss $\mathbb{E}[\|(X_1-X_0)-v(X_t,t)\|^2]$
  • Algorithm (a training-step sketch follows):
    • randomly pair samples of $\pi_0$ and $\pi_1$
    • minimize the loss over $\theta$ for $v_\theta$
    • re-pair $\pi_0,\pi_1$ using $v_\theta$, and repeat the minimization
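A minimal training-step sketch in PyTorch: sample a pair from the current coupling, draw $t$ uniformly, and regress $v_\theta(X_t,t)$ onto the constant velocity $X_1-X_0$. `v_theta` is assumed to be any network taking `(x, t)`:

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(v_theta, optimizer, x0, x1):
    # One t per sample, shaped to broadcast over the remaining dimensions.
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))
    x_t = t * x1 + (1 - t) * x0          # point on the straight path X_t
    target = x1 - x0                     # dX_t/dt: constant velocity
    loss = F.mse_loss(v_theta(x_t, t), target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```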

RoPE Attention

  • Embed positions into $q_m,k_n$ for attention by a rotation in the complex plane (a sketch follows this list)
    • $\beta$-base system: different rotation frequencies focus on different accuracy levels, like digits in a base-$\beta$ numeral system
  • Use the inner product of $q_m,k_n$ to represent their similarity; after the rotation it depends only on the relative position $m-n$
  • Softmax attention: $\text{Attention}(Q,K,V)=\text{softmax}(QK^\top)V$
  • Faster computation:
    • linear attention
    • compute the softmax separately
    • design $QK^\top$ directly
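A minimal RoPE sketch in PyTorch: treat channel pairs as complex numbers and rotate pair $i$ at position $m$ by the angle $m\cdot\theta_i$ with $\theta_i=\text{base}^{-2i/d}$ (base = 10000 is the usual choice); applying this to both $q$ and $k$ makes their inner product depend only on $m-n$:

```python
import torch

def rope(x, m, base=10000.0):
    """x: (..., d) with d even; m: integer position index."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)   # (d/2,) frequencies
    angle = m * theta
    x1, x2 = x[..., 0::2], x[..., 1::2]          # pair up channels (2i, 2i+1)
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(angle) - x2 * torch.sin(angle)
    out[..., 1::2] = x1 * torch.sin(angle) + x2 * torch.cos(angle)
    return out
```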