【Machine Learning】Unsupervised Learning

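These notes are based on the unsupervised learning part of the lecture notes for Tsinghua University's Machine Learning course. They are essentially a cheat sheet the author wrote a day or two before the exam. They cover a lot of material without much detail and are mainly intended for review and memorization.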

Principal Component Analysis

  • Dimension reduction: by the JL lemma, a target dimension $d=\Omega\left(\frac{\log n}{\epsilon^2}\right)$ preserves the pairwise distances of $n$ data points up to a factor of $1\pm\epsilon$.
  • Goal of PCA
    • maximize variance: $\mathbb{E}[(v^\top x)^2]=v^\top XX^\top v$ for $|v|=1$ (for centered data, up to a $1/n$ factor)
    • minimize reconstruction error: $\mathbb{E}[|x-(v^\top x)v|^2]$
  • Find $v_i$ iteratively, then project the data points onto the subspace spanned by $v_1,v_2,\dots,v_d$
  • How to find $v$?
    • Eigendecomposition: $XX^\top=U\Sigma U^\top$
    • $v_1$ is the eigenvector with the largest eigenvalue.
    • Power method
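
The power method can be sketched as follows. This is a minimal illustration rather than the course's reference implementation: the function name `top_eigenvector`, the iteration count, and the toy data are assumptions made for the example. It repeatedly multiplies a random unit vector by $XX^\top$ and renormalizes, then compares the result with the top eigenvector from `np.linalg.eigh`.

```python
import numpy as np

def top_eigenvector(X, num_iters=500, seed=0):
    """Power method on C = X X^T, where X has shape (d, n) with centered columns."""
    rng = np.random.default_rng(seed)
    C = X @ X.T
    v = rng.normal(size=C.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = C @ v                      # multiply by the (unnormalized) covariance
        v = w / np.linalg.norm(w)      # renormalize to keep |v| = 1
    return v

# Toy data (hypothetical): 2-D points stretched along the first axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 200)) * np.array([[3.0], [0.5]])
X = X - X.mean(axis=1, keepdims=True)

v1 = top_eigenvector(X)

# Reference: eigh returns eigenvalues in ascending order, so the last column is v_1.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
u1 = eigvecs[:, -1]
alignment = abs(float(v1 @ u1))        # 1.0 means same direction up to sign
```

To get $v_2$, one common option is to subtract the projection onto $v_1$ from the data and run the same loop again.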

Nearest Neighbor Classification

  • KNN: K-nearest neighbor
  • Nearest neighbor search: locality-sensitive hashing (LSH)*
    • Randomized $c$-approximate $R$-near neighbor ($(c,R)$-NN): a data structure that, whenever an $R$-near neighbor of the query exists, returns a $cR$-near neighbor with some probability.
    • A family $H$ is called $(R,cR,P_1,P_2)$-sensitive if for any $p,q\in\mathbb{R}^d$
      • if $|p-q|\le R$, then $\Pr_H[h(q)=h(p)]\ge P_1$
      • if $|p-q|\ge cR$, then $\Pr_H[h(q)=h(p)]\le P_2$
      • $P_1>P_2$
    • Algorithm based on an LSH family:
      • Construct $g_i(x)=(h_{i,1}(x),h_{i,2}(x),\dots,h_{i,k}(x))$ for $1\le i\le L$. All $h_{i,j}$ are drawn i.i.d. from $H$.
      • For a query $q$, check the elements in the buckets $g_i(q)$ one by one to see whether each is a $cR$-near neighbor of $q$, stopping after $2L+1$ checks.
      • If an $R$-near neighbor exists, a $cR$-near neighbor is found with probability at least $\frac{1}{2}-\frac{1}{e}$.
      • Parameters: $\rho=\frac{\log 1/P_1}{\log 1/P_2}$, $k=\log_{1/P_2}(n)$, $L=n^\rho$
      • Proof
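
The bucket scheme above can be sketched in code. The notes do not name a concrete hash family, so this example assumes the random-projection hash $h(x)=\lfloor (a\cdot x+b)/w\rfloor$ for Euclidean distance; the class name `EuclideanLSH` and the values of `k`, `L`, and `w` are picked by hand for illustration instead of from the formulas for $k$ and $L$.

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """L hash tables, each keyed by g_i(x) = (h_{i,1}(x), ..., h_{i,k}(x))."""

    def __init__(self, dim, k=4, L=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(L, k, dim))      # random directions (assumed family)
        self.b = rng.uniform(0.0, w, size=(L, k))  # random offsets
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _keys(self, x):
        # One k-tuple of bucket ids per table.
        h = np.floor((self.a @ x + self.b) / self.w).astype(int)
        return [tuple(row) for row in h]

    def add(self, x):
        x = np.asarray(x, dtype=float)
        idx = len(self.points)
        self.points.append(x)
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, q, c, R):
        """Return the index of some cR-near neighbor, or None after 2L+1 failed checks."""
        q = np.asarray(q, dtype=float)
        budget = 2 * len(self.tables) + 1
        checked = 0
        for table, key in zip(self.tables, self._keys(q)):
            for idx in table.get(key, []):
                if np.linalg.norm(self.points[idx] - q) <= c * R:
                    return idx
                checked += 1
                if checked >= budget:
                    return None
        return None

index = EuclideanLSH(dim=2)
for p in [[0.0, 0.0], [100.0, 100.0], [-100.0, 50.0]]:
    index.add(p)

hit = index.query([0.0, 0.0], c=2.0, R=1.0)         # query coincides with point 0
miss = index.query([1000.0, 1000.0], c=2.0, R=1.0)  # nothing within cR of this query
```

A query that coincides with a stored point always lands in that point's bucket, which is why the first lookup succeeds deterministically; for nearby but distinct points success is only probabilistic, as stated in the guarantee above.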

Metric Learning

  • Map $x_i$ to $f(x_i)$
  • Hard version (compare the labels of its neighbors) vs. soft version
  • Neighborhood Component Analysis (NCA)
    • $p_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2)$
    • maximize $\sum_{i}\sum_{j\in C_i}p_{i,j}$, where $C_i$ is the set of points with the same label as $x_i$
  • LMNN: $L=\max(0,\|f(x)-f(x^+)\|_2-\|f(x)-f(x^-)\|_2+r)$
    • $x^+,x^-$ are the worst cases.
    • $r$ is the margin.
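
Both objectives above are easy to evaluate on already-embedded points. The sketch below is only illustrative: it takes the embeddings $f(x_i)$ as given arrays, and the function names `nca_objective` and `lmnn_loss` as well as the toy points are assumptions. For NCA, each row of $p$ is normalized over $j\neq i$, which is the usual reading of the $\sim$ above.

```python
import numpy as np

def nca_objective(Z, labels):
    """Sum over i of the probability that i picks a neighbor with the same label."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    logits = -sq
    np.fill_diagonal(logits, -np.inf)             # a point never picks itself
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # p[i, j] proportional to exp(-||z_i - z_j||^2)
    same = labels[:, None] == labels[None, :]
    return float(np.sum(p * same))

def lmnn_loss(z, z_pos, z_neg, r=1.0):
    """Hinge loss: the positive should be closer than the negative by margin r."""
    gap = np.linalg.norm(z - z_pos) - np.linalg.norm(z - z_neg) + r
    return float(max(0.0, gap))

# Toy embeddings (hypothetical): two tight, well-separated classes.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
nca_value = nca_objective(Z, labels)              # close to 4, the number of points

anchor = np.array([0.0, 0.0])
loss_ok = lmnn_loss(anchor, np.array([1.0, 0.0]), np.array([3.0, 0.0]))        # margin satisfied
loss_violated = lmnn_loss(anchor, np.array([1.0, 0.0]), np.array([1.5, 0.0]))  # 1 - 1.5 + 1
```

In practice $f$ would be a trainable map and these quantities would be optimized by gradient methods; that part is left out here.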

Spectral Clustering

  • K-means
  • Spectral graph clustering
    • Graph Laplacian: $L=D-A$, where $A$ represents the similarity and $D$ is the diagonal degree matrix.
    • Number of zero eigenvalues = number of connected components
    • The eigenvectors of the $k$ smallest eigenvalues give a partition into $k$ clusters: stack them as columns and run $k$-means on the rows.
    • Ratio cut can be relaxed into finding the eigenvectors of the $k$ smallest eigenvalues, which is the same problem as for the graph Laplacian.
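
A small numerical check of the two Laplacian facts above, as a sketch under assumptions: the helper name `spectral_embedding` and the toy graph (two disconnected triangles) are made up for the example, and the final $k$-means step is omitted because the embedded rows already coincide within each component.

```python
import numpy as np

def spectral_embedding(A, k):
    """Return all Laplacian eigenvalues and the eigenvectors of the k smallest ones."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalues, orthonormal columns
    return eigvals, eigvecs[:, :k]

# Toy similarity matrix (hypothetical): two disconnected triangles.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0

eigvals, U = spectral_embedding(A, k=2)

# Count eigenvalues that are zero up to floating-point error.
num_components = int(np.sum(np.abs(eigvals) < 1e-8))

# Rows of U are the embedded nodes; nodes in the same component share a row.
same_cluster = np.allclose(U[0], U[1]) and np.allclose(U[1], U[2])
different_cluster = not np.allclose(U[0], U[3])
```

On a connected graph with weak links between groups, the rows no longer coincide exactly but still concentrate around $k$ centers, which is where $k$-means comes in.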

SimCLR*

  • Intelligence is positioning

  • InfoNCE loss
    $L(q,p_1,\{p_i\}_{i=2}^N)=-\log \frac{\exp(-\|f(q)-f(p_1)\|^2/(2\tau))}{\sum_{i=1}^{N}\exp(-\|f(q)-f(p_i)\|^2/(2\tau))}$

  • Learn $Z=f(x)$: map the original data points into a space in which semantic similarity is captured naturally.

    • Reproducing kernel Hilbert space: $k(f(x_1),f(x_2))=\langle\phi(f(x_1)),\phi(f(x_2))\rangle_H$. The inner product is a kernel function.
    • Usually, $K_{Z,i,j}=k(Z_i-Z_j)$, where $k$ is Gaussian.
  • Suppose a similarity matrix $\pi$ over the dataset is given in advance, where $\pi_{i,j}$ is the similarity of data points $i$ and $j$. We want the similarity matrix $K_Z$ of $f(x)$ to match this manually specified similarity of $x$. Let $W_X\sim \pi$ and $W_Z\sim K_Z$ be samples drawn according to the two matrices; we want these two samples to be the same.

  • Minimize the cross-entropy loss: $H_{\pi}^{k}(Z)=-\mathbb{E}_{W_X\sim P(\cdot ;\pi)}[\log P(W_Z=W_X;K_Z)]$

    • Equivalent to the InfoNCE loss: looking only at row $i$, the InfoNCE loss is $-\log P(W_{Z,i}=W_{X,i};K_Z)$. The given pair $q,p_1$ is sampled from the similarity matrix $\pi$, which corresponds to $W_X\sim P(\cdot;\pi)$.
    • Equivalent to spectral clustering: equivalent to $\arg \min_Z \operatorname{tr}(Z^\top L^*Z)$
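
The InfoNCE loss from this section can be evaluated directly once the embeddings are available. This is a minimal sketch, not SimCLR's actual training code: the function name `info_nce`, the default `tau`, and the toy embeddings are assumptions, and the convention that row 0 of `fp` holds the positive $p_1$ is chosen for the example.

```python
import numpy as np

def info_nce(fq, fp, tau=0.5):
    """fq: (d,) embedding f(q); fp: (N, d) embeddings f(p_i), with row 0 the positive p_1."""
    logits = -np.sum((fp - fq) ** 2, axis=1) / (2.0 * tau)
    logits = logits - logits.max()                 # stabilize the log-sum-exp
    log_prob = logits[0] - np.log(np.sum(np.exp(logits)))
    return float(-log_prob)

fq = np.array([0.0, 0.0])

# Positive sits on the query, negatives are far away: loss close to 0.
loss_easy = info_nce(fq, np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]]))

# All candidates are equally close: the softmax is uniform, so the loss is log N.
loss_uniform = info_nce(fq, np.zeros((3, 2)))
```

The two cases bracket the behavior of the loss: it vanishes when the positive is the only nearby candidate and equals $\log N$ when the embedding cannot tell the candidates apart.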

t-SNE

  • Data visualization: map the data into a low-dimensional space (2D)

  • SNE: same idea as NCA; we want $q_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2/(2\sigma^2))$ to be similar to $p_{i,j}\sim \exp (-\|x_i-x_j\|^2/(2\sigma_i^2))$

    • Cross-entropy loss (the KL divergence between $p$ and $q$), with per-pair term $-p_{i,j}\cdot \log \frac{q_{i,j}}{p_{i,j}}$
  • Crowding problem

  • Solved by t-SNE: let $q_{i,j}\sim (1+\|y_j-y_i\|^2)^{-1}$ (Student's t-distribution), where $y_i=f(x_i)$ is the low-dimensional point

    • The power $-1$ gives a heavier tail than the Gaussian, so the crowding problem can be solved by shifting the distances: moderately distant points are allowed to sit farther apart in the low-dimensional map.
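
The difference between the two kernels is easy to see numerically. The sketch below is a simplification under stated assumptions: it normalizes over all pairs jointly and uses a single $\sigma=1$ for the Gaussian, unlike the per-point $\sigma_i$ above, and the names `joint_similarities`, `gaussian`, `student_t`, and `kl_divergence` are made up for the example.

```python
import numpy as np

def pairwise_sq_dists(Y):
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def gaussian(d2):
    return np.exp(-d2 / 2.0)          # SNE-style kernel with sigma = 1 (assumed)

def student_t(d2):
    return 1.0 / (1.0 + d2)           # t-SNE kernel, power -1

def joint_similarities(Y, kernel):
    """Normalize kernel values over all pairs i != j so they sum to 1."""
    w = kernel(pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Sum of p * log(p / q) over pairs with p > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Toy low-dimensional map (hypothetical): two close points and one distant point.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [6.0, 0.0]])
Q_gauss = joint_similarities(Y, gaussian)
Q_t = joint_similarities(Y, student_t)

far_pair_gauss = float(Q_gauss[0, 2])   # almost no mass on the distant pair
far_pair_t = float(Q_t[0, 2])           # heavy tail keeps noticeable mass
self_kl = kl_divergence(Q_t, Q_t)       # a distribution has zero divergence from itself
```

Because the t kernel still assigns noticeable similarity at large distances, a moderately similar pair can be placed far apart in the map without a large penalty, which is the distance shift mentioned above.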