【Machine Learning】Unsupervised Learning

These notes cover the unsupervised learning portion of the lecture notes for Tsinghua University's Machine Learning course; they are essentially a cheat sheet put together a day or two before the exam. They cover a lot of ground without much detail, and are intended mainly as review and memorization material.

Principal Component Analysis

  • Dimension reduction: the JL lemma gives $d=\Omega\left(\frac{\log n}{\epsilon^2}\right)$ to preserve the pairwise distances of $n$ data points (up to a $1\pm\epsilon$ factor).
  • Goal of PCA
    • maximize variance: $\mathbb{E}[(v^\top x)^2]=v^\top XX^\top v$ for $\|v\|=1$
    • minimize reconstruction error: $\mathbb{E}[\|x-(v^\top x)v\|^2]$
  • Find the $v_i$ iteratively, then project the data points onto the subspace spanned by $v_1,v_2,\dots,v_d$.
  • How to find $v$?
    • Eigendecomposition: $XX^\top=U\Sigma U^\top$
    • $v_1$ is the eigenvector with the largest eigenvalue.
    • Power method (a sketch follows below)
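
As a concrete illustration of the power method, here is a minimal NumPy sketch (the function name and iteration count are my own choices); it repeatedly applies $XX^\top$ to a random vector and renormalizes, converging to $v_1$:

```python
import numpy as np

def top_principal_component(X, n_iter=200, seed=0):
    """Leading PC of data X (d x n) via power iteration on C = X X^T."""
    X = X - X.mean(axis=1, keepdims=True)   # center the data first
    C = X @ X.T                             # d x d (unnormalized) covariance
    rng = np.random.default_rng(seed)
    v = rng.normal(size=C.shape[0])
    for _ in range(n_iter):
        v = C @ v                           # power step: amplifies the top eigendirection
        v /= np.linalg.norm(v)              # renormalize to unit length
    return v                                # eigenvector of the largest eigenvalue
```

Subsequent components can be found by deflation: replace $C$ with $C-\lambda_1 v_1 v_1^\top$ and repeat.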

Nearest Neighbor Classification

  • KNN: K-nearest neighbor
  • nearest neighbor search: locality-sensitive hashing (LSH)*
    • Randomized $c$-approximate $R$-near neighbor ($(c,R)$-NN): a data structure that, with some fixed probability, returns a $cR$-near neighbor whenever an $R$-near neighbor exists.
    • A family $H$ is called $(R,cR,P_1,P_2)$-sensitive if for any $p,q\in \mathbb{R}^d$:
      • if $\|p-q\|\le R$, then $\Pr_H[h(q)=h(p)]\ge P_1$
      • if $\|p-q\|\ge cR$, then $\Pr_H[h(q)=h(p)]\le P_2$
      • $P_1>P_2$
    • Algorithm based on an LSH family (see the sketch after this list):
      • Construct $g_i(x)=(h_{i,1}(x),h_{i,2}(x),\dots,h_{i,k}(x))$ for $1\le i\le L$, where all $h_{i,j}$ are drawn i.i.d. from $H$.
      • For each $i$, check whether the elements in the bucket $g_i(q)$ are $cR$-near neighbors of $q$, stopping after $2L+1$ checks.
      • If an $R$-near neighbor exists, a $cR$-near neighbor is found w.p. at least $\frac{1}{2}-\frac{1}{e}$.
      • Parameters: $\rho=\frac{\log 1/P_1}{\log 1/P_2}$, $k=\log_{1/P_2}(n)$, $L=n^\rho$
      • Proof
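
To make the bucketing concrete, here is a minimal sketch of the $g_i$-table construction (class and method names are my own). For simplicity it uses the random-hyperplane (sign) family, which is an LSH family for angular distance; the Euclidean analysis above would use p-stable projections instead:

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """L hash tables; table i is keyed by g_i(x) = (h_{i,1}(x), ..., h_{i,k}(x))."""

    def __init__(self, dim, k=8, L=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = [rng.normal(size=(k, dim)) for _ in range(L)]  # k hashes per table
        self.tables = [defaultdict(list) for _ in range(L)]

    def _g(self, i, x):
        return tuple((self.planes[i] @ x > 0).astype(int))  # concatenated hash key

    def insert(self, idx, x):
        for i in range(len(self.tables)):
            self.tables[i][self._g(i, x)].append((idx, x))

    def query(self, q, cR, max_checks):
        """Scan q's buckets; return any point within distance cR, stop after max_checks."""
        checked = 0
        for i in range(len(self.tables)):
            for idx, x in self.tables[i][self._g(i, q)]:
                if np.linalg.norm(q - x) <= cR:
                    return idx                      # found a cR-near neighbor
                checked += 1
                if checked >= max_checks:           # the analysis above uses 2L + 1
                    return None
        return None
```

With $k=\log_{1/P_2} n$ and $L=n^\rho$ tables, calling `query(q, c * R, 2 * L + 1)` corresponds to the $\frac{1}{2}-\frac{1}{e}$ success guarantee above.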

Metric Learning

  • Learn a map $f$; project each $x_i$ to $f(x_i)$.
  • Hard version (compare the labels of a point's neighbors directly) vs. soft version (probabilistic neighbors, as in NCA below)
  • Neighborhood Component Analysis (NCA)
    • $p_{i,j}\propto \exp(-\|f(x_i)-f(x_j)\|^2)$
    • maximize $\sum_{i}\sum_{j\in C_i}p_{i,j}$, where $C_i$ is the set of points sharing $i$'s label
  • LMNN (sketched below): $L=\max(0,\|f(x)-f(x^+)\|_2-\|f(x)-f(x^-)\|_2+r)$
    • $x^+,x^-$ are the worst cases: the hardest same-class and different-class examples.
    • $r$ is the margin.
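
A minimal NumPy sketch of both objectives (function names are my own; the mining of the worst-case $x^+,x^-$ is left out):

```python
import numpy as np

def nca_objective(Z, labels):
    """NCA: sum of soft neighbor probabilities p_{ij} over same-class pairs."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # a point is not its own neighbor
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)            # p_{ij} proportional to exp(-||z_i - z_j||^2)
    same = labels[:, None] == labels[None, :]    # mask for j in C_i
    return (P * same).sum()                      # maximize this

def lmnn_hinge(z, z_pos, z_neg, r=1.0):
    """Triplet hinge: max(0, ||z - z+||_2 - ||z - z-||_2 + r)."""
    return max(0.0, np.linalg.norm(z - z_pos) - np.linalg.norm(z - z_neg) + r)
```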

Spectral Clustering

  • K-means
  • Spectral graph clustering
    • Graph Laplacian: $L=D-A$, where $A$ is the similarity (affinity) matrix and $D$ its diagonal degree matrix.
    • # zero eigenvalues = # connected components
    • The $k$ eigenvectors with the smallest eigenvalues give a partition into $k$ clusters: stack them as columns and run $k$-means on the rows (see the sketch below).
    • The (relaxed) ratio cut can be transformed into finding the $k$ smallest eigenvectors, which is exactly the graph Laplacian problem above.
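
A minimal end-to-end sketch (assuming a precomputed symmetric affinity matrix `A`; scikit-learn's KMeans is used for the final step):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k, seed=0):
    """Unnormalized spectral clustering from a symmetric similarity matrix A."""
    D = np.diag(A.sum(axis=1))                 # degree matrix
    L = D - A                                  # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    Z = eigvecs[:, :k]                         # k smallest eigenvectors as columns
    # Each row of Z is a node's spectral embedding; cluster the rows.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
```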

SimCLR*

  • Intelligence is positioning

  • InfoNCE loss (implemented in the sketch at the end of this section):
    $$L(q,p_1,\{p_i\}_{i=2}^N)=-\log \frac{\exp(-\|f(q)-f(p_1)\|^2/(2\tau))}{\sum_{i=1}^{N}\exp(-\|f(q)-f(p_i)\|^2/(2\tau))}$$

  • Learn $Z=f(x)$: map the original data points into a space where semantic similarity is captured naturally.

    • Reproducing kernel Hilbert space: $k(f(x_1),f(x_2))=\langle\phi(f(x_1)),\phi(f(x_2))\rangle_H$, i.e. the kernel is an inner product in $H$.
    • Usually $K_{Z,i,j}=k(Z_i-Z_j)$ with $k$ Gaussian.
  • We are given a similarity matrix $\pi$ over the dataset, where $\pi_{i,j}$ is the similarity of data points $i$ and $j$. We want the similarity matrix $K_Z$ of $Z=f(x)$ to match the manually specified $\pi$. Sampling neighbors $W_X\sim \pi$ and $W_Z\sim K_Z$, we want these two samples to agree.

  • Minimize the cross-entropy loss (also in the sketch below): $H_{\pi}^{k}(Z)=-\mathbb{E}_{W_X\sim P(\cdot;\pi)}[\log P(W_Z=W_X;K_Z)]$

    • Equivalent to the InfoNCE loss: looking only at row $i$, the InfoNCE loss is $-\log P(W_{Z,i}=W_{X,i})$. The given pair $q,p_1$ is sampled from the similarity matrix $\pi$, which corresponds to $W_X\sim P(\cdot;\pi)$.
    • Equivalent to spectral clustering: the objective reduces to $\arg\min_Z \operatorname{tr}(Z^\top L^* Z)$.
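
To make the two objectives concrete, here is a minimal NumPy sketch of both (function names and the Gaussian-kernel choice for $K_Z$ are my own; as noted above, a single row of the cross-entropy recovers the InfoNCE term):

```python
import numpy as np

def info_nce(fq, fP, tau=0.5):
    """InfoNCE: fq is f(q); fP stacks f(p_1),...,f(p_N), row 0 being the positive p_1."""
    logits = -((fP - fq) ** 2).sum(axis=1) / (2 * tau)  # -||f(q)-f(p_i)||^2 / (2 tau)
    logits -= logits.max()                              # for numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))  # -log softmax at the positive

def kernel_cross_entropy(Z, Pi, sigma=1.0):
    """H_pi^k(Z): per-row cross-entropy between P(.; pi) and P(.; K_Z)."""
    Pi = np.array(Pi, dtype=float)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma**2))                    # Gaussian kernel K_Z
    np.fill_diagonal(K, 0.0)                            # no self-neighbors
    np.fill_diagonal(Pi, 0.0)
    P_X = Pi / Pi.sum(axis=1, keepdims=True)            # row distributions of W_X
    P_Z = K / K.sum(axis=1, keepdims=True)              # row distributions of W_Z
    return -(P_X * np.log(P_Z + 1e-12)).sum(axis=1).mean()
```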

t-SNE

  • data visualization: map data into a low-dimensional space (typically 2D)

  • SNE: same idea as NCA; we want the low-dimensional $q_{i,j}\propto \exp(-\|f(x_i)-f(x_j)\|^2/(2\sigma^2))$ to be similar to the high-dimensional $p_{i,j}\propto \exp(-\|x_i-x_j\|^2/(2\sigma_i^2))$

    • Cross-entropy (KL) loss: $-\sum_{i,j} p_{i,j}\log \frac{q_{i,j}}{p_{i,j}}=\mathrm{KL}(P\|Q)$
  • Crowding problem: moderate distances in high dimensions cannot all be represented faithfully in 2D, so points get crushed together.

  • Solved by t-SNE: let $q_{i,j}\propto (1+\|y_j-y_i\|^2)^{-1}$ (Student t-distribution); a sketch follows below.

    • The $(1+d^2)^{-1}$ kernel has a heavier tail than the Gaussian, so moderate high-dimensional distances map to larger low-dimensional distances, which relieves crowding.
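
A minimal sketch of the two affinity kernels and the objective (function names are my own; the per-point bandwidth search and the exact symmetrization conventions of P and Q are omitted):

```python
import numpy as np

def sne_p(X, sigma=1.0):
    """High-dim affinities p_{ij} proportional to exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def tsne_q(Y):
    """Low-dim affinities q_{ij} proportional to (1 + ||y_i - y_j||^2)^(-1) (Student t)."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    inv = 1.0 / (1.0 + d2)                 # heavy-tailed kernel relieves crowding
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

def kl_objective(P, Q):
    """t-SNE loss: KL(P || Q) = sum_{ij} p_ij log(p_ij / q_ij)."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```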