These notes are based on the unsupervised-learning part of the lecture notes for Tsinghua University's Machine Learning course; they are essentially a cheat sheet written one or two days before the exam. The coverage is broad but not detailed, and is mainly meant for review and memorization.
Principal Component Analysis
- Dimension reduction: by the JL lemma, $d=\Omega\left(\frac{\log n}{\epsilon^2}\right)$ dimensions suffice to preserve the pairwise distances of $n$ data points (up to a $1\pm\epsilon$ factor).
- Goal of PCA
- maximize variance: $\mathbb{E}[(v^\top x)^2]=v^\top XX^\top v$ for $\|v\|=1$
- minimize reconstruction error: $\mathbb{E}[\|x-(v^\top x)v\|^2]$
- Find the $v_i$ iteratively; project data points onto the subspace spanned by $v_1,v_2,\dots,v_d$.
- How to find $v$?
- Eigendecomposition: $XX^\top = U\Sigma U^\top$
- $v_1$ is the eigenvector with the largest eigenvalue.
- Power method
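A minimal numpy sketch of the power method for the top principal component, assuming the columns of `X` are already centered (the function name and toy data are my own illustration):

```python
import numpy as np

def top_principal_component(X, num_iters=200, seed=0):
    """Power iteration on X X^T.

    X: (d, n) matrix whose columns are centered data points.
    Returns a unit vector approximating the top eigenvector of X X^T.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = X @ (X.T @ v)          # multiply by X X^T without forming it explicitly
        v /= np.linalg.norm(v)
    return v

# Toy usage: 100 points in 5 dimensions, centered column-wise.
X = np.random.randn(5, 100)
X -= X.mean(axis=1, keepdims=True)
v1 = top_principal_component(X)
print(v1, v1 @ (X @ X.T @ v1))     # direction and its (unnormalized) explained variance
```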
Nearest Neighbor Classification
- KNN: K-nearest neighbor
- nearest neighbor search: locality-sensitive hashing (LSH)*
- Randomized $c$-approximate $R$-near neighbor ($(c,R)$-NN): a data structure that, with some probability, returns a $cR$-near neighbor whenever an $R$-near neighbor exists.
- A family $H$ is called $(R,cR,P_1,P_2)$-sensitive if for any $p,q\in \mathbb{R}^d$
- if $\|p-q\|\le R$, then $\Pr_H[h(q)=h(p)]\ge P_1$
- if $\|p-q\|\ge cR$, then $\Pr_H[h(q)=h(p)]\le P_2$
- $P_1>P_2$
- Algorithm based on an LSH family (code sketch below):
- Construct $g_i(x)=(h_{i,1}(x),h_{i,2}(x),\dots,h_{i,k}(x))$ for $1\le i\le L$, where all $h_{i,j}$ are drawn i.i.d. from $H$.
- For each $i$, check the elements in the bucket $g_i(q)$ and test whether each is a $cR$-near neighbor of $q$; stop after checking $2L+1$ candidates.
- If an $R$-near neighbor exists, then with probability at least $\frac{1}{2}-\frac{1}{e}$ we find a $cR$-near neighbor.
- $\rho=\frac{\log 1/P_1}{\log 1/P_2}$, $k=\log_{1/P_2}(n)$, $L=n^\rho$
- Proof
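A rough Python sketch of the bucketing scheme, using the p-stable (Gaussian-projection) hash family $h(x)=\lfloor (a^\top x+b)/w\rfloor$ for Euclidean distance; the class name and the values of `k`, `L`, `w` are illustrative rather than the ones from the analysis above:

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """Sketch of the (c, R)-NN scheme: L tables of k concatenated p-stable hashes."""

    def __init__(self, dim, k=8, L=10, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal((L, k, dim))      # Gaussian projections
        self.b = rng.uniform(0, w, size=(L, k))        # random offsets
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, x):
        # g_i(x) = (h_{i,1}(x), ..., h_{i,k}(x)), one tuple per table
        return [tuple(np.floor((A @ x + b) / self.w).astype(int))
                for A, b in zip(self.a, self.b)]

    def insert(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, q, data, c, R, max_checks=None):
        limit = max_checks or 2 * len(self.tables) + 1   # the 2L + 1 budget
        checks = 0
        for table, key in zip(self.tables, self._keys(q)):
            for idx in table[key]:
                if np.linalg.norm(data[idx] - q) <= c * R:
                    return idx                            # found a cR-near neighbor
                checks += 1
                if checks >= limit:
                    return None                           # give up after the budget
        return None

# Toy usage on random data; the query point is a small perturbation of data[0].
data = np.random.randn(1000, 20)
lsh = EuclideanLSH(dim=20)
for i, x in enumerate(data):
    lsh.insert(i, x)
print(lsh.query(data[0] + 0.01, data, c=2.0, R=0.5))
```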
Metric Learning
- Map each $x_i$ to $f(x_i)$
- Hard version (compare the labels of its neighbors) vs. soft version
- Neighborhood Component Analysis (NCA)
- $p_{i,j}\propto \exp(-\|f(x_i)-f(x_j)\|^2)$
- maximize $\sum_{i}\sum_{j\in C_i}p_{i,j}$
- LMNN: $L=\max(0,\|f(x)-f(x^+)\|_2-\|f(x)-f(x^-)\|_2+r)$
- $x^+,x^-$ are the worst cases (hardest positive and hardest negative examples).
- $r$ is the margin (see the sketch below).
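A small numpy sketch of the two objectives above, evaluating the NCA soft-neighbor score and the LMNN-style hinge on fixed embeddings (gradient-based training of $f$ is omitted; all names are mine):

```python
import numpy as np

def nca_objective(Z, labels):
    """NCA objective for an embedding Z (n, d): sum_i sum_{j in C_i} p_{ij},
    with p_{ij} proportional to exp(-||z_i - z_j||^2) and p_{ii} = 0."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                          # exclude j = i
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    return (P * same).sum()

def triplet_hinge(z, z_pos, z_neg, margin=1.0):
    """LMNN-style hinge: max(0, ||z - z+|| - ||z - z-|| + r)."""
    return max(0.0, np.linalg.norm(z - z_pos) - np.linalg.norm(z - z_neg) + margin)

Z = np.random.randn(6, 2)
labels = np.array([0, 0, 0, 1, 1, 1])
print(nca_objective(Z, labels), triplet_hinge(Z[0], Z[1], Z[3]))
```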
Spectral Clustering
- K-means
- Spectral graph clustering
- Graph Laplacian: $L=D-A$, where $A$ is the similarity (adjacency) matrix and $D$ is the diagonal degree matrix.
- #zero eigenvalues = #connected components
- The eigenvectors of the $k$ smallest eigenvalues give a partition into $k$ clusters: run $k$-means on the rows of the $n\times k$ eigenvector matrix.
- Minimizing the ratio cut can be relaxed to finding the $k$ smallest eigenvectors, which is exactly the graph-Laplacian procedure above (sketched in code below).
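A compact sketch of unnormalized spectral clustering, assuming `scipy` and `scikit-learn` are available (the toy similarity matrix is my own example):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Unnormalized spectral clustering: L = D - A, take the eigenvectors of
    the k smallest eigenvalues, then run k-means on the rows."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    _, vecs = eigh(L, subset_by_index=[0, k - 1])   # k smallest eigenpairs
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs)

# Toy similarity matrix: two blocks of mutually similar points.
A = np.block([[np.ones((3, 3)), np.zeros((3, 3))],
              [np.zeros((3, 3)), np.ones((3, 3))]])
np.fill_diagonal(A, 0)
print(spectral_clustering(A, k=2))
```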
SimCLR*
- Intelligence is positioning
- InfoNCE loss: $L(q,p_1,\{p_i\}_{i=2}^N)=-\log \frac{\exp(-\|f(q)-f(p_1)\|^2/(2\tau))}{\sum_{i=1}^{N}\exp(-\|f(q)-f(p_i)\|^2/(2\tau))}$ (a numerical sketch appears after this list)
- Learn $Z=f(x)$: map the original data points into a space where semantic similarity is captured naturally.
- Reproducing kernel Hilbert space: $k(f(x_1),f(x_2))=\langle\phi(f(x_1)),\phi(f(x_2))\rangle_H$. The inner product is a kernel function.
- Usually $K_{Z,i,j}=k(Z_i-Z_j)$, where $k$ is Gaussian.
- We are given a similarity matrix $\pi$ for the dataset in advance, where $\pi_{i,j}$ is the similarity between data points $i$ and $j$. We want the similarity matrix $K_Z$ of $f(x)$ to match the manually given one for $x$. Letting $W_X\sim \pi$ and $W_Z\sim K_Z$, we want these two samples to agree.
- Minimize the cross-entropy loss: $H_{\pi}^{k}(Z)=-\mathbb{E}_{W_X\sim P(\cdot ;\pi)}[\log P(W_Z=W_X;K_Z)]$
- Equivalent to the InfoNCE loss: restricting attention to row $i$, the InfoNCE loss is $-\log P(W_{Z,i}=W_{X,i})$. The given pair $q,p_1$ is sampled from the similarity matrix $\pi$, which corresponds to $W_X\sim P(\cdot;\pi)$.
- Equivalent to spectral clustering: equivalent to $\arg\min_Z \mathrm{tr}(Z^\top L^*Z)$
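A small numpy sketch that evaluates the InfoNCE loss above with squared-Euclidean similarity; the positive is placed in row 0, and all data are synthetic:

```python
import numpy as np

def info_nce(f_q, f_p, tau=0.1):
    """InfoNCE with squared-Euclidean similarity, matching the formula above.

    f_q -- embedding of the query q, shape (d,)
    f_p -- embeddings of p_1 ... p_N, shape (N, d); row 0 is the positive p_1.
    """
    d2 = ((f_p - f_q) ** 2).sum(axis=1)   # ||f(q) - f(p_i)||^2
    logits = -d2 / (2 * tau)
    logits -= logits.max()                # subtract the max for numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

f_q = np.random.randn(8)
f_p = np.vstack([f_q + 0.05 * np.random.randn(8),   # positive: a nearby view of q
                 np.random.randn(4, 8)])            # negatives: unrelated points
print(info_nce(f_q, f_p))
```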
t-SNE
- Data visualization: map data into a low-dimensional space (2D)
- SNE: same idea as NCA, want $q_{i,j}\propto \exp(-\|f(x_i)-f(x_j)\|^2/(2\sigma^2))$ to be similar to $p_{i,j}\propto \exp(-\|x_i-x_j\|^2/(2\sigma_i^2))$
- Cross-entropy loss $-p_{i,j}\cdot \log \frac{q_{i,j}}{p_{i,j}}$
- Crowding problem
- Solved by t-SNE: let $q_{i,j}\propto (1+\|y_j-y_i\|^2)^{-1}$ (Student t-distribution)
- The power $-1$ gives a heavier tail than the Gaussian, so the crowding problem is relieved: moderately distant pairs get pushed further apart in the embedding (see the sketch below).
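A minimal numpy sketch of the two similarity definitions and the KL objective; for simplicity it uses one global $\sigma$ instead of the per-point $\sigma_i$ chosen via perplexity, and the 2-D embedding `Y` is random rather than optimized:

```python
import numpy as np

def pairwise_sq_dists(X):
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

def tsne_similarities(X, Y, sigma=1.0):
    """Gaussian p_{ij} in the original space, Student-t q_{ij} in the embedding."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    for M in (P, Q):
        np.fill_diagonal(M, 0.0)   # no self-similarity
    P /= P.sum()
    Q /= Q.sum()
    return P, Q

def kl_loss(P, Q, eps=1e-12):
    """KL(P || Q) = sum_{ij} p_{ij} log(p_{ij} / q_{ij})."""
    mask = P > 0
    return (P[mask] * np.log(P[mask] / (Q[mask] + eps))).sum()

X = np.random.randn(20, 10)   # high-dimensional data
Y = np.random.randn(20, 2)    # 2-D embedding (would be optimized by gradient descent)
P, Q = tsne_similarities(X, Y)
print(kl_loss(P, Q))
```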