Count-based Approach

Term-document matrix: Document vectors
Two ways to extract information from the matrix:
- Column-wise: a document is represented by a |V|-dim vector (V: vocabulary)
  - Widely used in information retrieval:
    - find similar documents: two documents that are similar will tend to have similar words
    - find documents close to a query: consider a query as a document
- Row-wise: a word is represented by a |D|-dim vector (D: document set); a code sketch of both views follows below
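A minimal sketch of both views, using scikit-learn's CountVectorizer and cosine similarity (the tooling and the toy documents are my own choices, not part of the notes):

```python
# Term-document matrix: column-wise (document vectors) and row-wise (word vectors).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "sugar is used in jam and fruit cake",
    "fruit jam contains sugar",
    "the bank approved the loan",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)            # |D| x |V|: each row is a document vector
term_doc = X.T.tocsr()                 # |V| x |D|: each row is a word vector

# Column-wise view: find similar documents
print(cosine_similarity(X)[0])         # similarity of doc 0 to every document

# Treat a query as a document and find the closest documents
q = vec.transform(["fruit jam"])
print(cosine_similarity(q, X)[0])

# Row-wise view: words as |D|-dim vectors
i, j = vec.vocabulary_["sugar"], vec.vocabulary_["jam"]
print(cosine_similarity(term_doc[i], term_doc[j])[0, 0])
```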

Term-term matrix
We have seen it before (co-occurrence vectors): count how many times a word u appears with a word v.
- Raw frequency is bad
  - Not all contextual words are equally important: of, a, ... vs. sugar, jam, fruit, ...
  - Which words are important, which ones are not?
    - Infrequent words are more important than frequent ones (examples?)
    - Correlated words are more important than uncorrelated ones (examples?)
    - ...
→ weighting schemes (TF-IDF, PMI, ...)
Weighting terms: TF-IDF (for the term-document matrix)
- tf (term frequency): $tf(t,d) = \log_{10}(1 + count(t,d))$
- idf (inverse document frequency): popular terms (terms that appear in many documents) are down-weighted: $idf(t) = \log_{10}\frac{N}{df(t)}$
- TF-IDF: $tf\text{-}idf(t,d) = tf(t,d) \cdot idf(t)$
- Many word pairs should have counts > 0, but their corresponding matrix entries are 0 because of a lack of data (data sparsity)
→ Laplace smoothing: add 1 to every entry (pseudocount); see the sketch below
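A minimal numpy sketch of the tf-idf weighting above; the term-document counts are made up, and the Laplace pseudocount is shown only as a comment:

```python
import numpy as np

# Toy term-document counts: rows are terms, columns are documents.
counts = np.array([
    [3, 0, 1],   # "sugar"
    [2, 1, 0],   # "jam"
    [5, 4, 6],   # "of" (appears in every document)
])
# counts = counts + 1  # Laplace smoothing: add a pseudocount of 1 to every entry
N = counts.shape[1]                       # number of documents

tf = np.log10(1 + counts)                 # tf(t, d) = log10(1 + count(t, d))
df = np.count_nonzero(counts, axis=1)     # df(t): number of documents containing t
idf = np.log10(N / df)                    # idf(t) = log10(N / df(t))
tfidf = tf * idf[:, None]                 # tf-idf(t, d) = tf(t, d) * idf(t)

print(tfidf.round(3))                     # "of" gets weight 0: it occurs in all documents
```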
| Pros | Cons |
|---|---|
| Simple and intuitive | Word/document vectors are sparse (dimensions are \|V\|, the vocabulary size, or \|D\|, the number of documents, often from 2k to 10k) → difficult for machine learning algorithms |
| Dimensions are meaningful (e.g., each dim is a document / a contextual word) → easy to debug and interpret (think about Explainable AI) | How to represent word meaning in a specific context? |
From sparse vectors to dense vectors:
- Employ dimensionality reduction (e.g., latent semantic analysis - LSA); see the sketch below
- Use a different approach: prediction (coming up next)
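How going from sparse to dense vectors could look with LSA; the sketch below uses scikit-learn's TruncatedSVD on a random stand-in matrix (both choices are mine, not from the notes):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
tfidf = rng.random((2000, 50))            # stand-in for a real |V| x |D| weighted matrix

lsa = TruncatedSVD(n_components=20)       # keep 20 latent dimensions
dense_terms = lsa.fit_transform(tfidf)    # each term is now a dense 20-dim vector
print(dense_terms.shape)                  # (2000, 20)
```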
Prediction-based Approach
Introduction to ANNs used to learn word embeddings
- two major count-based methods:
  - term-document matrix
  - term-term matrix
- raw frequency is bad
  - use weighting schemes to "correct" the counts
  - use smoothing to account for "unseen" events
Formalisation
Assumptions:
● each word $w \in V$ is represented by a vector $v \in \mathbb{R}^d$ (d is often smaller than 3k)
● there is a mechanism to compute the probability $\Pr(w \mid u_1, u_2, \ldots, u_l)$ of the event that a target word w appears in a context $(u_1, u_2, \ldots, u_l)$
Task: find a vector v for each word w such that those probabilities are as high as possible for each w and its context $(u_1, u_2, \ldots, u_l)$.
We use a neural network with parameters θ to compute these probabilities; the network is trained by minimizing the cross-entropy loss
$$L(\theta) = -\sum_{(w, u_1, \ldots, u_l) \in D_{\text{train}}} \log \Pr(w \mid u_1, \ldots, u_l)$$
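A minimal sketch of this loss, assuming a hypothetical function `probs(w, context)` that stands in for the network computing $\Pr(w \mid u_1, \ldots, u_l)$:

```python
import numpy as np

def cross_entropy_loss(train_pairs, probs):
    # train_pairs: iterable of (target_word, context_words) drawn from D_train
    # probs(w, context): Pr(w | u_1, ..., u_l) as computed by the network (hypothetical)
    return -sum(np.log(probs(w, ctx)) for w, ctx in train_pairs)
```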
Bengio et al.'s neural probabilistic language model

CBOW:
How the CBOW model works (a code sketch follows this list)
- Input layer:
  - Given a target word $w_t$, take the m words before and the m words after it as context words.
  - Look up the corresponding word vectors of these context words in the embedding matrix $C$.
- Projection layer:
  - Average the context word vectors: $y = \text{average}(w_{t-1}, \ldots, w_{t-m}, w_{t+1}, \ldots, w_{t+m})$
  - This step has no non-linearity (e.g., ReLU or tanh); it is just a simple average.
- Output layer:
  - Apply the linear transformation given by the output matrix $W$ to the averaged vector $y$, and use softmax to predict the centre word $w_t$: $P(w_t \mid w_{t-1}, \ldots, w_{t-m}, w_{t+1}, \ldots, w_{t+m}) = \text{softmax}(Wy)$
  - The softmax output is a probability distribution over the vocabulary, giving the probability of each word being the centre word.
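A rough PyTorch sketch of this forward pass (embedding lookup, plain averaging, linear layer); the vocabulary size, dimension, and window are made-up values:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size=10000, dim=300):
        super().__init__()
        self.C = nn.Embedding(vocab_size, dim)   # input embedding matrix C
        self.W = nn.Linear(dim, vocab_size)      # output matrix W

    def forward(self, context_ids):              # context_ids: (batch, 2m)
        y = self.C(context_ids).mean(dim=1)      # projection layer: plain average, no non-linearity
        return self.W(y)                         # logits; softmax is applied inside the loss

model = CBOW()
context = torch.randint(0, 10000, (4, 6))        # batch of 4 examples, window m = 3 on each side
target = torch.randint(0, 10000, (4,))           # the centre words w_t
loss = nn.functional.cross_entropy(model(context), target)
```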
Properties of CBOW
- Context to target: CBOW predicts the centre word from its context words (the opposite of Skip-gram, which predicts the surrounding context words from the centre word).
- Computationally efficient: because the context vectors are averaged, CBOW is usually faster to train than Skip-gram, especially on large corpora.
- Suited to large corpora: CBOW tends to be more stable on large corpora and works well for learning vectors over large vocabularies.
Impact of CBOW on NLP tasks
- Word-vector learning: CBOW provides an efficient way to learn word vectors and influenced later models such as GloVe and FastText.
- Semantic computation: the learned vectors can be used to measure semantic similarity between words, e.g., with cosine similarity.
- Downstream applications: word vectors trained with CBOW can be used in NLP tasks such as text classification, sentiment analysis, and machine translation.




| | CBOW | Skip-gram |
|---|---|---|
| Objective | Predict the centre word from the context words | Predict the context words from the centre word |
| Training speed | Fast | Slower (each centre word must predict several context words) |
| Typical setting | Large datasets / large corpora | Small datasets / small corpora |
| Strength | Better vectors for frequent words | Better vectors for infrequent words |
word2vec
- Skip-gram model
- "a baby step in Deep Learning but a giant leap towards Natural Language Processing"
- can capture linear relational meanings (i.e., analogies):
  - king - man + woman ≈ queen (see the sketch below)
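The analogy can be checked with pretrained vectors; the sketch below assumes gensim and its downloadable "word2vec-google-news-300" vectors, neither of which is mentioned in the notes:

```python
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")   # pretrained word2vec vectors (assumed available)
# king - man + woman: look for the word closest to this combination
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is expected near the top if the vectors capture the analogy
```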


Problems: biases (gender, ethnic, ...)
- Word embeddings are learned from data → they also capture biases implicitly appearing in the data
- Gender bias:
- "computer_programmer" is closer to "man" than "woman"
- "homemaker" is closer to "woman" than "man"
- Ethnic bias:
- African-American names are associated with unpleasant words (more than European-American names)
- ...
→ Debiasing embeddings is a hot (and very needed) research topic
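One common ingredient of debiasing methods (e.g., the neutralize step of hard debiasing) is removing a word vector's component along a bias direction; a self-contained numpy sketch with made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in ["he", "she", "computer_programmer"]}  # made-up vectors

def neutralize(w_vec, bias_dir):
    bias_dir = bias_dir / np.linalg.norm(bias_dir)
    return w_vec - (w_vec @ bias_dir) * bias_dir   # drop the projection onto the bias direction

gender_dir = emb["he"] - emb["she"]                # a crude "gender direction"
debiased = neutralize(emb["computer_programmer"], gender_dir)
print(debiased @ (gender_dir / np.linalg.norm(gender_dir)))   # ≈ 0 after neutralizing
```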
Dealing with unknown words
- Many words are not in dictionaries
- New words are invented every day
- Solution 1: using a special token #UNK# for all unknown words
- Solution 2: using characters/sub-words instead of words (see the sketch below)
  - Characters (c-o-m-p-u-t-e-r instead of computer)
  - Subwords (com-omp-mpu-put-ute-ter instead of computer)
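A tiny sketch of the two segmentations above (plain characters and character trigrams, matching the "computer" example):

```python
def characters(word):
    return list(word)                                            # c-o-m-p-u-t-e-r

def char_ngrams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]     # overlapping character n-grams

print(characters("computer"))
print(char_ngrams("computer"))   # ['com', 'omp', 'mpu', 'put', 'ute', 'ter']
```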
Word embeddings in a specific context
- The meaning of a word standing alone can be different than its meaning in a specific context
- He lost all of his money when the bank failed.
- He stood on the bank of Amstel river and thought about his future.
- Solution: w|c = f(w, c)
  - Solution 1: f is continuous w.r.t. c (contextual embeddings, e.g., ELMO, BERT - next week); see the sketch after this list
  - Solution 2: f is discrete w.r.t. c (e.g., word sense disambiguation - coming up in the next video)
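A rough sketch of Solution 1, showing that "bank" gets different contextual vectors in the two sentences above; it assumes the HuggingFace transformers library and the "bert-base-uncased" checkpoint, neither of which is specified in the notes:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)
    idx = enc["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]                                   # contextual vector of "bank"

v1 = bank_vector("He lost all of his money when the bank failed.")
v2 = bank_vector("He stood on the bank of Amstel river and thought about his future.")
print(torch.cosine_similarity(v1, v2, dim=0))            # noticeably below 1.0
```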
Summary
- Prediction-based approaches require neural network models, which are not as intuitive as count-based ones
- Low-dimensional vectors (about 200-400 dimensions)
- Dimensions are not easy to interpret
- Robust performance for NLP tasks
Extension: the evolution of word embeddings
- Static embeddings:
  - Word2Vec, GloVe, FastText
  - Drawback: each word has a single fixed vector, which cannot change its meaning with the context (e.g., the different senses of "bank").
- Contextualized embeddings:
  - ELMo, BERT, GPT
  - Address polysemy: the word vector is adjusted dynamically according to the context.
Contextualised Word Embedding
Static map: f is trained on a large corpus, based on the co-occurrence of words