Word Embeddings

Count-based Approach

Term-document matrix: Document vectors

Two ways to extract information from the matrix:

  1. Column-wise: a document is represented by a |V|-dim vector (V: vocabulary)

Widely used in information retrieval:

  • find similar documents

    • Two documents that are similar will tend to have similar words
  • find documents close to a query (see the sketch after this list)

    • Consider a query as a document
  2. Row-wise: a word is represented by a |D|-dim vector (D: document set)
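
A minimal sketch of the column-wise view (toy corpus and query invented for illustration, plain NumPy, not lecture code): build the term-document matrix, treat the query as a short document, and rank documents by cosine similarity.

```python
import numpy as np

# Toy corpus: each document is one string (invented for illustration)
docs = [
    "sugar is sweet and fruit jam needs sugar",
    "the bank approved the loan",
    "fruit jam and bread for breakfast",
]

# |V| x |D| term-document matrix of raw counts
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        M[index[w], j] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Treat the query as a (very short) document and compare it to every column
query = "fruit jam"
q = np.zeros(len(vocab))
for w in query.split():
    if w in index:
        q[index[w]] += 1

scores = [cosine(q, M[:, j]) for j in range(len(docs))]
print(scores)  # documents that share words with the query score highest
```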

Term-term matrix

We have seen it before (co-occurrence vectors): count how many times a word u appears with a word v.

  • raw frequency is bad
    • Not all contextual words are equally important: of, a, ... vs. sugar, jam, fruit...
    • Which words are important, which ones are not?
      • infrequent words are more important than frequent ones (examples?)
      • correlated words are more important than uncorrelated ones (examples?)
      • ...

→ weighting schemes (TF-IDF, PMI, ...)
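
Before turning to TF-IDF, here is a minimal sketch of the term-term view plus one of the weighting schemes just mentioned (toy sentences, window size 2, and PPMI rather than plain PMI so negative values are clipped to 0; all choices are illustrative):

```python
import numpy as np
from collections import defaultdict

corpus = ["i like sweet fruit jam", "i like natural fruit juice"]  # toy sentences
window = 2  # number of context words on each side

# Count how many times a word u appears with a word v inside the window
counts = defaultdict(float)
for sent in corpus:
    toks = sent.split()
    for i, u in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if i != j:
                counts[(u, toks[j])] += 1

vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for (u, v), c in counts.items():
    C[idx[u], idx[v]] = c

# PPMI(u, v) = max(0, log2( P(u, v) / (P(u) * P(v)) ))
total = C.sum()
P_uv = C / total
P_u = C.sum(axis=1, keepdims=True) / total
P_v = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(P_uv / (P_u * P_v))
ppmi = np.nan_to_num(np.maximum(pmi, 0), neginf=0.0)  # pairs with zero counts end up as 0
print(ppmi[idx["fruit"], idx["jam"]])
```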

Weighting terms: TF-IDF (for term-document matrix)
  • tf (frequency count):
    $tf(t,d) = \log_{10}(1 + \text{count}(t,d))$

  • idf (inverse document frequency): popular terms (terms that appear in many documents) are down-weighted
    $idf(t) = \log_{10}\frac{N}{df(t)}$ (N: total number of documents)

  • TF-IDF:
    $tf\text{-}idf(t,d) = tf(t,d) \cdot idf(t)$

  • Many word pairs should have counts > 0, but their matrix entries are 0 because of a lack of data (data sparsity)

    → Laplace smoothing: adding 1 to every entry (pseudocount)
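
Following the formulas above, a minimal sketch (toy counts invented for illustration) of turning a term-document count matrix into TF-IDF weights:

```python
import numpy as np

# Raw counts: |V| x |D| term-document matrix (toy numbers)
counts = np.array([
    [10, 0, 3],   # e.g. "sugar"
    [ 5, 5, 5],   # e.g. "the" -- appears in every document
    [ 0, 7, 0],   # e.g. "loan"
])
N = counts.shape[1]                      # number of documents
tf = np.log10(1 + counts)                # tf(t, d) = log10(1 + count(t, d))
df = (counts > 0).sum(axis=1)            # df(t): number of documents containing t
idf = np.log10(N / df)                   # idf(t) = log10(N / df(t))
tfidf = tf * idf[:, None]                # tf-idf(t, d) = tf(t, d) * idf(t)
print(tfidf)
# The row for "the" becomes all zeros: idf = log10(3/3) = 0, so popular terms are down-weighted.
```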

Pros:
  • Simple and intuitive
  • Dimensions are meaningful (e.g., each dim is a document / a contextual word) → easy to debug and interpret (think about Explainable AI)

Cons:
  • Word/document vectors are sparse (dims are |V|, the vocabulary size, or |D|, the number of documents, often from 2k to 10k) → difficult for machine learning algorithms
  • How to represent word meaning in a specific context?

From sparse vectors to dense vectors:
  • Employ dimensionality reduction (e.g., latent semantic analysis - LSA; see the sketch below)
  • Use a different approach: prediction (coming up next)
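
A minimal sketch of the LSA route (a random toy matrix stands in for a real term-document matrix; k is chosen arbitrarily): a truncated SVD turns sparse |V|-dim and |D|-dim vectors into dense k-dim ones.

```python
import numpy as np

# A (toy) |V| x |D| term-document matrix; in practice this is large and sparse
M = np.random.default_rng(0).poisson(0.3, size=(1000, 200)).astype(float)

# LSA: truncated SVD keeps only the top-k latent dimensions
k = 50
U, S, Vt = np.linalg.svd(M, full_matrices=False)
word_vectors = U[:, :k] * S[:k]      # dense k-dim vectors for words (rows)
doc_vectors = Vt[:k, :].T * S[:k]    # dense k-dim vectors for documents (columns)
print(word_vectors.shape, doc_vectors.shape)  # (1000, 50) (200, 50)
```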

Prediction-based Approach

Introduction to ANNs used to learn word embeddings

  • two major count-based methods:
    • term-document matrix
    • term-term matrix
  • Raw frequency is bad
    • use weighting schemes to "correct" the counts
    • use smoothing to account for "unseen" events

Formalisation

Assumptions:

● each word w ∈ V is represented by a vector v ∈ R^d (d is often smaller than 3k)

● there is a mechanism to compute the probability Pr(w | u_1, u_2, ..., u_l) of the event that a target word w appears in a context (u_1, u_2, ..., u_l).

Task: find a vector v for each word w such that those probabilities are as high as possible for each w and its context (u_1, u_2, ..., u_l).

We use a neural network with parameters θ to compute this probability; the network is trained by minimizing the cross-entropy loss:

$L(\theta) = -\sum_{(w, u_1, \ldots, u_l) \in D_{\text{train}}} \log \Pr(w \mid u_1, \ldots, u_l)$
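
A minimal numerical sketch of this loss (the model here is a hypothetical stand-in: random embeddings averaged over the context and scored against an output matrix; a real system learns these parameters):

```python
import numpy as np

def log_prob(model, context, target):
    """Hypothetical model: returns log Pr(target | context) under parameters theta."""
    logits = model["W"] @ model["embed"][context].mean(axis=0)  # score every word in V
    logits -= logits.max()                                      # numerical stability
    return logits[target] - np.log(np.exp(logits).sum())        # log softmax at the target

# D_train: (target word, context words) pairs as vocabulary indices (toy data)
D_train = [(3, [1, 2, 4, 5]), (0, [2, 3])]
rng = np.random.default_rng(0)
model = {"embed": rng.normal(size=(10, 8)), "W": rng.normal(size=(10, 8))}  # |V|=10, d=8

# L(theta) = - sum over training examples of log Pr(w | u_1, ..., u_l)
loss = -sum(log_prob(model, ctx, w) for w, ctx in D_train)
print(loss)
```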

Bengio et al. (2003): neural probabilistic language model

CBOW:

How the CBOW model works
  1. Input layer

    • Given a target word $w_t$, take the m words before it and the m words after it as context words.
    • The corresponding word vectors are looked up in the embedding matrix C.
  2. Projection layer

    • Average the context word vectors: $y = \text{average}(w_{t-1}, ..., w_{t-m}, w_{t+1}, ..., w_{t+m})$
    • There is no non-linear transformation (e.g., ReLU or tanh) in this step, just a simple average.
  3. Output layer

    • Apply the linear transformation given by the output matrix W to the averaged vector y, and use a softmax to predict the centre word $w_t$: $P(w_t \mid w_{t-1}, ..., w_{t-m}, w_{t+1}, ..., w_{t+m}) = \text{softmax}(Wy)$
    • The softmax output is a probability distribution over the vocabulary, giving the probability of each word being the centre word (a code sketch follows below).
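
A minimal CBOW sketch in PyTorch that mirrors the three layers above (layer names, sizes, and the random toy batch are illustrative, not the original word2vec implementation):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)          # embedding matrix C
        self.out = nn.Linear(dim, vocab_size, bias=False)   # output matrix W

    def forward(self, context_ids):
        # context_ids: (batch, 2m) indices of the m words before and after the target
        y = self.embed(context_ids).mean(dim=1)  # projection layer: plain average, no non-linearity
        return self.out(y)                       # logits Wy; softmax is applied inside the loss

model = CBOW(vocab_size=5000, dim=100)
context = torch.randint(0, 5000, (32, 4))        # toy batch of 32, window m = 2
target = torch.randint(0, 5000, (32,))
loss = nn.CrossEntropyLoss()(model(context), target)  # = -log softmax(Wy)[target]
loss.backward()
```

Note that `CrossEntropyLoss` combines the softmax of the output layer with the negative log-likelihood, which matches the loss L(θ) defined above.
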
Properties of CBOW
  1. Context to target word: it predicts the centre word from the context words (the opposite of Skip-gram, which predicts the surrounding context words from the centre word).
  2. Computationally efficient: because the context vectors are averaged, CBOW is usually faster to train than Skip-gram, especially on large corpora.
  3. Suited to large corpora: CBOW tends to behave more stably on large corpora and is a good fit for learning vectors over large vocabularies.
Impact of CBOW on NLP tasks
  • Word vector learning: CBOW provides an efficient way to learn word vectors and influenced later models such as GloVe and FastText.
  • Semantic computation: the learned vectors can be used to measure semantic similarity between words, e.g., with cosine similarity.
  • Downstream applications: CBOW-trained word vectors can be used in NLP tasks such as text classification, sentiment analysis, and machine translation.
CBOW vs. Skip-gram
  • Objective: CBOW predicts the centre word from the context words; Skip-gram predicts the context words from the centre word.
  • Training speed: CBOW is faster; Skip-gram is slower, because each centre word has to predict several context words.
  • Typical use: CBOW suits large datasets and large corpora; Skip-gram suits smaller ones.
  • Quality: CBOW is good at learning vectors for frequent words; Skip-gram learns better vectors for rare words.
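
In practice both variants come ready-made in libraries such as gensim; a minimal sketch (toy sentences; parameter names assume gensim 4.x), where `sg=0` selects CBOW and `sg=1` selects Skip-gram:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [["fruit", "jam", "is", "sweet"],
             ["the", "bank", "approved", "the", "loan"]]

cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)   # Skip-gram

print(cbow.wv["fruit"].shape)                    # (100,)
print(skipgram.wv.most_similar("fruit", topn=2))
```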

word2vec

  • Skip-gram model
  • "a baby step in Deep Learning but a giant leap towards Natural Language Processing"
  • can capture linear relational meanings (i.e., analogies), as sketched below:
    • king - man + woman = queen
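
A minimal sketch of that analogy test (the embeddings here are random toy vectors; with real pre-trained word2vec vectors the arithmetic below returns "queen"):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# emb: word -> vector, e.g. loaded from pre-trained word2vec (random toy vectors here)
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["king", "man", "woman", "queen", "apple"]}

# analogy: king - man + woman ≈ ?
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # with real embeddings this is "queen"; with random toy vectors it need not be
```
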
Problems: biases (gender, ethnic, ...)
  • Word embeddings are learned from data → they also capture biases implicitly appearing in the data
  • Gender bias:
    • "computer_programmer" is closer to "man" than "woman"
    • "homemaker" is closer to "woman" than "man"
  • Ethnic bias:
    • African-American names are associated with unpleasant words (more than European-American names)
  • ...
    → Debiasing embeddings is a hot (and very needed) research topic

Dealing with unknown words

  • Many words are not in dictionaries
  • New words are invented everyday
  • Solution 1: using a special token #UNK# for all unknown words
  • Solution 2: using characters/sub-words instead of words
    • Characters (c-o-m-p-u-t-e-r instead of computer)
    • Subwords (com-omp-mpu-put-ute-ter instead of computer)
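
A minimal sketch of Solution 2 with character n-grams (FastText-style boundary markers; n = 3 here, matching the example above):

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("computer"))
# ['<co', 'com', 'omp', 'mpu', 'put', 'ute', 'ter', 'er>']
# An unknown word's embedding can then be composed (e.g., averaged) from its n-gram embeddings.
```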

Word embeddings in a specific context

  • The meaning of a word standing alone can be different from its meaning in a specific context
    • He lost all of his money when the bank failed.
    • He stood on the bank of Amstel river and thought about his future.
  • Solution: w|c = f(w, c)
  • Solution 1: f is continuous w.r.t. c (contextual embeddings, e.g., ELMo, BERT - next week)
  • Solution 2: f is discrete w.r.t. c (e.g., word sense disambiguation - coming up in the next video)

Summary

  • Prediction-based approaches require neural network models, which are not as intuitive as count-based ones
  • Low dimensional vectors (about 200-400 dimensions)
    • Dimensions are not easy to interpret
  • Robust performance for NLP tasks

Extension: the evolution of word embeddings

  1. Static embeddings

    • Word2Vec, GloVe, FastText
    • Drawback: each word has a single fixed vector, so its meaning cannot change with the context (e.g., the different senses of "bank").
  2. Contextualized embeddings

    • ELMo, BERT, GPT
    • Solve the polysemy problem: the word vectors are adjusted dynamically according to the context.

Contextualised Word Embedding

Static map

f is trained on a large corpus, based on the co-occurrence of words
