- [N-Grams: Overview](#N-Grams: Overview)
- [N-grams and Probabilities](#N-grams and Probabilities)
- [Sequence Probabilities](#Sequence Probabilities)
- [Probability of a sequence](#Probability of a sequence)
- [Sequence probability shortcomings](#Sequence probability shortcomings)
- [Approximation by N gram probabilities](#Approximation by N gram probabilities)
- Quiz
- [Starting and Ending Sentences](#Starting and Ending Sentences)
- [Start of sentence token \<s\>](#Start of sentence token <s>)
- [End of sentence token \</s\> -motivation](#End of sentence token </s> -motivation)
- [End of sentence token \</s\> -solution](#End of sentence token </s> -solution)
- Example-bigram
- Quiz
- [The N-gram Language Model](#The N-gram Language Model)
- [Count matrix](#Count matrix)
- [Probability matrix](#Probability matrix)
- [Language model](#Language model)
- [Log probability](#Log probability)
- [Generative Language model](#Generative Language model)
- [Language Model Evaluation](#Language Model Evaluation)
- [Test data](#Test data)
- Perplexity
- [Perplexity for bigram models](#Perplexity for bigram models)
- [Log perplexity](#Log perplexity)
- Example
- [Out of Vocabulary Words](#Out of Vocabulary Words)
- [Out of vocabulary words](#Out of vocabulary words)
- [Using \<UNK\> in corpus](#Using <UNK> in corpus)
- [How to create vocabulary V](#How to create vocabulary V)
- Quiz
- Smoothing
- [Missing N-grams in training corpus](#Missing N-grams in training corpus)
- Smoothing
- Backoff
- Interpolation
- Quiz
N-Grams: Overview
● Create language model (LM) from text corpus to
○ Estimate probability of word sequences
○ Estimate probability of a word following a sequence of words
● Apply this concept to autocomplete a sentence with most likely suggestions
Language models are also used in:
○ Speech recognition
○ Spelling correction
○ Augmentative and alternative communication (AAC)
● Process text corpus to N-gram language model
● Out of vocabulary words
● Smoothing for previously unseen N-grams
● Language model evaluation
N-grams and Probabilities
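An N-gram is a sequence of N consecutive words, and N-gram models are statistical models of text built on such sequences. Each word in the sequence is called a "gram", and the sequence captures local context in the text.
Some examples for different values of N:
● Unigram (1-gram): N=1, a single word. For example, "cat" is a unigram.
● Bigram (2-gram): N=2, two consecutive words. For example, "cat sat" is a bigram.
● Trigram (3-gram): N=3, three consecutive words. For example, "cat sat on" is a trigram.
N-gram models matter in language modeling because they can predict the probability of the next word in a text sequence. In a bigram model, for example, the model predicts the probability of the second word given the first. Such models are central to applications like spell checking, grammatical analysis, machine translation, and speech recognition.
N-gram models also have limitations. When N is large, the model runs into data sparsity, because many word sequences occur only rarely, or never, in the training data. N-gram models also generally ignore context beyond local word order, such as syntax and semantics.
The key point is that N-grams offer a simple but effective way to capture and represent local dependencies in text.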
Corpus: I am happy because I am learning
Unigrams: {I, am, happy, because, learning}
Bigrams: {I am, am happy, happy because, ...} Note that "I happy" is not a bigram, because the two words must be consecutive. "I am" occurs twice in the corpus but appears only once in the set.
Trigrams: {I am happy, am happy because, ...}
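The sets above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `ngrams` is an illustrative choice, not something defined in these notes.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "I am happy because I am learning".split()

# Sets drop duplicates, so ("I", "am") is recorded once even though it occurs twice.
unigrams = set(ngrams(corpus, 1))
bigrams = set(ngrams(corpus, 2))
trigrams = set(ngrams(corpus, 3))

print(sorted(bigrams))
```

Because only consecutive tokens are combined, `("I", "happy")` never appears in `bigrams`.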
Sequence notation
$$w_1^m=w_1w_2\cdots w_m$$
$$w_1^3=w_1w_2w_3$$
$$w_{m-2}^m=w_{m-2}w_{m-1}w_m$$
Unigram probability
Suppose the corpus is: I am happy because I am learning
Corpus size: $m=7$
For the word "I": $P(I)=\cfrac{2}{7}$
For the word "happy": $P(happy)=\cfrac{1}{7}$
The unigram probability formula is:
$$P(w)=\cfrac{C(w)}{m}$$
Bigram probability
Suppose the corpus is: I am happy because I am learning
The probability that "am" follows "I": $P(am|I)=\cfrac{C(I\space am)}{C(I)}=\cfrac{2}{2}=1$
The probability that "happy" follows "I": $P(happy|I)=\cfrac{C(I\space happy)}{C(I)}=\cfrac{0}{2}=0$
The probability that "learning" follows "am": $P(learning|am)=\cfrac{C(am\space learning)}{C(am)}=\cfrac{1}{2}$
The bigram probability formula is:
$$P(y|x)=\cfrac{C(x\space y)}{\sum_wC(x\space w)}=\cfrac{C(x\space y)}{C(x)}$$
Trigram Probability
Suppose the corpus is: I am happy because I am learning
The probability that "happy" follows the two words "I am": $P(happy|I\space am)=\cfrac{C(I\space am\space happy)}{C(I\space am)}=\cfrac{1}{2}$
The trigram probability formula is:
$$P(w_3|w_1^2)=\cfrac{C(w_1^2w_3)}{C(w_1^2)}$$
N-gram probability
$$P(w_N|w_1^{N-1})=\cfrac{C(w_1^{N-1}w_N)}{C(w_1^{N-1})}$$
Numerator: $C(w_1^{N-1}w_N)=C(w_1^{N})$
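The count-based formulas above translate directly into code. The sketch below is one possible implementation, with hypothetical helper names `ngram_counts` and `ngram_probability`; it uses exact fractions so the results can be compared with the worked examples.

```python
from collections import Counter
from fractions import Fraction

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_probability(tokens, context, word):
    """P(word | context) = C(context word) / C(context)."""
    context = tuple(context)
    numerator = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    denominator = ngram_counts(tokens, len(context))[context]
    # An unseen context has no estimate at all, so return None instead of dividing by zero.
    return Fraction(numerator, denominator) if denominator else None

corpus = "I am happy because I am learning".split()
p_am_given_i = ngram_probability(corpus, ["I"], "am")
p_learning_given_am = ngram_probability(corpus, ["am"], "learning")
p_happy_given_i_am = ngram_probability(corpus, ["I", "am"], "happy")
print(p_am_given_i, p_learning_given_am, p_happy_given_i_am)
```

The same function covers bigrams, trigrams, and longer N-grams, because only the length of `context` changes.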
"In every place of great resort the monster was the fashion. They sang of it in the cafes, ridiculed it in the papers, and represented it on the stage." (Jules Verne, Twenty Thousand Leagues under the Sea)
In the context of our corpus, what is the probability of the word "papers" following the phrase "it in the"?
Answer: 1/2
Explanation: "it in the" occurs 2 times in total, and it is followed by "papers" once.
Sequence Probabilities
Probability of a sequence
$$P(A,B,C,D)=P(A)P(B|A)P(C|A,B)P(D|A,B,C)$$
This follows from applying the definition of conditional probability repeatedly:
$$P(B|A)=\cfrac{P(A,B)}{P(A)}\Rightarrow P(A,B)=P(A)P(B|A)$$
P(the teacher drinks tea) = P(the) P(teacher|the) P(drinks|the teacher) P(tea|the teacher drinks)
Sequence probability shortcomings
The main problem: the corpus almost never contains the exact sentence we're interested in, or even its longer subsequences!
$$P(tea|the\space teacher\space drinks)=\cfrac{C(the\space teacher\space drinks\space tea)}{C(the\space teacher\space drinks)}$$
Both the numerator and the denominator counts are likely to be 0 in the corpus. Since P(the teacher drinks tea) is a product of such terms, the estimate collapses to 0 as well (or is undefined when the denominator is 0).
Approximation by N gram probabilities
$$P(tea|the\space teacher\space drinks)\approx P(tea|drinks)$$
$$\begin{aligned}P(the\space teacher\space drinks\space tea)&=P(the)P(teacher|the)P(drinks|the\space teacher)P(tea|the\space teacher\space drinks)\\ &\approx P(the)P(teacher|the)P(drinks|teacher)P(tea|drinks)\end{aligned}$$
More generally, this is the Markov assumption: only the most recent words matter (the previous N-1 words for an N-gram model).
Bigram:
$$P(w_n|w_1^{n-1})\approx P(w_n|w_{n-1})$$
N-gram:
$$P(w_n|w_1^{n-1})\approx P(w_n|w_{n-N+1}^{n-1})$$
Entire sentence modeled with bigrams:
$$P(w_1^n)\approx P(w_1)P(w_2|w_1)\cdots P(w_n|w_{n-1})$$
Quiz
Given these conditional probabilities
P(Mary) = 0.1;
P(Mary|likes) = 0.2;
P(likes|Mary) = 0.3;
P(cats|likes) = 0.1;
Approximate the probability of the following sentence with bigrams: "Mary likes cats"
Solution: P(Mary likes cats) ≈ P(Mary) P(likes|Mary) P(cats|likes) = 0.1 × 0.3 × 0.1 = 0.003
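The bigram approximation used in this quiz can be checked in code. This is a minimal sketch: the dictionaries and the helper `bigram_sentence_probability` are illustrative names, and the probability values are the ones that appear in the solution above.

```python
import math

# Values used in the quiz solution above.
unigram_p = {"Mary": 0.1}
bigram_p = {("Mary", "likes"): 0.3, ("likes", "cats"): 0.1}

def bigram_sentence_probability(words):
    """P(w1) * P(w2|w1) * ... * P(wn|wn-1), without start or end tokens."""
    prob = unigram_p[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_p[(prev, cur)]
    return prob

p = bigram_sentence_probability(["Mary", "likes", "cats"])
print(round(p, 6))
```

Note that P(Mary|likes) is never used: in the bigram approximation each word is conditioned only on the word before it.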
Starting and Ending Sentences
Start of sentence token <s>
$$P(the\space teacher\space drinks\space tea)\approx P(the)P(teacher|the)P(drinks|teacher)P(tea|drinks)$$
The first word has no preceding word, so its probability cannot be computed as a bigram conditional probability. We therefore add a special start token so that every term on the right-hand side becomes a bigram: "the teacher drinks tea" becomes "<s> the teacher drinks tea", and the probability becomes:
$$P(<s>\space the\space teacher\space drinks\space tea)\approx P(the|<s>)P(teacher|the)P(drinks|teacher)P(tea|drinks)$$
For trigrams, the approximation is:
$$P(the\space teacher\space drinks\space tea)\approx P(the)P(teacher|the)P(drinks|the\space teacher)P(tea|teacher\space drinks)$$
Here the first two terms lack a full two-word context, so two <s> tokens are needed: <s> <s> the teacher drinks tea
End of sentence token </s> -motivation
$$P(y|x)=\cfrac{C(x,y)}{\sum_wC(x,w)}=\cfrac{C(x,y)}{C(x)}$$
When the context word is the last word of a sentence, the two denominators above are not necessarily equal, i.e. $\sum_wC(x,w)\neq C(x)$. For example, with the corpus:
<s> Lyn drinks chocolate
<s> John drinks
$$\sum_wC(drinks,w)=1$$
$$C(drinks)=2$$
A second problem appears with the following corpus:
<s> yes no
<s> yes yes
<s> no no
All possible sentences of length 2:
<s> yes yes
<s> yes no
<s> no no
<s> no yes
Taking the first one, <s> yes yes, as an example, its probability is:
$$\begin{aligned}P(<s>\space yes\space yes)&=P(yes|<s>)\times P(yes|yes)\\ &=\cfrac{C(<s>,yes)}{\sum_wC(<s>,w)}\times\cfrac{C(yes,yes)}{\sum_wC(yes,w)}\\ &=\cfrac{2}{3}\times\cfrac{1}{2}=\cfrac{1}{3}\end{aligned}$$
In the same way, <s> yes no has probability 1/3, <s> no no has probability 1/3, and <s> no yes has probability 0.
So the probabilities of all sentences of length 2 sum to: $\sum_{2\space word}P(\cdots)=1/3+1/3+1/3+0=1$
The same holds for every other length, so without an end-of-sentence token the model assigns a total probability of 1 to each sentence length separately. What we want instead is for the probabilities of sentences of all lengths together to sum to 1:
$$\sum_{2\space word}P(\cdots)+\sum_{3\space word}P(\cdots)+\cdots=1$$
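The effect described above can be verified numerically. The following sketch estimates bigram probabilities from the three-sentence corpus (without any end token) and sums the probabilities of all sentences of a given length; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

corpus = [["<s>", "yes", "no"], ["<s>", "yes", "yes"], ["<s>", "no", "no"]]

bigram_counts = Counter()
context_counts = Counter()  # sum over w of C(x, w)
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def bigram_p(cur, prev):
    return Fraction(bigram_counts[(prev, cur)], context_counts[prev])

def sentence_probability(words):
    """Probability of <s> followed by `words`, with no end-of-sentence token."""
    result = Fraction(1)
    for prev, cur in zip(["<s>"] + words, words):
        result *= bigram_p(cur, prev)
    return result

total_len2 = sum(sentence_probability(list(w)) for w in product(["yes", "no"], repeat=2))
total_len3 = sum(sentence_probability(list(w)) for w in product(["yes", "no"], repeat=3))
print(total_len2, total_len3)
```

Both totals come out as 1, which is exactly the problem: each length receives the full probability mass on its own.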
End of sentence token </s> -solution
The solution is to append </s> to the end of each sentence. For example, <s> the teacher drinks tea </s> has probability:
$$P(the|<s>)P(teacher|the)P(drinks|teacher)P(tea|drinks)P(</s>|tea)$$
For trigrams, two start tokens are added but still only one end token: the teacher drinks tea => <s> <s> the teacher drinks tea </s>
With </s> added, the two counts from the earlier example agree:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
$$\sum_wC(drinks,w)=2$$
$$C(drinks)=2$$
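Padding a sentence for an N-gram model can be sketched as below. The helper name `pad_sentence` is hypothetical, and the rule of N-1 start tokens generalizes the bigram (one <s>) and trigram (two <s>) cases shown above.

```python
def pad_sentence(tokens, n):
    """Prepend n-1 start tokens and append a single end token."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"]

words = "the teacher drinks tea".split()
bigram_ready = pad_sentence(words, 2)
trigram_ready = pad_sentence(words, 3)
print(" ".join(trigram_ready))
```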
Example-bigram
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
$$P(John|<s>)=\cfrac{1}{3}\quad P(</s>|tea)=\cfrac{1}{1}$$
$$P(chocolate|eats)=\cfrac{1}{1}\quad P(Lyn|<s>)=\cfrac{2}{3}$$
For the sentence <s> Lyn drinks chocolate </s>:
$$P(sentence)=\cfrac{2}{3}\times\cfrac{1}{2}\times\cfrac{1}{2}\times\cfrac{2}{2}=\cfrac{1}{6}$$
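The factors in this product are P(Lyn|<s>), P(drinks|Lyn), P(chocolate|drinks) and P(</s>|chocolate). A short sketch that recomputes them from the corpus, using illustrative names throughout:

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]

bigram_counts = Counter()
context_counts = Counter()
for line in corpus:
    tokens = line.split()
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def sentence_probability(sentence):
    """Multiply the bigram probabilities of a sentence that already has <s> and </s>."""
    tokens = sentence.split()
    prob = Fraction(1)
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= Fraction(bigram_counts[(prev, cur)], context_counts[prev])
    return prob

p = sentence_probability("<s> Lyn drinks chocolate </s>")
print(p)
```

This sketch assumes every context word in the query was seen in the corpus; an unseen context would need the smoothing or <UNK> handling discussed later.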
Quiz
Given these conditional probabilities
P(likes|Mary) =0.3;
Approximate the probability of the following sentence with bigrams: "<s> Mary likes cats </s>"
Answer: 0.0036
The N-gram Language Model
Count matrix
$$P(w_n|w_{n-N+1}^{n-1})=\cfrac{C(w_{n-N+1}^{n-1},w_n)}{C(w_{n-N+1}^{n-1})}$$
Numerator: $C(w_{n-N+1}^{n-1},w_n)$
The count matrix stores this co-occurrence count for every N-gram in the corpus.
Bigram count matrix example:
Corpus: <s> I study I learn </s>
For instance, the entry for "study I" is 1, because that bigram occurs once in the corpus.
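A bigram count matrix for this corpus can be built as follows. This is a sketch with plain nested lists; the row and column ordering in `vocab` is an arbitrary choice made for the illustration.

```python
corpus = "<s> I study I learn </s>".split()
vocab = ["<s>", "</s>", "I", "study", "learn"]
index = {word: i for i, word in enumerate(vocab)}

# Rows are the previous word, columns are the next word.
count_matrix = [[0] * len(vocab) for _ in vocab]
for prev, cur in zip(corpus, corpus[1:]):
    count_matrix[index[prev]][index[cur]] += 1

for word, row in zip(vocab, count_matrix):
    print(f"{word:>6}", row)
```

Each cell `count_matrix[index[x]][index[y]]` holds C(x, y), so the row for "study" has a single 1 in the column for "I".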
Probability matrix
Divide each cell by its row sum
$$sum(row)=\sum_{w\in V}C(w_{n-N+1}^{n-1},w)=C(w_{n-N+1}^{n-1})$$
Compute the sum of each row of the count matrix, then divide every count by its row sum to obtain the probability matrix.
Language model
Using the probability matrix, the language model can compute:
○ Sentence probability
○ Next word prediction
For example, using the probability matrix from the previous section, the probability of the sentence <s> I learn </s> is:
$$P(sentence)=P(I|<s>)P(learn|I)P(</s>|learn)=1\times0.5\times1=0.5$$
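Row normalization and the sentence probability above can be sketched together. The count matrix below is written out by hand for the corpus "<s> I study I learn </s>"; the helper name `row_normalize` is illustrative.

```python
from fractions import Fraction

vocab = ["<s>", "</s>", "I", "study", "learn"]
# Bigram counts (rows: previous word, columns: next word).
count_matrix = [
    [0, 0, 1, 0, 0],  # <s>
    [0, 0, 0, 0, 0],  # </s>
    [0, 0, 0, 1, 1],  # I
    [0, 0, 1, 0, 0],  # study
    [0, 1, 0, 0, 0],  # learn
]

def row_normalize(matrix):
    """Divide each cell by its row sum; all-zero rows stay zero."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([Fraction(c, total) if total else Fraction(0) for c in row])
    return result

prob_matrix = row_normalize(count_matrix)
i = {word: k for k, word in enumerate(vocab)}
p_sentence = (prob_matrix[i["<s>"]][i["I"]]
              * prob_matrix[i["I"]][i["learn"]]
              * prob_matrix[i["learn"]][i["</s>"]])
print(p_sentence)
```

Next-word prediction with the same matrix amounts to picking the column with the largest value in the row of the current word.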
Log probability
Multiplying many probabilities smaller than 1 risks numerical underflow, so in practice the product is replaced by a sum of logarithms:
$$P(w_1^n)\approx\prod_{i=1}^{n}P(w_i|w_{i-1})$$
$$\log(P(w_1^n))\approx\sum_{i=1}^{n}\log(P(w_i|w_{i-1}))$$
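A small sketch of why the log form is preferred. The three probabilities are the bigram factors of <s> I learn </s> from the previous example; the long list of tiny values is a made-up stress test, not data from the notes.

```python
import math

bigram_probs = [1.0, 0.5, 1.0]  # P(I|<s>), P(learn|I), P(</s>|learn)
log_prob = sum(math.log(p) for p in bigram_probs)
prob = math.exp(log_prob)  # back to a probability

# Hypothetical long sequence: 1000 factors of 1e-5.
tiny = [1e-5] * 1000
direct_product = math.prod(tiny)              # underflows to 0.0 in floating point
log_sum = sum(math.log(p) for p in tiny)      # stays a finite negative number
print(prob, direct_product, log_sum)
```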
Generative Language model
- Choose sentence start
- Choose next bigram starting with previous word
- Continue until </s> is picked
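The three steps above can be sketched as a sampling loop. The probability table is derived from the Lyn/John corpus used earlier; the function name `generate` and the `max_len` safety limit are assumptions added for the illustration.

```python
import random

# Bigram probabilities: previous word -> {next word: probability}.
bigram_probs = {
    "<s>": {"Lyn": 2 / 3, "John": 1 / 3},
    "Lyn": {"drinks": 0.5, "eats": 0.5},
    "John": {"drinks": 1.0},
    "drinks": {"chocolate": 0.5, "tea": 0.5},
    "eats": {"chocolate": 1.0},
    "chocolate": {"</s>": 1.0},
    "tea": {"</s>": 1.0},
}

def generate(probs, rng, max_len=20):
    """Sample words one bigram at a time until </s> is drawn."""
    words = ["<s>"]
    while words[-1] != "</s>" and len(words) < max_len:
        choices = probs[words[-1]]
        next_word = rng.choices(list(choices), weights=list(choices.values()))[0]
        words.append(next_word)
    return words

sentence = generate(bigram_probs, random.Random(0))
print(" ".join(sentence))
```

Passing a seeded `random.Random` keeps the sketch reproducible; a real generator would usually draw from an unseeded source.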
Language Model Evaluation
Test data
| Split | For smaller corpora | For large corpora (typical for text) |
| --- | --- | --- |
| Train | 80% | 98% |
| Validation | 10% | 1% |
| Test | 10% | 1% |
● Split method
○ Random short sequences
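One way to split by random short sequences is to shuffle whole sentences and cut the list by proportion. This is a sketch under that assumption; the function name `split_corpus`, the default ratios, and the fixed seed are illustrative choices.

```python
import random

def split_corpus(sentences, train=0.8, validation=0.1, seed=0):
    """Shuffle sentences, then cut them into train / validation / test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical corpus of 100 placeholder sentences.
sentences = [f"sentence {i}" for i in range(100)]
train_set, validation_set, test_set = split_corpus(sentences)
print(len(train_set), len(validation_set), len(test_set))
```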
Perplexity
$$\text{PP}(W)=P(w_1,w_2,...,w_m)^{-\frac{1}{m}}$$
Here $P(w_1,w_2,...,w_m)$ is the probability that the language model assigns to the observed word sequence, and $m$ is the total number of words in the sequence.
Specifically, $P(w_1,w_2,...,w_m)$ expands as:
$$P(w_1,w_2,...,w_m)=\prod_{i=1}^{m}P(w_i|w_1,w_2,...,w_{i-1})$$
where $w_i$ is the $i$-th word in the sequence and $P(w_i|w_1,w_2,...,w_{i-1})$ is the probability of the $i$-th word given the previous $i-1$ words.
The exponent $-\frac{1}{m}$ makes perplexity the reciprocal of the geometric mean of the per-word probabilities. The geometric mean is the $m$-th root of the product of all the probabilities, and taking the reciprocal means that a model assigning higher probability to the test text gets a lower perplexity.
Smaller perplexity = better model
Character level models PP < word based models PP
Perplexity for bigram models
$$PP(W)=\sqrt[m]{\prod_{i=1}^{n_s}\prod_{j=1}^{|s_i|}\cfrac{1}{P(w_j^{(i)}|w_{j-1}^{(i)})}}$$
where $w_j^{(i)}$ is the $j$-th word of the $i$-th sentence, $n_s$ is the number of sentences in the test set $W$, and $m$ is the total number of words in the test set.
Equivalently, concatenating all sentences in W:
$$PP(W)=\sqrt[m]{\prod_{i=1}^m\cfrac{1}{P(w_i|w_{i-1})}}$$
where $w_i$ is the $i$-th word in the test set.
Log perplexity
$$\log_2PP(W)=-\cfrac{1}{m}\sum_{i=1}^m\log_2(P(w_i|w_{i-1}))$$
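Log perplexity is the negative average of the per-word log probabilities, and perplexity is 2 raised to that value. The sketch below computes both for a hypothetical list of bigram probabilities and checks the result against the direct definition.

```python
import math

def bigram_perplexity(probs):
    """probs holds P(w_i | w_{i-1}) for each of the m words in the test set."""
    m = len(probs)
    log2_pp = -sum(math.log2(p) for p in probs) / m
    return 2 ** log2_pp, log2_pp

# Hypothetical per-word probabilities for a four-word test set.
test_probs = [0.5, 0.25, 0.5, 0.25]
pp, log_pp = bigram_perplexity(test_probs)

# Direct definition: the product of the probabilities raised to -1/m.
pp_direct = math.prod(test_probs) ** (-1 / len(test_probs))
print(pp, log_pp, pp_direct)
```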

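Example
Training 38 million words, test 1.5 million words, WSJ corpus
Perplexity: Unigram 962, Bigram 170, Trigram 109
The WSJ corpus, short for Wall Street Journal (WSJ) Corpus, is a widely used text corpus based on the text of The Wall Street Journal. It is well known in natural language processing (NLP), especially for training and evaluating language models.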
Out of Vocabulary Words
Out of vocabulary words
Closed vs. Open vocabularies
Closed vocabularies:
Open vocabularies:
In this setting, models typically handle words that are not in the training set with subword segmentation techniques such as Byte Pair Encoding (BPE) or WordPiece.
Unknown word = Out of vocabulary word (OOV)
special tag <UNK> in corpus and in input
Using <UNK> in corpus
● Create vocabulary V
● Replace any word in the corpus that is not in V with <UNK>
● Estimate the probabilities with <UNK> treated like any other word
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Set the vocabulary threshold to at least two occurrences: min frequency f=2
Corpus with <UNK>:
<s> Lyn drinks chocolate </s>
<s> <UNK> drinks <UNK> </s>
<s> Lyn <UNK> chocolate </s>
Vocabulary: Lyn, drinks, chocolate
Input query:
<s> Adam drinks chocolate </s>
Input query with <UNK>:
<s> <UNK> drinks chocolate </s>
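The <UNK> replacement above can be sketched in Python. The helper names `build_vocabulary` and `replace_oov` are illustrative, and keeping <s> and </s> out of the frequency count is an assumption made so that the vocabulary matches the one listed above.

```python
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]
SPECIAL = {"<s>", "</s>"}  # sentence markers are never replaced

def build_vocabulary(sentences, min_frequency):
    """Keep words that occur at least min_frequency times."""
    counts = Counter(w for s in sentences for w in s if w not in SPECIAL)
    return {w for w, c in counts.items() if c >= min_frequency}

def replace_oov(sentence, vocabulary):
    """Map every word outside the vocabulary to <UNK>."""
    return [w if w in vocabulary or w in SPECIAL else "<UNK>" for w in sentence]

vocabulary = build_vocabulary(corpus, min_frequency=2)
processed = [replace_oov(s, vocabulary) for s in corpus]
query = replace_oov("<s> Adam drinks chocolate </s>".split(), vocabulary)
print(" ".join(query))
```

The same `replace_oov` call is applied to the training corpus and to new input, so both see <UNK> in a consistent way.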
How to create vocabulary V
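- Set a minimum word frequency f: words that occur at least f times enter the vocabulary, and all others are replaced by <UNK>
- Set a maximum vocabulary size $|V|$: sort the words by frequency, include the top $|V|$ words in the vocabulary, and replace all others by <UNK>
When comparing perplexities, only compare LMs with the same V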
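The second criterion, a maximum vocabulary size, can be sketched with `collections.Counter.most_common`. The function name `vocabulary_by_size` and the sample token list are illustrative.

```python
from collections import Counter

def vocabulary_by_size(tokens, max_size):
    """Return the max_size most frequent words, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(max_size)]

tokens = "I am happy I am learning I am happy I can study".split()
vocab = vocabulary_by_size(tokens, 3)
print(vocab)
```

Words with equal counts are kept in first-seen order by `most_common`, so ties at the cutoff are broken arbitrarily with respect to the language itself.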
Quiz
Given the training corpus and minimum word frequency=2, what would the vocabulary for the corpus preprocessed with <UNK> look like?
"<s> I am happy I am learning </s> <s> I am happy I can study </s>"
V = (I, am, happy)
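Smoothing
Missing N-grams in training corpus
Problem: N-grams made of known words still might be missing in the training corpus
For example, the corpus may contain "John" and "eats" but not "John eats". The count of "John eats" is then 0, so its bigram probability is 0, and the probability of any sentence containing it is 0 as well.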
Smoothing
Add-one smoothing (Laplacian smoothing)
$$P(w_n|w_{n-1})=\cfrac{C(w_{n-1},w_n)+1}{\sum_{w\in V}(C(w_{n-1},w)+1)}=\cfrac{C(w_{n-1},w_n)+1}{C(w_{n-1})+|V|}$$
Add-one smoothing works only when the corpus is large enough that the real counts outweigh the added ones. Otherwise, missing N-grams receive too much probability.
For a very large corpus, add-k smoothing can be used (it also applies to higher-order N-grams such as trigrams and 4-grams):
$$P(w_n|w_{n-1})=\cfrac{C(w_{n-1},w_n)+k}{\sum_{w\in V}(C(w_{n-1},w)+k)}=\cfrac{C(w_{n-1},w_n)+k}{C(w_{n-1})+k\times|V|}$$
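Add-k smoothing for bigrams can be sketched as below; add-one smoothing is the special case k=1. The function name `add_k_probability` is illustrative, and the vocabulary is assumed to be the set of distinct words in the token list.

```python
from collections import Counter
from fractions import Fraction

def add_k_probability(tokens, prev, word, k=1):
    """(C(prev, word) + k) / (C(prev) + k * |V|)."""
    vocab = set(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    return Fraction(bigrams[(prev, word)] + k, unigrams[prev] + k * len(vocab))

tokens = "I am happy I am learning".split()
p_seen = add_k_probability(tokens, "I", "am", k=1)       # (2 + 1) / (2 + 4)
p_unseen = add_k_probability(tokens, "I", "happy", k=1)  # (0 + 1) / (2 + 4)
print(p_seen, p_unseen)
```

The unseen bigram "I happy" now gets a non-zero probability, at the cost of lowering the estimate for "I am" from 1 to 1/2.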
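Advanced methods:
Kneser-Ney smoothing:
Proposed by Reinhard Kneser and Hermann Ney, Kneser-Ney is a smoothing technique for estimating conditional probability distributions.
Kneser-Ney takes the context of a word into account: it subtracts a fixed discount from each observed count and redistributes that mass through a lower-order probability based on how many distinct contexts a word appears in.
Good-Turing smoothing:
Good-Turing smoothing was proposed by I. J. Good and is used to estimate the probability of events that never appeared in the corpus.
It smooths by shifting probability mass from observed events to rarer ones, in particular to those that never appeared in the training corpus.
The method is simple and computationally efficient, but it may be less flexible than Kneser-Ney because it does not distinguish words by the contexts they occur in.
Both methods have strengths and limitations. Kneser-Ney smoothing usually performs better in practice because it uses context information, but it is more expensive to compute. Good-Turing smoothing is sometimes chosen for its simplicity and efficiency, especially when resources are limited.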
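Backoff
If N-gram missing => use (N-1)-gram, ... There are two ways to back off:
The first substitutes the lower-order estimate after discounting the higher-order probabilities: probability discounting, e.g. Katz backoff
The second multiplies the lower-order estimate by a constant (0.4 works well) before substituting it: "Stupid" backoff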
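"Stupid" backoff can be sketched as a loop that shortens the context until a seen N-gram is found, multiplying by the constant each time. The helper names are illustrative, and the result is a score rather than a normalized probability.

```python
import math

def count(tokens, ngram):
    """Number of times the tuple `ngram` occurs in the token list."""
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == ngram)

def stupid_backoff_score(tokens, context, word, alpha=0.4):
    context = tuple(context)
    weight = 1.0
    while context:
        full = count(tokens, context + (word,))
        if full > 0:
            return weight * full / count(tokens, context)
        context = context[1:]  # drop the earliest context word
        weight *= alpha
    # No context left: fall back to the unigram relative frequency.
    return weight * count(tokens, (word,)) / len(tokens)

tokens = "I am happy because I am learning".split()
seen = stupid_backoff_score(tokens, ("because", "I"), "am")      # trigram was seen
backed_off = stupid_backoff_score(tokens, ("happy", "I"), "am")  # falls back to the bigram "I am"
print(seen, backed_off)
```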
Interpolation
The coefficients $\lambda$ can be learned from data, typically by tuning them on a held-out validation set.
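The notes do not show the interpolation formula itself. The sketch below assumes the usual linear interpolation of trigram, bigram, and unigram estimates with weights that sum to 1; the function name and the default weights (0.7, 0.2, 0.1) are hypothetical values chosen for illustration.

```python
import math

def interpolated_probability(p_trigram, p_bigram, p_unigram, lambdas=(0.7, 0.2, 0.1)):
    """lambda3 * P(trigram) + lambda2 * P(bigram) + lambda1 * P(unigram)."""
    l3, l2, l1 = lambdas
    # The weights must sum to 1 for the result to remain a probability.
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# Hypothetical estimates: the trigram is unseen, the bigram and unigram are not.
p = interpolated_probability(p_trigram=0.0, p_bigram=0.5, p_unigram=0.1)
print(round(p, 4))
```

Unlike backoff, the lower-order estimates contribute even when the higher-order N-gram was seen.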
Quiz
Corpus: "I am happy I am learning"
In the context of our corpus, what is the estimated probability of the word "can" following the word "I", using the bigram model and add-k smoothing with k=3?
Answer: P(can|I) = (0+3)/(2+3×4) = 3/14, since C(I can)=0, C(I)=2 and |V|=4.