Intuition: XOR is not linearly separable.
At its core, ML is classification. For linearly non-separable data, one approach is the SVM, which maps the data into a higher-dimensional space; another is the neural network, which stacks nonlinear activations.
$\tanh(\alpha) = \frac{\exp(\alpha) - \exp(-\alpha)}{\exp(\alpha) + \exp(-\alpha)}$, convenient because its output is normalized to $(-1, 1)$.
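A quick numerical check of this definition against `np.tanh` (a minimal sketch; the test values are arbitrary):

```python
import numpy as np

def tanh_from_exp(alpha):
    # tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a))
    return (np.exp(alpha) - np.exp(-alpha)) / (np.exp(alpha) + np.exp(-alpha))

alpha = np.linspace(-3.0, 3.0, 7)
print(tanh_from_exp(alpha))                               # squashed into (-1, 1)
print(np.allclose(tanh_from_exp(alpha), np.tanh(alpha)))  # True
```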
Gradient descent -> Newton's method
The first-order derivative is used for gradient descent; the second-order derivative acts like momentum and is used to adjust the learning rate.
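A 1-D sketch contrasting the two updates on $f(x) = (x-2)^2$ (the function, start point, and step size are assumptions for illustration):

```python
def f_prime(x):   # first derivative of f(x) = (x - 2)**2
    return 2 * (x - 2)

def f_second(x):  # second derivative (constant curvature here)
    return 2.0

x_gd, x_newton, lr = 10.0, 10.0, 0.1
for _ in range(5):
    x_gd -= lr * f_prime(x_gd)                           # gradient descent: fixed learning rate
    x_newton -= f_prime(x_newton) / f_second(x_newton)   # Newton: curvature-scaled step

print(x_gd, x_newton)  # Newton lands on the minimum x = 2 in one step
```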
Comparison of different learning-rate adjustment methods (update-rule sketches follow the list):
- RMSProp is relatively accurate on sequence tasks;
- Adam descends faster during training but performs worse at test time.
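Minimal NumPy sketches of the two update rules (the hyperparameter values shown are the common defaults, used here as assumptions):

```python
import numpy as np

def rmsprop_step(w, grad, state, lr=1e-3, rho=0.9, eps=1e-8):
    # running average of squared gradients rescales the learning rate per weight
    state["v"] = rho * state["v"] + (1 - rho) * grad**2
    return w - lr * grad / (np.sqrt(state["v"]) + eps)

def adam_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # first moment (momentum) plus second moment (RMSProp-style scaling)
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad**2
    m_hat = state["m"] / (1 - b1**state["t"])   # bias correction
    v_hat = state["v"] / (1 - b2**state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

w, grad = np.array([1.0]), np.array([0.5])
state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}
print(adam_step(w, grad, state))
```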
- Probability formula for a sentence under an n-gram model? MLE gives the sentence's maximum-likelihood probability (see the sketch after this list):
  $P(w_1 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1} \dots w_{i-k})$
- Learning approaches:
  - continuous bag of words
  - skip-gram
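A minimal sketch of the MLE estimate behind this formula, using relative-frequency counts for the bigram case ($k = 1$); the toy corpus is an assumption for illustration:

```python
from collections import Counter

corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def p(cur, prev):
    # MLE: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
    return bigrams[(prev, cur)] / unigrams[prev]

# P(<s> the cat sat </s>) as a product of bigram probabilities
sent = ["<s>", "the", "cat", "sat", "</s>"]
prob = 1.0
for prev, cur in zip(sent, sent[1:]):
    prob *= p(cur, prev)
print(prob)  # 0.5: only "cat" vs. "dog" after "the" is uncertain
```

The same counting approach, extended to larger k, is what hw1's statistical language model asks for.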
All NN parameters are learnable; plugging an NN into a language model is simple, so what's difficult?
- Does a Neural LM need smoothing?
No: even if some variables (e.g., counts of unseen n-grams) are zero, forward propagation still proceeds successfully.
Even an unseen output can be expressed as a vector in the embedding space.
But how do we predict a word from an embedding vector?
- Linear + Softmax maps the vector to a distribution over the vocabulary (trained against one-hot targets) for prediction; see the sketch below.
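A minimal PyTorch sketch of that Linear + Softmax prediction head (the vocabulary size, embedding dimension, and random hidden vector are placeholder assumptions):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 128          # placeholder sizes
proj = nn.Linear(embed_dim, vocab_size)     # embedding vector -> vocabulary logits

h = torch.randn(1, embed_dim)               # stand-in for the LM's hidden/embedding vector
probs = torch.softmax(proj(h), dim=-1)      # distribution over all words, every entry > 0
predicted_word_id = probs.argmax(dim=-1)    # index of the one-hot prediction
```

Because softmax is strictly positive, every word, seen or unseen, receives non-zero probability, which is why no explicit smoothing is needed.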
How does a Neural LM capture long-term dependencies beyond the n-gram window?
- LSTM, whose recurrent state can carry context across long spans (see the sketch below).
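A minimal sketch of an LSTM language model, wiring the embedding, recurrence, and Linear + Softmax head together (all sizes and the random input are assumptions):

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # the recurrent state carries context across the whole sequence,
        # not just a fixed k-word window
        out, _ = self.lstm(self.embed(token_ids))
        return self.head(out)  # next-word logits at every position

model = LSTMLanguageModel()
logits = model(torch.randint(0, 10000, (1, 12)))  # batch of one 12-token sequence
print(logits.shape)  # torch.Size([1, 12, 10000])
```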
UNK represents all unseen (out-of-vocabulary) words.
- Basic features of language: prefixes and suffixes, …
Target of hw1: build a statistical language model from the training data.