1. Fundamentals
Word2Vec
Efficient Estimation of Word Representations in Vector Space
https://arxiv.org/abs/1301.3781v3
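To make the paper's skip-gram with negative sampling objective concrete, here is a minimal NumPy sketch; the toy corpus, embedding size, hyperparameters, and the uniform negative sampling are made up for illustration and are not taken from the paper.

```python
# Minimal skip-gram with negative sampling sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16                        # vocabulary size, embedding dim

W_in = rng.normal(scale=0.1, size=(V, D))    # "input" (center word) vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # "output" (context word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, window, k = 0.05, 2, 3                   # learning rate, context window, negatives
for epoch in range(200):
    for pos, word in enumerate(corpus):
        center = idx[word]
        for off in range(-window, window + 1):
            ctx_pos = pos + off
            if off == 0 or ctx_pos < 0 or ctx_pos >= len(corpus):
                continue
            context = idx[corpus[ctx_pos]]
            # negatives drawn uniformly here for simplicity (the paper uses
            # a smoothed unigram distribution)
            negatives = rng.integers(0, V, size=k)
            targets = np.concatenate(([context], negatives))
            labels = np.array([1.0] + [0.0] * k)
            scores = sigmoid(W_out[targets] @ W_in[center])
            grad = scores - labels                       # logistic-loss gradient
            W_in[center] -= lr * grad @ W_out[targets]
            W_out[targets] -= lr * np.outer(grad, W_in[center])

print(W_in[idx["fox"]][:5])                  # learned embedding for one word
```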
Transformer
Attention Is All You Need
https://arxiv.org/abs/1706.03762
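The core operation introduced by the paper is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that formula; the shapes and random inputs are illustrative only.

```python
# Scaled dot-product attention sketch (shapes and inputs are illustrative).
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (batch, q_len, k_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)                 # block masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                        # (batch, q_len, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4, 8))   # batch=1, 4 query positions, d_k=8
K = rng.normal(size=(1, 6, 8))   # 6 key positions
V = rng.normal(size=(1, 6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)            # (1, 4, 8)
```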
BERT
Pre-training of Deep Bidirectional Transformers for Language Understanding
https://arxiv.org/abs/1810.04805
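BERT's masked-language-model pre-training corrupts 15% of the input tokens: of the selected tokens, 80% become [MASK], 10% become a random token, and 10% stay unchanged. The sketch below shows that corruption rule; the token ids, MASK_ID, and vocabulary size are placeholders, not a real tokenizer.

```python
# BERT-style masked-LM corruption rule (placeholder ids, not a real tokenizer).
import random

MASK_ID, VOCAB_SIZE = 103, 30522

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored position
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                 # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                     # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)   # 10%: random token
        # else 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens([2023, 2003, 1037, 7099, 6251], seed=1))
```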
ERNIE
Enhanced Representation through Knowledge Integration
https://arxiv.org/pdf/1904.09223
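ERNIE's main change over BERT-style masking is to mask whole phrases and entities rather than isolated subwords. A rough sketch of that span-level masking idea follows; the span list and tokens are invented for illustration.

```python
# Span-level (phrase/entity) masking sketch; spans and tokens are invented.
import random

MASK = "[MASK]"

def mask_spans(tokens, spans, mask_prob=0.5, seed=0):
    """spans: list of (start, end) index pairs marking phrases or entities."""
    rng = random.Random(seed)
    out = list(tokens)
    for start, end in spans:
        if rng.random() < mask_prob:
            for i in range(start, end):        # mask the whole span at once
                out[i] = MASK
    return out

tokens = ["harry", "potter", "was", "written", "by", "j", "k", "rowling"]
spans = [(0, 2), (5, 8)]                       # "harry potter", "j k rowling"
print(mask_spans(tokens, spans, seed=3))
```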
GPT
GPT-1: Improving Language Understanding by Generative Pre-Training
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-3: Language Models are Few-Shot Learners
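All three GPT papers share the same autoregressive setup: a causal mask so each position only attends to earlier tokens, and a next-token prediction objective. The sketch below illustrates both with a toy bigram "model"; it is not an actual GPT implementation, and the corpus is made up.

```python
# Causal mask plus a toy next-token generation loop (not an actual GPT).
import numpy as np

def causal_mask(n):
    # position i may attend only to positions <= i
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(4).astype(int))

corpus = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
counts = np.ones((len(vocab), len(vocab)))          # bigram counts, add-one smoothing
for a, b in zip(corpus, corpus[1:]):
    counts[idx[a], idx[b]] += 1

def generate(prompt, steps=5):
    tokens = prompt.split()
    for _ in range(steps):                          # greedy next-token loop
        probs = counts[idx[tokens[-1]]]
        tokens.append(vocab[int(np.argmax(probs))])
    return " ".join(tokens)

print(generate("the"))
```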
2. Advanced
RoBERTa
RoBERTa: A Robustly Optimized BERT Pretraining Approach
https://arxiv.org/abs/1907.11692
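One of the changes RoBERTa makes is dynamic masking: instead of fixing the masked positions once at preprocessing time as in the original BERT setup, the mask is re-sampled every time a sequence is seen. A tiny sketch of that difference, with illustrative stand-in values:

```python
# Static vs. dynamic masking sketch (values are illustrative stand-ins).
import random

def sample_mask_positions(length, mask_prob=0.15, seed=None):
    rng = random.Random(seed)
    return [i for i in range(length) if rng.random() < mask_prob]

sequence_length, epochs = 20, 3

static = sample_mask_positions(sequence_length, seed=0)       # fixed once
for epoch in range(epochs):
    dynamic = sample_mask_positions(sequence_length)          # re-sampled each epoch
    print(f"epoch {epoch}: static={static} dynamic={dynamic}")
```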