Large Model Survey Paper Notes 6-15

Contents

  • Keywords
  • [Background for LLMs](#Background for LLMs)
    • [Technical Evolution of GPT-series Models](#Technical Evolution of GPT-series Models)
      • [Research of OpenAI on LLMs can be roughly divided into the following stages](#Research of OpenAI on LLMs can be roughly divided into the following stages)
        • [Early Explorations](#Early Explorations)
        • [Capacity Leap](#Capacity Leap)
        • [Capacity Enhancement](#Capacity Enhancement)
        • [The Milestones of Language Models](#The Milestones of Language Models)
  • Resources
  • Pre-training
    • [Data Collection](#Data Collection)
    • [Data Preprocessing](#Data Preprocessing)

Keywords

GPT: Generative Pre-Training

Background for LLMs

Technical Evolution of GPT-series Models

Two key points to GPT's success are (I) training decoder-only Transformer language models that can accurately predict the next word, and (II) scaling up the size of language models.
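As a minimal sketch of point (I): pre-training reduces to next-token prediction with a shifted cross-entropy loss. The code below is illustrative only, with random tensors standing in for a real decoder-only model's output.

```python
# Minimal sketch of the next-word-prediction objective behind GPT-style
# training (illustrative; the tensors are random placeholders).
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the prediction at position t and the token at t+1."""
    # logits: (batch, seq_len, vocab_size), tokens: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = tokens[:, 1:]       # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

batch, seq_len, vocab = 2, 16, 50257
logits = torch.randn(batch, seq_len, vocab)   # stand-in for model output
tokens = torch.randint(0, vocab, (batch, seq_len))
print(next_token_loss(logits, tokens))
```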

Research of OpenAI on LLMs can be roughly divided into the following stages

Early Explorations

Capacity Leap

ICL (in-context learning), introduced with GPT-3
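ICL means the model solves a task from a few demonstrations placed directly in the prompt, with no gradient updates. A minimal sketch (the sentiment task and demonstrations are invented for illustration):

```python
# Minimal sketch of an in-context (few-shot) prompt; the examples are
# illustrative, and any text-completion endpoint would consume the result.
demonstrations = [
    ("I loved this movie!", "positive"),
    ("Terrible service, never again.", "negative"),
]
query = "The food was amazing."

prompt = "\n".join(f"Review: {text}\nSentiment: {label}"
                   for text, label in demonstrations)
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)  # the model infers the task pattern from the demonstrations
```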

Capacity Enhancement

1. Training on code data

Codex: a GPT model fine-tuned on a large corpus of GitHub code.
2. Alignment with human preference

reinforcement learning from human feedback (RLHF) algorithm

Note that the wording "instruction tuning" has seldom been used in OpenAI's papers and documentation; it is replaced by supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm).
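RLHF proceeds in three steps: supervised fine-tuning on human demonstrations, reward-model training on human preference comparisons, and RL optimization (PPO) against the reward model. As one concrete piece, the pairwise ranking loss used for the reward model in InstructGPT is sketched below (random tensors stand in for real reward scores):

```python
# Minimal sketch of the pairwise reward-model loss (step 2 of RLHF): the
# reward of the human-preferred response should exceed that of the rejected
# one. Inputs here are random placeholders.
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.randn(8)    # scalar rewards for preferred responses
r_rejected = torch.randn(8)  # scalar rewards for rejected responses
print(reward_ranking_loss(r_chosen, r_rejected))
```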

The Milestones of Language Models

ChatGPT (based on GPT-3.5 and GPT-4) and GPT-4 (multimodal)

Resources

Stanford Alpaca is the first open instruction-following model fine-tuned from LLaMA (7B).

Alpaca LoRA (a reproduction of Stanford Alpaca using LoRA)
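A minimal sketch of LoRA fine-tuning in the style of Alpaca LoRA, using the Hugging Face peft library; the checkpoint id and hyperparameters here are assumptions, not the project's exact settings:

```python
# Minimal LoRA setup sketch (hypothetical checkpoint and hyperparameters).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed id
config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small adapter matrices train
```

Only the low-rank adapter matrices are trained, which is what makes reproducing Alpaca feasible on modest hardware.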

Resources covered: models, data, and libraries.

Pre-training

Data Collection

General Text Data: webpages, books, and conversational text

Specialized Text Data: multilingual text, scientific text, code

Data Preprocessing

Quality Filtering

  1. Classifier-based filtering: train a selection classifier on high-quality texts and use it to identify and filter out low-quality data.
  2. Heuristic-based filtering: eliminate low-quality texts through a set of well-designed rules: language-based filtering, metric-based filtering, statistic-based filtering, and keyword-based filtering (see the sketch after this list).
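A minimal sketch of the heuristic approach; the rules and thresholds below are invented for illustration and would be tuned on a real corpus:

```python
# Minimal heuristic quality filter; thresholds and keyword list are
# illustrative assumptions. Language-based filtering (e.g., a language-ID
# model) would typically run as an additional rule.
def keep_document(text: str) -> bool:
    words = text.split()
    if len(words) < 50:                      # statistic-based: too short
        return False
    if len(set(words)) / len(words) < 0.3:   # metric-based: highly repetitive
        return False
    banned = {"lorem", "click here"}         # keyword-based filtering
    if any(k in text.lower() for k in banned):
        return False
    return True

docs = [" ".join(f"token{i}" for i in range(100)), "spam spam spam " * 40]
print([keep_document(d) for d in docs])  # [True, False]
```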

De-duplication

Existing work has found that duplicate data in a corpus would reduce the diversity of language models, which may cause the training process to become unstable and thus affect the model performance.
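A minimal sketch of exact de-duplication by content hashing; real pipelines also apply fuzzy matching (e.g., MinHash) at sentence, document, and dataset granularity:

```python
# Minimal exact de-duplication via normalized content hashes.
import hashlib

def deduplicate(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(deduplicate(["Hello world.", "hello world.", "Something else."]))
```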

Privacy Redaction

Remove personally identifiable information (PII), e.g. names, addresses, and phone numbers, from the pre-training corpus.
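A minimal sketch of rule-based PII redaction; the regex patterns are simplified illustrations, not production detectors:

```python
# Minimal rule-based PII scrubbing; patterns are simplified assumptions.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567."))
# -> Contact [EMAIL] or [PHONE].
```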
Tokenization

It aims to segment raw text into sequences of individual tokens, which are subsequently used as the inputs of LLMs. Common schemes: Byte-Pair Encoding (BPE) tokenization, WordPiece tokenization, and Unigram tokenization.
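A minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library; the tiny corpus and vocabulary size are illustrative:

```python
# Minimal BPE training sketch; corpus and vocab_size are toy assumptions.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])

corpus = ["low lower lowest", "new newer newest"] * 50
tokenizer.train_from_iterator(corpus, trainer)  # learns merges from the corpus
print(tokenizer.encode("lowest newest").tokens)  # subword segmentation
```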