大模型综述论文笔记6-15

这里写自定义目录标题

  • Keywords
  • [Backgroud for LLMs](#Backgroud for LLMs)
    • [Technical Evolution of GPT-series Models](#Technical Evolution of GPT-series Models)
      • [Research of OpenAI on LLMs can be roughly divided into the following stages](#Research of OpenAI on LLMs can be roughly divided into the following stages)
        • [Early Explorations](#Early Explorations)
        • [Capacity Leap](#Capacity Leap)
        • [Capacity Enhancement](#Capacity Enhancement)
        • [The Milestones of Language Models](#The Milestones of Language Models)
  • Resources
  • Pre-training
    • [Data Collection](#Data Collection)
    • [Data Preprocessing](#Data Preprocessing)

Keywords

GPT:Generative Pre-Training

Backgroud for LLMs

Technical Evolution of GPT-series Models

Two key points to GPT's success are (I) training decoder-onlly Transformer language models that can accurately predict the next word and (II) scaling up the size of language models

Research of OpenAI on LLMs can be roughly divided into the following stages

Early Explorations

Capacity Leap

ICT

Capacity Enhancement

1.training on code data

Codex: a GPT model fine-tuned on a large corpus of GitHub

code
2.alignment with human preference

reinforcement learning from human feedback (RLHF) algorithm

Note that it seems that the wording of "instruction tuning" has seldom

been used in OpenAI's paper and documentation, which is substituted by

supervised fine-tuning on human demonstrations (i.e., the first step

of the RLHF algorithm).

The Milestones of Language Models

chatGPT(based on gpt3.5 and gpt4) and GPT-4(multimodal)

Resources

Stanford Alpaca is the first open instruct-following model fine-tuned based on LLaMA (7B).

Alpaca LoRA (a reproduction of Stanford Alpaca using LoRA)

model 、data、library

Pre-training

Data Collection

General Text Data:webpages, books, and conversational text

Specialized Text Data:Multilingual text, Scientific text, Code

Data Preprocessing

Quality Filtering

  1. The former approach trains a selection classifier based on highquality texts and leverages it to identify and filter out low quality data.
  2. heuristic based approaches to eliminate low-quality texts through a set of well-designed rules: Language based filtering, Metric based filtering, Statistic based filtering, Keyword based filtering

De-duplication

Existing work has found that duplicate data in a corpus would reduce the diversity of language models, which may cause the training process to become unstable and thus affect the model performance.

  1. Privacy Redaction: (PII:personally identifiable information )
  2. Tokenization:(It aims to segment raw text into sequences of individual tokens, which are subsequently used as the inputs of LLMs.) Byte-Pair Encoding (BPE) tokenization; WordPiece tokenization; WordPiece tokenization
相关推荐
薛定e的猫咪3 天前
【AAAI 2025】基于扩散模型的昂贵多目标贝叶斯优化
论文阅读·人工智能·算法
YMWM_3 天前
论文阅读“SimVLA: A Simple VLA Baseline for Robotic Manipulation“
论文阅读·vla
m0_650108243 天前
VLN-Zero:零样本机器人导航的神经符号视觉语言规划框架
论文阅读·零样本·机器人导航·视觉语言导航·未知环境快速适配·符号化场景图·vlm推理
晓山清4 天前
【论文阅读】Self-supervised Learning of Person-specific Facial Dynamics for APR
论文阅读
张较瘦_4 天前
[论文阅读] AI + 教育 | 不是单纯看视频!软件工程培训的游戏化融合之道
论文阅读·人工智能·软件工程
张较瘦_4 天前
[论文阅读] AI + 软件工程 | 用统计置信度破解AI功能正确性评估难题——SCFC方法详解
论文阅读·人工智能·软件工程
Matrix_115 天前
论文阅读--Agent AI 探索多模态交互的前沿领域(二)
论文阅读·人工智能
万里鹏程转瞬至6 天前
论文简读 | TurboDiffusion: Accelerating Video Diffusion Models by 100–200 Times
论文阅读·深度学习·aigc
Matrix_116 天前
论文阅读--Agent AI 探索多模态交互的前沿领域(一)
论文阅读·人工智能
@––––––7 天前
论文阅读笔记:π 0 : A Vision-Language-Action Flow Model for General Robot Control
论文阅读·笔记