Notes on the Large Language Model Survey Paper, 6-15

  • Keywords
  • [Background for LLMs](#Backgroud for LLMs)
    • [Technical Evolution of GPT-series Models](#Technical Evolution of GPT-series Models)
      • [Research of OpenAI on LLMs can be roughly divided into the following stages](#Research of OpenAI on LLMs can be roughly divided into the following stages)
        • [Early Explorations](#Early Explorations)
        • [Capacity Leap](#Capacity Leap)
        • [Capacity Enhancement](#Capacity Enhancement)
        • [The Milestones of Language Models](#The Milestones of Language Models)
  • Resources
  • Pre-training
    • [Data Collection](#Data Collection)
    • [Data Preprocessing](#Data Preprocessing)

Keywords

GPT: Generative Pre-Training

Background for LLMs

Technical Evolution of GPT-series Models

Two key points in GPT's success are (I) training decoder-only Transformer language models that can accurately predict the next word and (II) scaling up the size of language models.

Research of OpenAI on LLMs can be roughly divided into the following stages

Early Explorations

Capacity Leap

ICL (in-context learning)
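
As a rough illustration of in-context learning, the sketch below assembles a few-shot prompt from a task description and a handful of demonstrations. The function name `build_icl_prompt`, the "Input:/Output:" template, and the translation examples are hypothetical choices made for this note, and no model is actually called.

```python
from typing import List, Tuple


def build_icl_prompt(instruction: str,
                     demonstrations: List[Tuple[str, str]],
                     query: str) -> str:
    """Concatenate a task description, a few input/output demonstrations,
    and the new query into a single prompt string.

    The model is expected to continue the text after the final "Output:";
    its parameters are not updated.
    """
    parts = [instruction.strip()]
    for source, target in demonstrations:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The query is left open so the model fills in the answer.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical toy task, used only to show the prompt layout.
prompt = build_icl_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "book",
)
print(prompt)
```

The point of the layout is that the "training signal" lives entirely in the prompt text: changing the demonstrations changes the task without touching the model.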

Capacity Enhancement

1. Training on code data

Codex: a GPT model fine-tuned on a large corpus of GitHub code

2. Alignment with human preference

reinforcement learning from human feedback (RLHF) algorithm

Note that the term "instruction tuning" seems to be seldom used in OpenAI's papers and documentation; it is replaced by supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm).

The Milestones of Language Models

ChatGPT (based on GPT-3.5 and GPT-4) and GPT-4 (multimodal)

Resources

Stanford Alpaca is the first open instruction-following model fine-tuned from LLaMA (7B).

Alpaca LoRA (a reproduction of Stanford Alpaca using LoRA)

Models, data, libraries

Pre-training

Data Collection

General Text Data: webpages, books, and conversational text

Specialized Text Data: multilingual text, scientific text, code

Data Preprocessing

Quality Filtering

  1. Classifier-based approaches train a selection classifier on high-quality texts and use it to identify and filter out low-quality data.
  2. Heuristic-based approaches eliminate low-quality texts through a set of well-designed rules: language-based filtering, metric-based filtering, statistic-based filtering, and keyword-based filtering.
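
The heuristic rules above can be sketched roughly as follows. This is a minimal illustration, not the pipeline of any particular LLM: the thresholds, the keyword list, and the function names are all assumptions made for this note. Language-based and metric-based (e.g., perplexity) filtering need an external model and are left out.

```python
from typing import Iterable, List

# Hypothetical thresholds and keywords, chosen only for illustration.
MIN_WORDS = 5
MAX_SYMBOL_RATIO = 0.3
BLOCKED_KEYWORDS = {"lorem ipsum", "click here to subscribe"}


def symbol_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    symbols = sum(1 for c in chars if not c.isalnum())
    return symbols / len(chars)


def passes_heuristics(text: str) -> bool:
    """Return True if the text survives every rule below."""
    # Statistic-based rule: drop very short documents.
    if len(text.split()) < MIN_WORDS:
        return False
    # Statistic-based rule: drop documents dominated by punctuation or symbols.
    if symbol_ratio(text) > MAX_SYMBOL_RATIO:
        return False
    # Keyword-based rule: drop documents containing blocked phrases.
    lowered = text.lower()
    if any(keyword in lowered for keyword in BLOCKED_KEYWORDS):
        return False
    return True


def filter_corpus(docs: Iterable[str]) -> List[str]:
    """Keep only the documents that pass all heuristic rules."""
    return [doc for doc in docs if passes_heuristics(doc)]


docs = [
    "Large language models are trained on web pages, books, and code.",
    "$$$ ### !!! ??? *** %%% @@@",
    "Too short.",
    "Click here to subscribe to our weekly newsletter today!",
]
kept = filter_corpus(docs)
print(kept)
```

Rules like these are cheap and transparent, but the thresholds usually need tuning per data source.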

De-duplication

Existing work has found that duplicate data in a corpus would reduce the diversity of language models, which may cause the training process to become unstable and thus affect the model performance.
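
A minimal sketch of document-level de-duplication, assuming exact matching after light normalization (lowercasing and whitespace collapsing). The function names are hypothetical, and near-duplicate detection is not covered here.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, comparing hashes of the
    normalized text instead of the full strings."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["Hello   world", "hello world", "Another doc", "Hello world "]
print(deduplicate(corpus))
```

Storing hashes rather than full texts keeps memory use small on large corpora; the same idea can be applied at sentence level by splitting documents first.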

  1. Privacy Redaction (PII: personally identifiable information)
  2. Tokenization: aims to segment raw text into sequences of individual tokens, which are subsequently used as the inputs of LLMs. Examples: Byte-Pair Encoding (BPE) tokenization; WordPiece tokenization
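
To make BPE tokenization more concrete, here is a small sketch of the merge-learning loop on a toy word-frequency table. It assumes character-level starting symbols and a `</w>` end-of-word marker; the function names and the toy frequencies are hypothetical, and production tokenizers differ in many details (for example, working on bytes).

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]


def get_pair_counts(vocab: Dict[Word, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair: Tuple[str, str], vocab: Dict[Word, int]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with one merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in vocab.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Repeatedly merge the most frequent adjacent pair and record the merges."""
    # Start from characters plus an end-of-word marker (an assumption of this sketch).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


# Toy word frequencies, made up for illustration.
merges = learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2)
print(merges)
```

The learned merge list is what gets applied, in order, to split new text into subword tokens.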