LLM Survey Paper Notes 1-5

Table of Contents

  • Keywords
  • Introduction
  • [Background for LLMs](#Background for LLMs)
    • [Scaling Laws for LLMs](#Scaling Laws for LLMs)
      • [KM scaling law](#KM scaling law)
      • [Chinchilla scaling law](#Chinchilla scaling law)
    • [Emergent Abilities of LLMs](#Emergent Abilities of LLMs)
      • [In-context learning](#In-context learning)
      • [Instruction following](#Instruction following)
      • [Step-by-step reasoning](#Step-by-step reasoning)
    • [Key Techniques for LLMs](#Key Techniques for LLMs)
      • Scaling
      • Training
      • [Ability eliciting](#Ability eliciting)
      • [Alignment tuning](#Alignment tuning)
      • [Tools manipulation](#Tools manipulation)

Keywords

PLMs: pre-trained language models

NLP: natural language processing

LLM: large language models

LM: language modeling

AI: artificial intelligence

SLM: statistical language models

NLM: neural language models

RNNs: recurrent neural networks

ELMo: Embeddings from Language Models

AGI: artificial general intelligence

ICL: in-context learning

Introduction

https://github.com/RUCAIBox/LLMSurvey

SLM

The basic idea of SLMs is based on the Markov assumption. SLMs with a fixed context length n are also called n-gram language models.

Bottleneck: the curse of dimensionality. Because an exponential number of transition probabilities must be estimated, SLMs cannot accurately estimate high-order language models.

Mitigations: backoff estimation and Good-Turing estimation were introduced to alleviate the data sparsity problem.
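
To make the n-gram idea concrete, here is a minimal sketch (my own addition, not from the survey): bigram probabilities are estimated by relative frequency, and unseen bigrams back off to a discounted unigram estimate. The toy corpus and the discount factor are invented for illustration.

```python
from collections import Counter

# Toy corpus; in practice counts come from a large text collection.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def bigram_prob(prev, word, alpha=0.4):
    """P(word | prev): use the bigram relative frequency if the bigram was seen,
    otherwise back off to a discounted unigram estimate."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total

print(bigram_prob("the", "cat"))   # seen bigram -> relative frequency
print(bigram_prob("dog", "mat"))   # unseen bigram -> backed-off unigram estimate
```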

NLM

NLMs characterize the probability of word sequences with neural networks. They opened up the use of language models for representation learning, going beyond pure word sequence modeling.

distributed representation of words

word prediction function conditioned on distributed word vectors

word2vec
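
As a small usage sketch (my own addition, not from the notes): training word2vec embeddings with gensim. It assumes gensim >= 4.0, where the embedding dimension is passed as vector_size; the toy sentences are invented.

```python
# Minimal word2vec sketch with gensim (assumes gensim >= 4.0).
from gensim.models import Word2Vec

sentences = [
    ["language", "models", "predict", "the", "next", "word"],
    ["distributed", "representations", "place", "similar", "words", "nearby"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 -> skip-gram
vec = model.wv["word"]                        # a 50-dimensional distributed representation
print(model.wv.most_similar("word", topn=3))  # nearest neighbours in the embedding space
```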

PLM

ELMo captures context-aware word representations with a bidirectional LSTM (biLSTM) network, and the pre-trained biLSTM can then be fine-tuned for specific downstream tasks.

BERT is pre-trained with specially designed pre-training tasks on large-scale unlabeled corpora.

LLM

LLMs are obtained by scaling up PLMs (in model size or data size).

Three differences between PLMs and LLMs:

1. LLMs display surprising emergent abilities that may not be observed in smaller PLMs.

2. LLMs are accessed through a prompting interface (e.g., the GPT-4 API).

3. The development of LLMs no longer draws a clear line between research and engineering: training LLMs requires practical experience with large-scale data processing and distributed parallel training.

Background for LLMs

LLMs refer to Transformer language models that contain hundreds of billions (or more) of parameters and are trained on massive text data.

Scaling Laws for LLMs

LLMs adopt the same Transformer architecture and pre-training objectives (e.g., language modeling) as small language models, but significantly scale up model size, data size, and total compute.

KM scaling law

The KM scaling law models the performance of neural language models as a power-law function of three factors: model size (N), dataset size (D), and the amount of training compute (C).

The three laws were derived by fitting the model performance with varied data sizes (22M to 23B tokens), model sizes (768M to 1.5B non-embedding parameters) and training compute, under some assumptions (e.g., the analysis of one factor should not be bottlenecked by the other two factors).
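
For reference, the functional forms of the KM scaling law as reported by Kaplan et al. (2020); the fitted constants are quoted from that paper and should be read as approximate:

$$
\begin{aligned}
L(N) &= \left(\tfrac{N_c}{N}\right)^{\alpha_N}, & \alpha_N &\approx 0.076, & N_c &\approx 8.8\times 10^{13} \\
L(D) &= \left(\tfrac{D_c}{D}\right)^{\alpha_D}, & \alpha_D &\approx 0.095, & D_c &\approx 5.4\times 10^{13} \\
L(C) &= \left(\tfrac{C_c}{C}\right)^{\alpha_C}, & \alpha_C &\approx 0.050, & C_c &\approx 3.1\times 10^{8}
\end{aligned}
$$

Here $L(\cdot)$ is the cross-entropy loss in nats, $N$ counts non-embedding parameters, $D$ is measured in tokens, and $C$ in PF-days.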

Chinchilla scaling law

Hoffmann et al. conducted rigorous experiments by varying a larger range of model sizes (70M to 16B parameters) and data sizes (5B to 500B tokens) and fitted a similar scaling law, yet with different coefficients. The KM scaling law favors a larger budget allocation in model size than in data size, while the Chinchilla scaling law argues that the two sizes should be increased in equal scales.
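
For comparison, the Chinchilla scaling law (Hoffmann et al., 2022) fits the loss jointly in $N$ and $D$; the constants below are the published fits and should be treated as approximate:

$$
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28
$$

Minimizing $L$ under the compute constraint $C \approx 6ND$ gives the compute-optimal allocation

$$
N_{\mathrm{opt}}(C) = G\left(\frac{C}{6}\right)^{a},\qquad
D_{\mathrm{opt}}(C) = G^{-1}\left(\frac{C}{6}\right)^{b},\qquad
a = \frac{\alpha}{\alpha+\beta},\; b = \frac{\beta}{\alpha+\beta},
$$

where $G$ is a coefficient computed from $A$, $B$, $\alpha$, $\beta$. Since $a \approx b \approx 0.5$, model size and data size should be scaled up at roughly the same rate, which is the "equal scales" point above.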

Problem:

However, some abilities (e.g., in-context learning) are unpredictable according to the scaling law, and can be observed only when the model size exceeds a certain level (as discussed below).

Emergent Abilities of LLMs

Emergent abilities of LLMs are formally defined as "the abilities that are not present in small models but arise in large models".

Three typical emergent abilities for LLMs:

In-context learning

Assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of the input text, without requiring additional training or gradient updates.
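
A minimal sketch of what an in-context learning prompt looks like (the task, demonstrations, and labels are invented for illustration); the model simply continues the text, so no gradient update is involved.

```python
# Build a few-shot sentiment-classification prompt; all examples are made up.
demonstrations = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I regret paying for this.", "negative"),
]
test_input = "The soundtrack alone is worth the ticket."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in demonstrations:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {test_input}\nSentiment:"   # the LLM completes this line

print(prompt)
```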

Instruction following

By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions.
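
A sketch of how a single instruction-tuning example is often formatted; the instruction/input/output field names follow the common Alpaca-style convention rather than anything specified in the survey, and the content is invented.

```python
# One training example for instruction tuning, serialized into a single text sequence.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Scaling laws describe how language model loss decreases as model size, "
             "data size, and compute grow.",
    "output": "Scaling laws relate a language model's loss to its size, data, and compute.",
}

formatted = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Input:\n{example['input']}\n\n"
    f"### Response:\n{example['output']}"
)
print(formatted)  # the model is fine-tuned to generate the response given the rest
```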

Step-by-step reasoning

With the chain-of-thought (CoT) prompting strategy, LLMs can solve complex tasks that involve multiple reasoning steps (e.g., mathematical word problems) by utilizing a prompting mechanism that includes intermediate reasoning steps for deriving the final answer.
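
A small illustration of the difference between a standard prompt and a few-shot CoT prompt; the arithmetic example and wording are invented.

```python
question = "A library had 120 books, bought 45 more, then lent out 30. How many remain?"

# Standard prompting: ask directly for the answer.
standard_prompt = f"Q: {question}\nA:"

# Few-shot CoT prompting: the demonstration shows intermediate reasoning steps,
# nudging the model to reason step by step before giving the final answer.
cot_prompt = (
    "Q: A farm had 15 cows and sold 6. How many cows are left?\n"
    "A: The farm started with 15 cows. It sold 6, so 15 - 6 = 9. The answer is 9.\n\n"
    f"Q: {question}\nA:"
)

print(cot_prompt)
```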

Key Techniques for LLMs

Scaling

Larger model/data sizes and more training compute typically lead to an improved model capacity.

Training

To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed and Megatron-LM
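
As a hedged sketch of what this looks like in practice with DeepSpeed: the configuration keys below are standard DeepSpeed options to the best of my knowledge, the toy model and random data are placeholders, and such a script is normally launched with the deepspeed launcher rather than run directly.

```python
# Hedged sketch of DeepSpeed-based distributed training; values are placeholders.
import deepspeed
import torch

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},                               # mixed-precision training
    "zero_optimization": {"stage": 2},                       # ZeRO stage 2: shard optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)                          # stand-in for a real Transformer
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for _ in range(10):                                          # toy training loop with random data
    x = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).pow(2).mean()
    engine.backward(loss)                                    # gradient sync and loss scaling handled by DeepSpeed
    engine.step()                                            # optimizer step and ZeRO bookkeeping
```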

Ability eliciting

These abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities.

Alignment tuning

Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is therefore necessary to align LLMs with human values, e.g., helpful, honest, and harmless.

InstructGPT designs an effective tuning approach that enables LLMs to follow expected instructions, which utilizes the technique of reinforcement learning from human feedback (RLHF).
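
The notes only name the technique, so as one concrete piece of the RLHF pipeline, the snippet below sketches the pairwise loss commonly used to train the reward model on human preference comparisons (a chosen vs. a rejected response); the scalar scores are toy values standing in for real reward-model outputs.

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Minimizing it pushes the reward model to score the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

# Toy scores a reward model might assign to two candidate responses.
print(reward_model_loss(score_chosen=2.1, score_rejected=0.3))  # small loss: ranking already correct
print(reward_model_loss(score_chosen=0.3, score_rejected=2.1))  # large loss: ranking is wrong
```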

Tools manipulation

For example, LLMs can use a calculator for accurate computation and employ search engines to retrieve unknown information.
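
A toy sketch of the tool-use loop; the CALC[...] call syntax and the ask_llm stub are invented for illustration. The model emits a marked tool call, an external tool executes it, and the result is spliced back into the reply.

```python
import re

def ask_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call; here it always requests the calculator."""
    return "To answer, I need CALC[37 * 24]."

def run_with_tools(question: str) -> str:
    reply = ask_llm(question)
    match = re.search(r"CALC\[(.+?)\]", reply)                   # detect a calculator call
    if match:
        result = eval(match.group(1), {"__builtins__": {}})      # toy calculator; never eval untrusted input in practice
        reply = reply.replace(match.group(0), str(result))       # splice the tool result back in
    return reply

print(run_with_tools("What is 37 * 24?"))  # -> "To answer, I need 888."
```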
