A Brief History: from GPT-1 to GPT-3

These are my reading notes on "Developing Apps with GPT-4 and ChatGPT".

In this section, we will introduce the evolution of the OpenAI GPT models from GPT-1 to GPT-3.

GPT-1

In mid-2018, OpenAI published a paper titled "Improving Language Understanding by Generative Pre-Training" by Alec Radford et al., in which they introduced the Generative Pre-trained Transformer, also known as GPT-1.

Before GPT-1, the common approach to building high-performance NLP models relied on supervised learning, which requires large amounts of manually labeled data. This need for well-annotated data limited the performance of these techniques, because such datasets are both difficult and expensive to produce.

The authors of GPT-1 proposed a new learning process that introduces an unsupervised pre-training step. In this step, no labeled data is needed; instead, the model is trained to predict the next token.
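
The next-token objective can be illustrated with a deliberately tiny stand-in: a bigram counter that predicts the most frequent follower of each word. The corpus and the counting approach here are invented for illustration only; GPT-1 learns this objective with transformer weights trained by gradient descent, not by counting.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for a large unlabeled dataset (illustrative only).
corpus = "the model reads the book and the model predicts the next word".split()

# Count how often each token follows each other token.
transitions = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    transitions[current][following] += 1

def predict_next(token):
    """Return the most frequent next token observed after `token`, or None."""
    followers = transitions[token]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # "model": the most frequent follower of "the"
```

No labels were ever provided: the "supervision signal" is simply the next word in the raw text, which is what makes this step unsupervised.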

The GPT-1 model was pre-trained on the BooksCorpus dataset, which contains the text of approximately 11,000 unpublished books.

In the unsupervised learning phase, the model learned to predict the next token in the texts of the BooksCorpus dataset.

However, because the model was small, it was unable to perform complex tasks without fine-tuning.

To adapt the model to a specific target task, a second supervised learning step, called fine-tuning, was performed on a small set of manually labeled data.

The process of fine-tuning allowed the parameters learned in the initial pre-training phase to be modified to fit the task at hand better.

In contrast to other NLP neural models, GPT-1 showed remarkable performance on several NLP tasks using only a small amount of manually labeled data for fine-tuning.

NOTE

GPT-1 was trained in two stages:


Stage 1: Unsupervised Pre-training
Goal: To learn general language patterns and representations.
Method: The model is trained to predict the next token in a sequence.
Data: A large unlabeled text dataset.
Type of Learning: Unsupervised learning -- no manual labels are needed.
Outcome: The model learns a strong general understanding of language, but it's not yet specialized for specific tasks (e.g., sentiment analysis or question answering).


Stage 2: Supervised Fine-tuning
Goal: To adapt the pre-trained model to a specific downstream task.
Method: The model is further trained on a small labeled dataset.
Type of Learning: Supervised learning -- the data consists of input-output pairs (e.g., a sentence and its sentiment label).
Outcome: The model's parameters are fine-tuned so it performs better on that particular task.


Summary:
  • Pre-training teaches the model how language works (general knowledge).
  • Fine-tuning teaches the model how to perform a specific task (specialized skills).

A good analogy would be:

The model first reads lots of books to become knowledgeable (pre-training), and then takes a short course to learn a particular job (fine-tuning).
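
The two stages above can be sketched with a toy analogy in code. This is not GPT-1's actual method: the "parameters" here are just per-word scores, initialized during a fake pre-training pass over unlabeled text and then adjusted by a handful of invented labeled examples, mirroring how fine-tuning modifies the weights learned in pre-training.

```python
# Stage 1: unsupervised "pre-training" on unlabeled text. A real GPT learns
# transformer weights via next-token prediction; this toy merely builds a
# vocabulary whose zero-initialized scores play the role of parameters.
unlabeled_text = "the film was great the plot was terrible the acting was great"
params = {word: 0 for word in unlabeled_text.split()}

# Stage 2: supervised fine-tuning on a small labeled dataset of
# input-output pairs, which adjusts the pre-trained parameters.
labeled_data = [
    ("the film was great", "positive"),
    ("the plot was terrible", "negative"),
]
for sentence, label in labeled_data:
    for word in sentence.split():
        params[word] += 1 if label == "positive" else -1

def classify(sentence):
    # Words never seen during "pre-training" contribute nothing.
    score = sum(params.get(word, 0) for word in sentence.split())
    return "positive" if score >= 0 else "negative"

print(classify("the acting was great"))  # positive
```

The key point the sketch preserves: the same parameters exist before and after stage 2; fine-tuning only nudges them toward the target task.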

The architecture of GPT-1 was similar to the decoder of the original transformer, introduced in 2017, and it had 117 million parameters.

This first GPT model paved the way for future models with larger datasets and more parameters to take better advantage of the potential of the transformer architectures.

GPT-2

In early 2019, OpenAI proposed GPT-2, a scaled-up version of the GPT-1 model, increasing the number of parameters and the size of the training dataset tenfold.

This new version had 1.5 billion parameters and was trained on 40 GB of text.

In November 2019, OpenAI released the full version of the GPT-2 language model.

GPT-2 is publicly available and can be downloaded from Hugging Face or GitHub.

GPT-2 showed that training a larger language model on a larger dataset improves its ability to understand tasks and outperforms the state of the art on many of them.

GPT-3

GPT-3 was released by OpenAI in June 2020.

The main differences between GPT-2 and GPT-3 are the size of the model and the quantity of data used for the training.

GPT-3 is a much larger model, with 175 billion parameters, allowing it to capture more complex patterns.

In addition, GPT-3 is trained on a more extensive dataset.

This includes Common Crawl, a large web archive containing text from billions of web pages, as well as other sources such as Wikipedia.

This training dataset, which includes content from websites, books, and articles, allows GPT-3 to develop a deeper understanding of the language and context.

As a result, GPT-3 improved performance on a variety of linguistic tasks.

GPT-3 eliminates the need for a fine-tuning step that was mandatory for its predecessors.

NOTE

How GPT-3 eliminates the need for fine-tuning:

GPT-3 is trained on a massive amount of data and, with 175 billion parameters, is much larger than GPT-1 and GPT-2.

Because of the scale, GPT-3 learns very strong general language skills during pre-training alone.


Instead of fine-tuning, GPT-3 uses:
  1. Zero-shot learning
    Just give it a task description in plain text -- no examples needed.
  2. One-shot learning
    Give it one example in the prompt to show what kind of answer you want.
  3. Few-shot learning
    Give it a few examples in the prompt, and it learns the pattern on the fly.
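
The three prompting styles are easiest to see side by side. The sketch below uses the English-to-French translation task from the GPT-3 paper; the prompts are plain strings, since no weight update is involved in any of the three.

```python
# Zero-shot: only a task description, no examples.
zero_shot = "Translate English to French: cheese =>"

# One-shot: a single worked example before the real query.
one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

# Few-shot: several worked examples; the model infers the pattern on the fly.
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
```

All the "learning" happens inside the forward pass over the prompt, which is why this is also called in-context learning.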

So in short:

GPT-3 doesn't need fine-tuning because it can understand and adapt to new tasks just by seeing a few examples in the input prompt, thanks to its massive scale and powerful pre-training.


GPT-3 is indeed capable of handling many tasks without traditional fine-tuning, but that doesn't mean it lacks support for fine-tuning or never uses it.

GPT-3's default approach: Few-shot / Zero-shot Learning

What makes GPT-3 so impressive is that it can:

  • Perform tasks without retraining (fine-tuning)
  • Learn through prompts alone

Does GPT-3 support fine-tuning?

Yes! OpenAI eventually provided a fine-tuning API for GPT-3, which is useful in scenarios like:

  • When you have domain-specific data (e.g., legal, medical).

  • When you want the model to maintain a consistent tone or writing style.

  • When you need a stable and structured output format (e.g., JSON).

  • When prompt engineering isn't sufficient.
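
For the scenarios above, training data for the legacy GPT-3 fine-tuning API was supplied as JSON Lines: one {"prompt", "completion"} pair per line. The sketch below builds such a payload in memory; the review sentences are invented, and the exact upload mechanics are omitted.

```python
import json

# Invented labeled examples in the prompt/completion shape the legacy
# GPT-3 fine-tuning data format expected (one JSON object per line).
examples = [
    {"prompt": "Review: Great acting and a moving plot.\nSentiment:",
     "completion": " positive"},
    {"prompt": "Review: Dull, predictable, and far too long.\nSentiment:",
     "completion": " negative"},
]

# JSONL is just newline-separated JSON objects.
jsonl = "\n".join(json.dumps(record) for record in examples)
print(jsonl.splitlines()[0])
```

Note the leading space in each completion and the consistent "Sentiment:" suffix: a stable, repeated structure in the training pairs is what buys the consistent tone and output format that prompting alone may not guarantee.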


To summarize:
  1. Does GPT-3 need fine-tuning?
    Usually no: few-shot/zero-shot learning is enough for most tasks.
  2. Does GPT-3 support fine-tuning?
    Yes, and it is especially useful for domain-specific or high-requirement tasks.
