A Brief History: from GPT-1 to GPT-3

This is my reading notes of 《Developing Apps with GPT-4 and ChatGPT》.

In this section, we will introduce the evolution of the OpenAI GPT medels from GPT-1 to GPT-4.

GPT-1

In mid-2018, OpenAI published a paper titled "Improving Language Understanding by Generative Pre-Training" by Radford, Alec, et al. in which they introduced the Generative Pre-trained Transformer, also known as GPT-1.

The full name of GPT is Generative Pre-trained Transformer.

Before GPT-1, the common approach to building high-performance NLP neural models relied on supervised learning which needs large amounts of manually labeled data. However, the need for large amounts of well-annotated supervised data has limited the performance of these techniques because such datasets are both difficult and expensive to generate.

The authors of GPT-1 proposed a new learning process where an unsupervised pre-training step is introduced. In this step, no labeled data is needed.Instead, the model is trained to predict what the next token is.

The GPT-1 model used the BooksCorpus dataset for the pre-training which is a dataset containing the text of approximately 11,000 unpublished books.

In the unsupervised learning phase, the model learned to predict the next item in the texts of the BookCorpus dataset.

However, because the model is small, it was unable to perform complex tasks without fine-tuning.

To adapt the model to a specific target task, a second supervised learning step, called fine-tuning, was performed on a small set fo manually labeled data.

The process of fine-tuning allowed the parameters learned in the initial pre-training phase to be modified to fit the task at hand better.

In contrast to other NLP neural models, GPT-1 showed remarkable performance on several NLP tasks using only a small amount of manually labeled data for fine-tuning.

NOTE

GPT-1 was trained in two stages:


Stage 1: Unsupervised Pre-training
Goal: To learn general language patterns and presentations.
Method: The model is trained to predict the next token in the sentence.
Data: A large unlabeled text dataset
Type of Learning: Unsupervised learning -- no manual labels are needed.
Outcome: The model learns a strong general understanding of language, but it's not yet specialized for specific tasks(e.g., sentiment analysis or question answering)


Stage 2: Supervise Fine-tuning
Goal: To adapt the pre-trained model to a specific downstream task.
Method: The model is further trained on a small labeled dataset.
Type of Learning: Supervised learning -- the data includes input-output pairs(e.g., a sentence and its sentiment label).
Outcome: The model's parameters are fine-tuned so it performs better on that particular task.


Summary:
  • Pre-training teaches the model how language works(general knowledge).
  • Fine-tuning teaches the model how to perform a specific task(specialized skills).

A good analogy would be:

The model first read lots of books to be smart(pre-training), and then takes a short course to learn a particular job(fine-tuning).

The architecture of GPT-1 was a similar encoder from the original transformer, introduced in 2017, with 117 million parameters.

This first GPT model paved the way for future models with larger datasets and more parameters to take better advantage of the potential of the transformer architectures.

GPT-2

In early 2019, OpenAI proposed GPT-2, a scaled-up version of the GPT-1 model, increasing the number of parameters and the size of the training dataset tenfold.

The number of parameters of this new version was 1.5 billion, trained on 40 GB of text.

In November 2019, OpenAI released the full version of the GPT-2 language model.

GPT-2 is publicly available and can be downloaded from Huggingface or GitHub.

GPT-2 showed that training a larger language model on a larger dataset improves the ability of a language model to understand tasks and outperforms the state-of-art on many jobs.

GPT-3

GPT-3 was released by OpenAI in June 2020.

The main differences between GPT-2 and GPT-3 are the size of the model and the quantity of data used for the training.

GPT-3 is a much larger model, with 175 billion parameters, allowing it to capture more complex pattern.

In addition, GPT-3 is trained on a more extensive dataset.

This includes Common Crawl, a large web archive containing text from billions of web pages and other sources, such as Wikipedia.

This training dataset, which includes content from websites, books, and articles, allows GPT-3 to develop a deeper understanding of the language and context.

As a result, GPT-3 improved performance on a variety of linguistic tasks.

GPT-3 eliminates the need for a fine-tuning step that was mandatory for its predecessors.

NOTE

How GPT-3 eliminates the need for fine-tuning:

GPT-3 is trained on a massive amount of data, and it's much larger than GPT-1 and GPT-2 -- with 175 billion parameters.

Because of the scale, GPT-3 learns very strong general language skills during pre-training alone.


Instead of fine-tuning, GPT-3 uses:
  1. Zero-shot learning
    Just give it a task description in plain text -- no example needed.
  2. One-shot learning
    Give it one example in the prompt to show what kind of answer you want.
  3. Few-shot learning
    Give it a few examples in the prompt, and it learns the pattern on the fly.

So in short:

GPT-3 doesn't need fine-tuning because it can understand and adapt to new tasks just by seeing a few examples in the input prompt --- thanks to its massive scale and powerful pre-training.


GPT-3 is indeed capable of handling many tasks without traditional fine-tuning, but that doesn't mean it completely lacks support for or never uses fine-tuning.

GPT-3's default approach: Few-shot / Zero-shot Learning

What makes GPT-3 so impressive is that it can:

  • Perform tasks without retraining (fine-tuning)
  • Learn through prompts alone
Does GPT-3 support fine-tuning?

Yes! OpenAI eventually provided a fine-tuning API for GPT-3, which is useful in scenarios like:

  • When you have domain-specific data (e.g., legal, medical).

  • When you want the model to maintain a consistent tone or writing style.

  • When you need a stable and structured output format (e.g., JSON).

  • When prompt engineering isn't sufficient.


To summarize:
  1. Does GPT-3 need fine-tuning?

    Usually no --- few-shot/zero-shot learning is enough for most tasks.

  2. Does GPT-3 support fine-tuning?
    Yes, especially useful for domain-specific or high-requirement tasks.

相关推荐
lxmyzzs16 分钟前
【图像算法 - 21】慧眼识虫:基于深度学习与OpenCV的农田害虫智能识别系统
人工智能·深度学习·opencv·算法·yolo·目标检测·计算机视觉
Gloria_niki17 分钟前
机器学习之K 均值聚类算法
人工智能·机器学习
AI人工智能+24 分钟前
表格识别技术:通过图像处理与深度学习,将非结构化表格转化为可编辑结构化数据,推动智能化发展
人工智能·深度学习·ocr·表格识别
深圳多奥智能一卡(码、脸)通系统37 分钟前
智能二维码QR\刷IC卡\人脸AI识别梯控系统功能设计需基于模块化架构,整合物联网、生物识别、权限控制等技术,以下是多奥分层次的系统设计框架
人工智能·门禁·电梯门禁·二维码梯控·梯控·电梯
批量小王子39 分钟前
2025-08-19利用opencv检测图片中文字及图片的坐标
人工智能·opencv·计算机视觉
没有梦想的咸鱼185-1037-16632 小时前
SWMM排水管网水力、水质建模及在海绵与水环境中的应用
数据仓库·人工智能·数据挖掘·数据分析
即兴小索奇2 小时前
【无标题】
人工智能·ai·商业·ai商业洞察·即兴小索奇
国际学术会议-杨老师2 小时前
2025年计算机视觉与图像国际会议(ICCVI 2025)
人工智能·计算机视觉
欧阳小猜2 小时前
深度学习②【优化算法(重点!)、数据获取与模型训练全解析】
人工智能·深度学习·算法
fsnine3 小时前
深度学习——神经网络
人工智能·深度学习·神经网络