A Brief History: from GPT-1 to GPT-3

These are my reading notes on "Developing Apps with GPT-4 and ChatGPT".

In this section, we will trace the evolution of the OpenAI GPT models from GPT-1 to GPT-3.

GPT-1

In mid-2018, OpenAI published a paper titled "Improving Language Understanding by Generative Pre-Training" by Alec Radford et al., in which they introduced the Generative Pre-trained Transformer, also known as GPT-1.

The full name of GPT is Generative Pre-trained Transformer.

Before GPT-1, the common approach to building high-performance NLP neural models relied on supervised learning, which needs large amounts of manually labeled data. This dependence on well-annotated data limited the performance of these techniques, because such datasets are both difficult and expensive to produce.

The authors of GPT-1 proposed a new learning process that introduces an unsupervised pre-training step. In this step, no labeled data is needed. Instead, the model is trained to predict what the next token is.

For pre-training, the GPT-1 model used the BookCorpus dataset, which contains the text of approximately 11,000 unpublished books.

In the unsupervised learning phase, the model learned to predict the next token in the texts of the BookCorpus dataset.
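To make the next-token objective concrete, here is a minimal sketch in Python. It is not how GPT-1 was trained: GPT-1 used a transformer over subword tokens, while this sketch swaps in a simple bigram count model with whitespace tokenization. The sample sentence and all function names are made up for illustration. The point is only that the training targets come from the text itself, so no manual labels are needed.

```python
from collections import Counter, defaultdict


def make_training_pairs(tokens, context_size=3):
    """Build (context, next_token) pairs; the raw text supplies its own targets."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tuple(tokens[max(0, i - context_size):i])
        pairs.append((context, tokens[i]))
    return pairs


def train_bigram(tokens):
    """Count how often each token follows each other token (a stand-in for a real model)."""
    counts = defaultdict(Counter)
    for previous, following in zip(tokens, tokens[1:]):
        counts[previous][following] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if the token was never seen."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]


# Hypothetical toy corpus; a real corpus would be many orders of magnitude larger.
tokens = "the cat sat on the mat and the cat slept".split()
pairs = make_training_pairs(tokens)
model = train_bigram(tokens)

print(pairs[0])                    # (('the',), 'cat')
print(predict_next(model, "the"))  # cat
```

A transformer replaces the count table with learned parameters and conditions on a much longer context, but the supervision signal is the same: the next token in the text.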

However, because the model was small, it was unable to perform complex tasks without fine-tuning.

To adapt the model to a specific target task, a second supervised learning step, called fine-tuning, was performed on a small set of manually labeled data.

The process of fine-tuning allowed the parameters learned in the initial pre-training phase to be modified to fit the task at hand better.

In contrast to other NLP neural models, GPT-1 showed remarkable performance on several NLP tasks using only a small amount of manually labeled data for fine-tuning.

NOTE

GPT-1 was trained in two stages:


Stage 1: Unsupervised Pre-training
Goal: To learn general language patterns and representations.
Method: The model is trained to predict the next token in the text.
Data: A large unlabeled text dataset.
Type of Learning: Unsupervised learning -- no manual labels are needed.
Outcome: The model learns a strong general understanding of language, but it's not yet specialized for specific tasks (e.g., sentiment analysis or question answering).


Stage 2: Supervised Fine-tuning
Goal: To adapt the pre-trained model to a specific downstream task.
Method: The model is further trained on a small labeled dataset.
Type of Learning: Supervised learning -- the data includes input-output pairs (e.g., a sentence and its sentiment label).
Outcome: The model's parameters are fine-tuned so it performs better on that particular task.
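The core idea of Stage 2, continuing training from existing parameters on a small labeled set, can be sketched as follows. This is a toy under loud assumptions: a two-feature logistic-regression classifier stands in for the language model, the "pre-trained" weights are hypothetical numbers rather than the result of real pre-training, and the four labeled examples are invented.

```python
import math


def predict(weights, bias, features):
    """Probability of the positive class under a logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def average_loss(weights, bias, labeled_data):
    """Mean cross-entropy over (features, label) pairs."""
    total = 0.0
    for features, label in labeled_data:
        p = predict(weights, bias, features)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(labeled_data)


def fine_tune(weights, bias, labeled_data, learning_rate=0.5, steps=100):
    """Start from the given parameters and adjust them with gradient descent."""
    weights = list(weights)  # copy so the starting parameters are left untouched
    n = len(labeled_data)
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in labeled_data:
            error = predict(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical starting point standing in for parameters learned in Stage 1.
pretrained_weights, pretrained_bias = [0.2, -0.1], 0.0

# Small labeled dataset of input-output pairs (invented for illustration).
labeled_data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.1, 1.0], 0), ([0.0, 0.8], 0)]

loss_before = average_loss(pretrained_weights, pretrained_bias, labeled_data)
tuned_weights, tuned_bias = fine_tune(pretrained_weights, pretrained_bias, labeled_data)
loss_after = average_loss(tuned_weights, tuned_bias, labeled_data)
print(loss_after < loss_before)
```

In GPT-1 the same principle applies at a much larger scale: the transformer's pre-trained parameters are the starting point, and the labeled task data nudges them toward the target task.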


Summary:
  • Pre-training teaches the model how language works (general knowledge).
  • Fine-tuning teaches the model how to perform a specific task (specialized skills).

A good analogy would be:

The model first reads lots of books to build general knowledge (pre-training), and then takes a short course to learn a particular job (fine-tuning).

The architecture of GPT-1 was similar to the decoder of the original transformer, introduced in 2017, and had 117 million parameters.
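As a rough sanity check on the parameter count, the arithmetic can be sketched as below. The hyperparameters are not from these notes: they are assumptions taken from the GPT-1 paper and the publicly released model (12 layers, hidden size 768, feed-forward size 3072, context length 512, vocabulary of 40,478 tokens), and the sketch assumes the output layer shares weights with the token embeddings.

```python
def gpt1_parameter_count(
    vocab_size=40478,        # assumed: vocabulary size of the released model
    context_length=512,      # assumed: number of learned position embeddings
    hidden_size=768,         # assumed: model width
    feed_forward_size=3072,  # assumed: inner width of each feed-forward block
    num_layers=12,           # assumed: number of decoder blocks
):
    """Estimate the number of trainable parameters in a GPT-1-sized decoder."""
    token_embeddings = vocab_size * hidden_size
    position_embeddings = context_length * hidden_size

    # Query, key, value, and output projections, each with weights and a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)

    # Two linear layers: hidden -> feed-forward -> hidden, each with a bias.
    feed_forward = (hidden_size * feed_forward_size + feed_forward_size) + (
        feed_forward_size * hidden_size + hidden_size
    )

    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden_size)

    per_layer = attention + feed_forward + layer_norms
    return token_embeddings + position_embeddings + num_layers * per_layer


total = gpt1_parameter_count()
print(f"{total / 1e6:.1f} million parameters")
```

Under these assumptions the estimate lands at roughly 116.5 million, which is in line with the figure of 117 million quoted above.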

This first GPT model paved the way for future models that used larger datasets and more parameters to take better advantage of the transformer architecture.

GPT-2

In early 2019, OpenAI proposed GPT-2, a scaled-up version of the GPT-1 model, increasing the number of parameters and the size of the training dataset tenfold.

The number of parameters of this new version was 1.5 billion, trained on 40 GB of text.

In November 2019, OpenAI released the full version of the GPT-2 language model.

GPT-2 is publicly available and can be downloaded from Hugging Face or GitHub.

GPT-2 showed that training a larger language model on a larger dataset improves its ability to understand tasks, allowing it to outperform the state of the art on many of them.

GPT-3

GPT-3 was released by OpenAI in June 2020.

The main differences between GPT-2 and GPT-3 are the size of the model and the quantity of data used for the training.

GPT-3 is a much larger model, with 175 billion parameters, allowing it to capture more complex patterns.

In addition, GPT-3 is trained on a more extensive dataset.

This includes Common Crawl, a large web archive containing text from billions of web pages and other sources, such as Wikipedia.

This training dataset, which includes content from websites, books, and articles, allows GPT-3 to develop a deeper understanding of language and context.

As a result, GPT-3 achieved better performance on a variety of linguistic tasks.

GPT-3 eliminates the need for a fine-tuning step that was mandatory for its predecessors.

NOTE

How GPT-3 eliminates the need for fine-tuning:

GPT-3 is trained on a massive amount of data, and it's much larger than GPT-1 and GPT-2 -- with 175 billion parameters.

Because of the scale, GPT-3 learns very strong general language skills during pre-training alone.


Instead of fine-tuning, GPT-3 uses:
  1. Zero-shot learning
    Just give it a task description in plain text -- no example needed.
  2. One-shot learning
    Give it one example in the prompt to show what kind of answer you want.
  3. Few-shot learning
    Give it a few examples in the prompt, and it learns the pattern on the fly.

So in short:

GPT-3 doesn't need fine-tuning because it can understand and adapt to new tasks just by seeing a few examples in the input prompt --- thanks to its massive scale and powerful pre-training.
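The three prompting styles above differ only in how many worked examples are placed in the prompt. Here is a minimal sketch that builds each kind of prompt as plain text. The task wording, the "Review:" / "Sentiment:" layout, and the example reviews are all hypothetical choices for illustration, and no model is called.

```python
def build_prompt(task_description, examples, query):
    """Assemble a prompt from a task description, optional worked examples, and the new input."""
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item is left unanswered so the model completes it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


task = "Classify each review as positive or negative."
examples = [
    ("I loved every minute of it.", "positive"),
    ("A complete waste of time.", "negative"),
    ("The acting was superb.", "positive"),
]
query = "The plot made no sense."

zero_shot = build_prompt(task, [], query)             # task description only
one_shot = build_prompt(task, examples[:1], query)    # one worked example
few_shot = build_prompt(task, examples, query)        # several worked examples

print(few_shot)
```

The model's parameters are never updated in any of the three cases; the examples only shape the context the model conditions on when it predicts the next tokens.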


GPT-3 is indeed capable of handling many tasks without traditional fine-tuning, but that does not mean fine-tuning is unsupported or never used.

GPT-3's default approach: Few-shot / Zero-shot Learning

What makes GPT-3 so impressive is that it can:

  • Perform tasks without retraining (fine-tuning)
  • Learn through prompts alone

Does GPT-3 support fine-tuning?

Yes! OpenAI eventually provided a fine-tuning API for GPT-3, which is useful in scenarios like:

  • When you have domain-specific data (e.g., legal, medical).

  • When you want the model to maintain a consistent tone or writing style.

  • When you need a stable and structured output format (e.g., JSON).

  • When prompt engineering isn't sufficient.
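When fine-tuning is the right tool, the training data is a set of labeled input-output pairs. The sketch below serializes such pairs as JSON Lines with `prompt` and `completion` fields, which is the shape the original GPT-3 fine-tuning endpoint accepted. Treat the field names as an assumption to check against the current OpenAI documentation. The records are invented examples, and nothing is uploaded or trained here.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Hypothetical training examples that ask for a stable, structured (JSON) output.
records = [
    {
        "prompt": "Extract the city: I flew to Paris last week.",
        "completion": ' {"city": "Paris"}',
    },
    {
        "prompt": "Extract the city: She moved to Osaka in 2019.",
        "completion": ' {"city": "Osaka"}',
    },
]

jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Each line is an independent JSON object, so a file in this format can be streamed and validated one example at a time.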


To summarize:
  1. Does GPT-3 need fine-tuning?
    Usually no: few-shot or zero-shot learning is enough for most tasks.
  2. Does GPT-3 support fine-tuning?
    Yes, and it is especially useful for domain-specific or demanding tasks.
