A Brief History: from GPT-1 to GPT-3

These are my reading notes on "Developing Apps with GPT-4 and ChatGPT".

In this section, we will introduce the evolution of the OpenAI GPT models from GPT-1 to GPT-3.

GPT-1

In mid-2018, OpenAI published a paper titled "Improving Language Understanding by Generative Pre-Training" by Radford, Alec, et al. in which they introduced the Generative Pre-trained Transformer, also known as GPT-1.

The full name of GPT is Generative Pre-trained Transformer.

Before GPT-1, the common approach to building high-performance NLP neural models relied on supervised learning, which requires large amounts of manually labeled data. This reliance on large, well-annotated datasets limited the performance of these techniques, because such datasets are both difficult and expensive to produce.

The authors of GPT-1 proposed a new learning process in which an unsupervised pre-training step is introduced. In this step, no labeled data is needed. Instead, the model is trained to predict the next token in a sequence.

For pre-training, the GPT-1 model used the BooksCorpus dataset, which contains the text of approximately 11,000 unpublished books.

In the unsupervised learning phase, the model learned to predict the next token in the texts of the BooksCorpus dataset.

However, because the model was small, it was unable to perform complex tasks without fine-tuning.

To adapt the model to a specific target task, a second supervised learning step, called fine-tuning, was performed on a small set of manually labeled data.

The process of fine-tuning allowed the parameters learned in the initial pre-training phase to be modified to fit the task at hand better.

In contrast to other NLP neural models, GPT-1 showed remarkable performance on several NLP tasks using only a small amount of manually labeled data for fine-tuning.

NOTE

GPT-1 was trained in two stages:


Stage 1: Unsupervised Pre-training
Goal: To learn general language patterns and representations.
Method: The model is trained to predict the next token in a sequence.
Data: A large unlabeled text dataset.
Type of Learning: Unsupervised learning -- no manual labels are needed.
Outcome: The model learns a strong general understanding of language, but it is not yet specialized for specific tasks (e.g., sentiment analysis or question answering).


Stage 2: Supervised Fine-tuning
Goal: To adapt the pre-trained model to a specific downstream task.
Method: The model is further trained on a small labeled dataset.
Type of Learning: Supervised learning -- the data includes input-output pairs (e.g., a sentence and its sentiment label).
Outcome: The model's parameters are fine-tuned so it performs better on that particular task.


Summary:
  • Pre-training teaches the model how language works (general knowledge).
  • Fine-tuning teaches the model how to perform a specific task (specialized skills).

A good analogy would be:

The model first reads lots of books to become knowledgeable (pre-training), and then takes a short course to learn a particular job (fine-tuning).
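
To make the two stages concrete, here is a minimal sketch in PyTorch (my own illustration, not code from the book). The tiny model, random token IDs, and sentiment-style classification head are stand-ins rather than the real GPT-1 architecture or data; the point is only that Stage 1 needs no labels, while Stage 2 reuses the same parameters with a small labeled set.

# Conceptual sketch of GPT-1's two training stages, assuming PyTorch.
# The tiny model, random "text", and labels below are illustrative only.
import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 100, 32, 2

class TinyCausalLM(nn.Module):
    """A deliberately tiny stand-in for GPT-1's transformer decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(embed_dim, vocab_size)    # used in Stage 1
        self.cls_head = nn.Linear(embed_dim, num_classes)  # added for Stage 2

    def forward(self, tokens):
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.block(self.embed(tokens), src_mask=causal_mask)

model = TinyCausalLM()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: unsupervised pre-training -- predict the next token, no labels needed.
text = torch.randint(0, vocab_size, (8, 16))       # stand-in for unlabeled text
hidden = model(text[:, :-1])
logits = model.lm_head(hidden)                     # (batch, seq-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), text[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Stage 2: supervised fine-tuning -- a small labeled set adjusts the same parameters.
sentences = torch.randint(0, vocab_size, (8, 16))  # stand-in for labeled sentences
labels = torch.randint(0, num_classes, (8,))       # e.g., sentiment labels
hidden = model(sentences)
logits = model.cls_head(hidden[:, -1])             # classify from the last position
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

Note that in Stage 2 the optimizer updates all of the model's parameters, which is what it means for the parameters learned during pre-training to be modified to fit the task at hand.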

The architecture of GPT-1 was similar to the decoder of the original transformer, introduced in 2017, and it had 117 million parameters.

This first GPT model paved the way for future models with larger datasets and more parameters to take better advantage of the potential of the transformer architectures.

GPT-2

In early 2019, OpenAI proposed GPT-2, a scaled-up version of the GPT-1 model, increasing the number of parameters and the size of the training dataset tenfold.

The number of parameters of this new version was 1.5 billion, trained on 40 GB of text.

In November 2019, OpenAI released the full version of the GPT-2 language model.

GPT-2 is publicly available and can be downloaded from Hugging Face or GitHub.

GPT-2 showed that training a larger language model on a larger dataset improves its ability to understand tasks and allows it to outperform the state of the art on many of them.

GPT-3

GPT-3 was released by OpenAI in June 2020.

The main differences between GPT-2 and GPT-3 are the size of the model and the quantity of data used for the training.

GPT-3 is a much larger model, with 175 billion parameters, allowing it to capture more complex patterns.

In addition, GPT-3 is trained on a more extensive dataset.

This includes Common Crawl, a large web archive containing text from billions of web pages, as well as other sources such as Wikipedia.

This training dataset, which includes content from websites, books, and articles, allows GPT-3 to develop a deeper understanding of the language and context.

As a result, GPT-3 improved performance on a variety of linguistic tasks.

GPT-3 eliminates the need for a fine-tuning step that was mandatory for its predecessors.

NOTE

How GPT-3 eliminates the need for fine-tuning:

GPT-3 is trained on a massive amount of data, and it's much larger than GPT-1 and GPT-2 -- with 175 billion parameters.

Because of the scale, GPT-3 learns very strong general language skills during pre-training alone.


Instead of fine-tuning, GPT-3 uses:
  1. Zero-shot learning
    Just give it a task description in plain text -- no examples needed.
  2. One-shot learning
    Give it one example in the prompt to show what kind of answer you want.
  3. Few-shot learning
    Give it a few examples in the prompt, and it learns the pattern on the fly.

So in short:

GPT-3 doesn't need fine-tuning because it can understand and adapt to new tasks just by seeing a few examples in the input prompt --- thanks to its massive scale and powerful pre-training.
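
As an illustration (not an example from the book), here is what zero-shot and few-shot prompts can look like for a sentiment task, together with a hypothetical call through the OpenAI Python library (v1.x). The prompts and the model name are placeholders; the original GPT-3 base models have since been retired.

# Zero-shot vs. few-shot prompting: the "examples" live inside the prompt text,
# so the model's weights are never updated. Prompts and model name are illustrative.
from openai import OpenAI  # assumes the openai package (v1.x) and an OPENAI_API_KEY

zero_shot_prompt = (
    "Classify the sentiment of the review as Positive or Negative.\n"
    "Review: The battery dies after an hour.\n"
    "Sentiment:"
)

few_shot_prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: I love this phone, the screen is gorgeous.\nSentiment: Positive\n"
    "Review: Shipping took forever and the box was crushed.\nSentiment: Negative\n"
    "Review: The battery dies after an hour.\nSentiment:"
)

client = OpenAI()
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # placeholder model name
    prompt=few_shot_prompt,
    max_tokens=5,
    temperature=0,
)
print(response.choices[0].text.strip())  # expected answer: "Negative"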


GPT-3 is indeed capable of handling many tasks without traditional fine-tuning, but that doesn't mean it completely lacks support for or never uses fine-tuning.

GPT-3's default approach: Few-shot / Zero-shot Learning

What makes GPT-3 so impressive is that it can:

  • Perform tasks without retraining (fine-tuning)
  • Learn through prompts alone

Does GPT-3 support fine-tuning?

Yes! OpenAI eventually provided a fine-tuning API for GPT-3 (see the sketch after this list), which is useful in scenarios like:

  • When you have domain-specific data (e.g., legal, medical).

  • When you want the model to maintain a consistent tone or writing style.

  • When you need a stable and structured output format (e.g., JSON).

  • When prompt engineering isn't sufficient.
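
Here is a minimal sketch of that fine-tuning workflow, assuming the OpenAI Python library (v1.x). The file name, training records, and base model are placeholders; check the current OpenAI fine-tuning documentation for the supported models and data formats.

# Sketch of the OpenAI fine-tuning workflow (placeholders throughout).
from openai import OpenAI

client = OpenAI()

# 1. Prepare training data as a JSONL file, one example per line. For chat models
#    each line holds a short conversation; the legacy GPT-3 base models used
#    prompt/completion pairs instead. For example:
# {"messages": [{"role": "user", "content": "Summarize this contract clause: ..."},
#               {"role": "assistant", "content": "{\"summary\": \"...\"}"}]}

# 2. Upload the file.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),  # placeholder path
    purpose="fine-tune",
)

# 3. Launch a fine-tuning job on a base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder base model
)
print(job.id, job.status)

# 4. When the job succeeds, the fine-tuned model can be called by name,
#    just like any other model.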


To summarize:
  1. Does GPT-3 need fine-tuning?
    Usually no --- few-shot/zero-shot learning is enough for most tasks.
  2. Does GPT-3 support fine-tuning?
    Yes --- it is especially useful for domain-specific or high-requirement tasks.
