How to Get 100% Accuracy from Small Language Models Under 1B Parameters

In-Context Learning Is Underrated: ICL Is the Secret Key to Performance. Teach the AI to Say "I Don't Know" (Part 2)

Fabio Matricardi

Small language models on the road to 100% accuracy

Foreword: This article is aimed at developers who build applications on top of AI models, especially practitioners focused on Small Language Models (SLMs). Through a series of real-world task tests and practical tips, the author explores how to use In-Context Learning (ICL) so that lightweight models with far fewer than one billion parameters, sometimes only tens of millions, can reach near or even 100% accuracy on tasks they were never trained for.

The article aims to break the entrenched belief that only large models can deliver high performance, showing developers that with carefully designed prompts and examples, small models can exhibit strong reasoning and refusal capabilities, at lower compute cost, and are well suited to local or mobile deployment. For anyone who wants to build efficient, economical, and controllable AI systems, it is a highly practical reference.

There is a widespread (mis)conception that reaching high accuracy with language models requires enormous compute and huge datasets. That also means cost: cost that can be prohibitively high, and operations that are painfully complex to manage.

You have probably heard that "bigger is better", that only giant models with billions of parameters can deliver flawless results. But what if everything you have been told were actually a lie?

What if you could reach perfect accuracy with a model that takes up just 300 MB on your hard drive? A model so small it runs blazingly fast even on your phone?

Many researchers, developers, and companies are held back by these beliefs: maybe you too assume that only a heavy investment in large-scale models yields reliable results. That is also why so many people fall back on web chat services such as GPT-4o or Claude Sonnet.

Today I am going to show you how a 360-million-parameter model can reach 100% accuracy on a task it was never trained on. To be precise, a task that even instruction-tuned models fail at.

But not anymore.

Get ready... let's dive in!

The Anne Frank test: even larger models fail it. Infographic by the author

The Anne Frank Test, a.k.a. the "Real RAG Task"

As you may remember, for the past few weeks I have been trying to solve a major problem with small language models. They are good, but they are far too sensitive to the wording, and even the word order, of the prompt. Below one billion parameters, you are very likely to get meaningless or inconsistent replies.

My Anne Frank test is one example of this; I classify it as a real RAG (Retrieval-Augmented Generation) task. The idea is to ask the LLM to answer strictly from the provided context: for instance, I supply a passage about technology and then ask, "Who is Anne Frank?"

I expect the generative AI to reply that the question "cannot be answered", because the provided text contains no information about Anne Frank at all. The reality is that almost every LLM knows something about Anne Frank (Wikipedia is in their training datasets). As a result, very small language models consistently fail this test, which is a real pity, because question answering is probably one of the most useful things an LLM can do.

Video from the official GitHub repository

In-Context Learning Is Underrated

A few days ago I stumbled once again upon the astonishing performance of the RWKV models (see the video above). The developer Jellyfish042 trained remarkable RWKV-7 models (9M and 26M parameters) that can solve Othello puzzles using nothing but chain-of-thought reasoning.

So an ultra-small language model can solve complex tasks through in-context learning (ICL).

Then... why can't we do the same?

But first, let's get one concept straight: what exactly is "in-context learning" (ICL)?

What Is "In-Context Learning" (ICL)?

In-context learning (ICL) is a powerful technique that lets a language model perform a task based solely on the context supplied in the input prompt, with no additional training or fine-tuning.

ICL leverages the model's inherent grasp of language and context to infer and carry out the task described in the prompt. This approach lets even small models reach high accuracy: the emphasis is on how the model interprets and exploits the context, not merely on model size or massive datasets.

ICL works by embedding the task instructions and examples directly in the input prompt. This drastically reduces the need for large training corpora, extra fine-tuning, and compute.

ICL Has Long Been Proven to Work

An early but highly insightful paper explored the potential and implications of in-context learning:

📄 Language Models are Few-Shot Learners

The paper studies how large language models excel in few-shot learning scenarios, which is exactly what ICL builds on. The authors show that models like GPT-3 can understand and perform tasks from just a handful of examples in the input prompt, demonstrating strong in-context learning ability.

One sentence from the paper is worth pondering:

"Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches."

ICL Also Works for Small Language Models

While the paper focuses on large models like GPT-3, the principle of using context to perform a task applies just as well to much smaller models.

If you use ICL effectively, even small models can achieve astonishing results.

ICL in Action (Starting with an Example from the Literature)

Pretrained Transformer-based large language models (LLMs) have made remarkable progress on a wide range of NLP tasks. As these LLMs scale up, they acquire the ability of "in-context learning": given a few examples, the model can reach state-of-the-art (SOTA) or near-SOTA performance at inference time without any update to its parameters.

Below is an example input prompt for ICL on a sentiment analysis task:

Great movie. Positive.

The worst movie ever. Negative.

Can't wait to see the second movie!

The first two lines are demonstrations and the third line is the test input. We expect the LLM to continue the text with the correct label, "Positive".

Image generated with this method
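To make the idea tangible, here is a minimal sketch of how one might run that exact few-shot prompt against a small Hugging Face model in Python. The checkpoint name and generation settings are assumptions for illustration, not something prescribed by the article; any small causal LM would be queried the same way.

```python
# Minimal sketch: run the few-shot sentiment prompt against a small causal LM.
# The model id and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-360M-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Two demonstrations plus the test input, exactly as in the example above.
prompt = (
    "Great movie. Positive.\n"
    "The worst movie ever. Negative.\n"
    "Can't wait to see the second movie!"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)

# Decode only the continuation; we expect it to contain the label "Positive".
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) keeps the continuation deterministic, which makes it easier to check whether the model actually picked up the pattern.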

SmolLM2-360M-Instruct Is an Amazing Learner

I gave up on the idea of fine-tuning a month ago: even running a few epochs on the smaller SmolLM model (135 million parameters) requires a GPU with a fair amount of VRAM... which I don't have.

So I started testing the in-context learning (ICL) capabilities of these Hugging Face models. The first important finding was...

Models Below 360M Parameters Are Not Reliable

At least for the SmolLM2 family. I could not get SmolLM2-135M to learn the task through ICL.

But the 360-million-parameter model pulled it off. Here is how!

We will use a local Gradio chatbot as the interface, running a GGUF model served by a llama.cpp server. The whole setup is explained in detail in my previous article:

Note: the SmolLM2 family has a context window of 8192 tokens, which gives us plenty of room for automated fact-checking or a carefully curated Q&A section.
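For reference, this is a minimal sketch of how one might query such a llama.cpp server from Python once it is running. The address (the default localhost:8080), the sampling values, and the stop string are assumptions about a typical setup, not settings taken from the article; the `/completion` endpoint with its `prompt`/`n_predict` fields is llama.cpp's standard raw-completion API.

```python
# Minimal sketch: send a raw prompt to a local llama.cpp server and read the reply.
# Assumes the server is already running a SmolLM2-360M-Instruct GGUF on the
# default address; adjust host, port, and sampling to your own setup.
import requests

LLAMA_SERVER = "http://localhost:8080"  # assumed default address

def complete(prompt: str, max_tokens: int = 256, temperature: float = 0.1) -> str:
    """Return the text generated by the llama.cpp server for a raw prompt."""
    payload = {
        "prompt": prompt,
        "n_predict": max_tokens,     # llama.cpp's name for the generation budget
        "temperature": temperature,  # keep it low for grounded, repeatable answers
        "stop": ["question:"],       # stop before the model invents a new Q/A pair
    }
    response = requests.post(f"{LLAMA_SERVER}/completion", json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["content"]
```

The low temperature is a deliberate choice: for a "stick to the context" task we want the most likely grounded continuation, not creative variation.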

Provide the Right Examples

The key is to write the right number of examples, covering both cases, so the model can learn from the prompt. In our case we need some examples that answer the question directly, and some that answer "unanswerable".

Tip: you can even use a larger model (local or online) to tailor your prompt examples. For the Jack Reacher example, I used the new Qwen web app (you will see it in the prompt shortly). You can even have a large model generate the entire ICL prompt, examples included!

@Alibaba_Qwen recently launched its AI web app, with HTML rendering, file uploads, and more. It covers the main features of GPT-4o and Claude. If you don't have a powerful GPU, you often need a larger LLM to test candidate prompts, or to generate synthetic data for ICL. You can sign in for free and give it a try.

Use a Large Model to Prepare Your Examples

So the ICL prompt should contain few-shot examples like the following:

Task: reply to the user question using a provided context and

say "unanswerable" when the context does not have the information required.

Examples:

question: who is Jack Reacher?

context] Jack Reacher is a fictional character created by British author Lee Child. He is the protagonist of a popular series of thriller novels, which began with "Killing Floor" in 1997. Jack Reacher is a former Major in the United States Army Military Police Corps who now lives as a drifter and troubleshooter. He roams the United States without a permanent home or job, often finding himself involved in solving crimes or righting wrongs. Jack Reacher's stories are known for their suspenseful plots, action-packed sequences, and complex characters, making him one of the most iconic figures in contemporary thriller fiction.[end of context

Remember: If the answer is not contained in the text say "unanswerable"

and explain why you cannot answer.

answer: Jack Reacher is a fictional character created by British author

Lee Child. He is the protagonist of a popular series of thriller novels,

which began with "Killing Floor" in 1997. Jack Reacher is a former Major

in the United States Army Military Police Corps. He roams the United States

often finding himself involved in solving crimes or righting wrongs.

Jack Reacher is considered one of the most iconic figures in contemporary

thriller fiction.

A few things I want to draw your attention to:

  1. The order of question / context / answer: small language models are extremely sensitive to word order, and tiny changes often lead to completely different results.
  2. Decoder-only models are autoregressive during generation: they cannot "look back", so within the semantic space of the prompt it is better to state the question up front, so that the attention mechanism works properly and does not lose the question while digesting the context.
  3. Use start/end tags: they may not be necessary for large models (above 3 billion parameters), but for small language models, tags that mark where a section begins and ends are a key point. Even more so for ICL, because we are teaching the model to recognize a pattern! (A minimal prompt-builder sketch that enforces these three points follows right after this list.)
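To keep those three points consistent across every shot, you can assemble the prompt programmatically. The helper below is my own illustrative sketch, not code from the article: it enforces the question, context, answer order, the `context] ... [end of context` tags, and the repeated "unanswerable" reminder used in the examples above.

```python
# Sketch of a few-shot ICL prompt builder that enforces the ordering and the
# start/end tags discussed above. Function names are illustrative assumptions.
TASK = (
    'Task: reply to the user question using a provided context and say '
    '"unanswerable" when the context does not have the information required.'
)
REMINDER = (
    'Remember: If the answer is not contained in the text say "unanswerable" '
    "and explain why you cannot answer."
)

def format_shot(question: str, context: str, answer: str | None = None) -> str:
    """One example in the fixed question -> context -> reminder -> answer order."""
    block = (
        f"question: {question}\n"
        f"context] {context} [end of context\n"
        f"{REMINDER}\n"
        f"answer:"
    )
    # Demonstrations carry the gold answer; the live query stops at "answer:"
    # so the model has to produce the continuation itself.
    return block + (f" {answer}" if answer is not None else "")

def build_prompt(shots: list[tuple[str, str, str]], question: str, context: str) -> str:
    """Full ICL prompt: task, demonstrations (answerable and not), then the query."""
    examples = "\n\n".join(format_shot(q, c, a) for q, c, a in shots)
    query = format_shot(question, context)
    return f"{TASK}\n\nExamples:\n\n{examples}\n\nNow reply to the user question:\n\n{query}"
```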

Provide Enough Examples

A single example is not enough to teach the model what to do. If we only provide "unanswerable" examples, the model will assume the task is to reply "unanswerable". If we only provide positive examples, we haven't taught the model how to perform the task.

question: what is Science?

context] Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few input-output examples ("shots") provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. We find that both Reinforced and Unsupervised ICL can be effective in the many-shot regime, particularly on complex reasoning tasks. Furthermore, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning.[end of context

Remember: If the answer is not contained in the text say "unanswerable"

and explain why you cannot answer.

answer: unanswerable. The provided text discusses Large Language Models (LLMs)

and their capabilities, but it does not provide any information about the

definition of science.

Here is another example, this time with "unanswerable" as the expected reply. Since we want to teach the model properly, my take is that four examples are enough!

From the context the model will learn two instances of each possible outcome: a valid answer grounded in the context, and an "unanswerable" reply accompanied by an explanation of why it cannot answer.

Task: reply to the user question using a provided context and

say "unanswerable" when the context does not have the information required.

Examples:

question: who is Jack Reacher?

context] Jack Reacher is a fictional character created by British author Lee Child. He is the protagonist of a popular series of thriller novels, which began with "Killing Floor" in 1997. Jack Reacher is a former Major in the United States Army Military Police Corps who now lives as a drifter and troubleshooter. He roams the United States without a permanent home or job, often finding himself involved in solving crimes or righting wrongs. Jack Reacher's stories are known for their suspenseful plots, action-packed sequences, and complex characters, making him one of the most iconic figures in contemporary thriller fiction.[end of context

Remember: If the answer is not contained in the text say "unanswerable"

and explain why you cannot answer.

answer: Jack Reacher is a fictional character created by British author

Lee Child. He is the protagonist of a popular series of thriller novels,

which began with "Killing Floor" in 1997. Jack Reacher is a former Major

in the United States Army Military Police Corps. He roams the United States

often finding himself involved in solving crimes or righting wrongs.

Jack Reacher is considered one of the most iconic figures in contemporary

thriller fiction.

question: what is Science?

context] Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few input-output examples ("shots") provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. We find that both Reinforced and Unsupervised ICL can be effective in the many-shot regime, particularly on complex reasoning tasks. Furthermore, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning.[end of context

Remember: If the answer is not contained in the text say "unanswerable"

and explain why you cannot answer.

answer: unanswerable. The provided text discusses Large Language Models (LLMs)

and their capabilities, but it does not provide any information about the

definition of science.

<< here other 2 examples, and finally... >>

Now reply to the user question:

question: who is Anne Frank?

context] Albert Einstein is credited with formulating the law of universal gravitation. Einstein's law of universal gravitation states that every particle attracts every other particle in the universe with a force that is directly proportional to the product of their masses and inversely proportional to the square of the distance between their centers.[end of context

Remember: If the answer is not contained in the text say "unanswerable" and

explain why you cannot answer.

answer:

Unanswerable: we did it!

Here is the full prompt, with no line breaks (less readable, but better for the generative AI model...):

Task: reply to the user question using a provided context and say "unanswerable" when the context does not have the information required.

Examples:

question: who is Jack Reacher?

context] Jack Reacher is a fictional character created by British author Lee Child. He is the protagonist of a popular series of thriller novels, which began with "Killing Floor" in 1997. Jack Reacher is a former Major in the United States Army Military Police Corps who now lives as a drifter and troubleshooter. He roams the United States without a permanent home or job, often finding himself involved in solving crimes or righting wrongs. Jack Reacher's stories are known for their suspenseful plots, action-packed sequences, and complex characters, making him one of the most iconic figures in contemporary thriller fiction.[end of context

Remember: If the answer is not contained in the text say "unanswerable" and explain why you cannot answer.

answer: Jack Reacher is a fictional character created by British author Lee Child. He is the protagonist of a popular series of thriller novels, which began with "Killing Floor" in 1997. Jack Reacher is a former Major in the United States Army Military Police Corps. He roams the United States often finding himself involved in solving crimes or righting wrongs. Jack Reacher is considered one of the most iconic figures in contemporary thriller fiction.

question: what is Science?

context] Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few input-output examples ("shots") provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. We find that both Reinforced and Unsupervised ICL can be effective in the many-shot regime, particularly on complex reasoning tasks. Furthermore, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning.[end of context

Remember: If the answer is not contained in the text say "unanswerable" and explain why you cannot answer.

answer: unanswerable. The provided text discusses Large Language Models (LLMs) and their capabilities, but it does not provide any information about the definition of science.

question: what is skill-of-mind?

context] To increase social bonding with interlocutors, humans naturally acquire the ability to respond appropriately in a given situation by considering which conversational skill is most suitable for the response --- a process we call skill-of-mind.[end of context

Remember: If the answer is not contained in the text say "unanswerable" and explain why you cannot answer.

answer: Skill-of-mind refers to the natural human ability to respond appropriately in a given situation by selecting and utilizing the most suitable conversational skill.

question: what is Bill Clinton policy?

context] Today, I will share more details about Russia's collapsing currency. Russia is not headed for stagflation. They are on the path towards bankruptcy. The Bank of Russia will not buy any foreign currency until the end of the year. In August 2023, there was a similar announcement. Back then, the ruble was also worth less than one cent. The Bank of Russia is the only player in this market. India and China both refuse to take payments in rubles. These developments will continue to have adverse effects on Russian food prices, energy exports, and other areas of their economy. Russian imports will become more expensive, up and down the chain, and the inflation inside Russia will get much worse.[end of context

Remember: If the answer is not contained in the text say "unanswerable" and explain why you cannot answer.

answer: unanswerable. The provided text does not contain any information about Bill Clinton's policies. It focuses on Russia's economic situation.

Now reply to the user question:

question: who is Anne Frank?

context] Albert Einstein is credited with formulating the law of universal gravitation. Einstein's law of universal gravitation states that every particle attracts every other particle in the universe with a force that is directly proportional to the product of their masses and inversely proportional to the square of the distance between their centers.[end of context

Remember: If the answer is not contained in the text say "unanswerable" and

explain why you cannot answer.

answer:

Image from one of my previously published articles
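Putting the pieces together, the Anne Frank test itself can be automated: send the full ICL prompt to the model and simply check whether the reply begins with "unanswerable". The sketch below is my own simplification of that check; the server address, sampling values, and the idea of pasting the flattened prompt into a string (shortened here with "...") are assumptions about a typical local setup, not part of the article.

```python
# Sketch: automate the Anne Frank test against a local llama.cpp server.
# ICL_PROMPT should hold the full flattened prompt shown above; it is
# abbreviated here with "..." purely for readability.
import requests

ICL_PROMPT = (
    'Task: reply to the user question using a provided context and say "unanswerable" '
    "when the context does not have the information required.\n\nExamples:\n\n"
    "...\n\n"  # the four demonstrations from the article go here
    "Now reply to the user question:\n\n"
    "question: who is Anne Frank?\n"
    "context] Albert Einstein is credited with formulating the law of universal "
    "gravitation. ...[end of context\n"
    'Remember: If the answer is not contained in the text say "unanswerable" and '
    "explain why you cannot answer.\n"
    "answer:"
)

resp = requests.post(
    "http://localhost:8080/completion",  # assumed default llama.cpp server address
    json={"prompt": ICL_PROMPT, "n_predict": 128, "temperature": 0.1, "stop": ["question:"]},
    timeout=120,
)
reply = resp.json()["content"].strip()

# The test passes only when the model refuses instead of reciting what it
# already knows about Anne Frank from its pretraining data.
print("PASS" if reply.lower().startswith("unanswerable") else "FAIL", "->", reply)
```

Running the same check over a handful of question/context pairs is an easy way to turn the accuracy claim into a small, repeatable benchmark for your own model.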

One More Clarification: Stick to the Context

As I mentioned in my previous article, "Stick to the truth my dear AI", sticking to the context also means running into situations that look contradictory, just as Nikhil Anand expected in his story.

I tried a similar test with ICL, using the following prompt:

Now reply to the user question:

context] Albert Einstein invented gravity in 1905.[end of context

question: who invented gravity?

answer:

And well, even without touching logits or PyTorch, I confirmed that a 360-million-parameter model cannot comply with this kind of request: DPO and model alignment really are strong. I tried Llama-SmolTalk-3.2-1B-Instruct.Q6_K.gguf (fine-tuned on the same dataset as the SmolLM2 family...), and this is the reply I got:

It looks like it cannot lie - Llama-SmolTalk-3.2-1B-Instruct.Q6_K.gguf

Even qwen2.5-1.5b-instruct cannot do it.

qwen2.5-1.5b-instruct truth test

But a slightly larger model, such as Gemma-2-2B, can.

gemma-2-2b-it-Q5_K_M.gguf in action

To sum up: even a small model like SmolLM2-360M-Instruct can learn through in-context learning, but it still cannot completely bypass common sense and its DPO alignment.

At least now you know.

Conclusion

In-context learning is a very powerful technique: most of the latest small language models have a context window of at least 8k tokens. That means that, beyond the roughly 400 tokens taken up by the ICL prompt, you still have about 7,000 tokens left for the context and the Q&A itself.
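If you want to verify that budget for your own prompt rather than guess, the llama.cpp server exposes a `/tokenize` endpoint that returns the token ids of a string using the loaded model's tokenizer. The sketch below assumes the same local server as before, and that the flattened ICL prompt has been saved to a file (the file name is an illustrative assumption).

```python
# Sketch: measure how much of the 8192-token window the ICL prompt consumes,
# using the llama.cpp server's /tokenize endpoint on an assumed local server.
import requests

LLAMA_SERVER = "http://localhost:8080"
CONTEXT_WINDOW = 8192  # SmolLM2 context size quoted above

def count_tokens(text: str) -> int:
    """Tokenize text with the model loaded in the running llama.cpp server."""
    r = requests.post(f"{LLAMA_SERVER}/tokenize", json={"content": text}, timeout=60)
    r.raise_for_status()
    return len(r.json()["tokens"])

icl_prompt = open("icl_prompt.txt", encoding="utf-8").read()  # assumed file with the full prompt
used = count_tokens(icl_prompt)
print(f"ICL prompt: {used} tokens, {CONTEXT_WINDOW - used} left for context and Q&A")
```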

The small text-generation models of the Llama-3.2 family even stretch the context window to 128k tokens.

But as you have seen, a huge context capacity does not always mean the model will be 100% accurate.
