Using guidance to improve the reliability and stability of LLM output

This article was first published on the blog LLM 应用开发实践 (LLM Application Development in Practice).

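In complex LLM application development, especially when it involves workflow orchestration and multiple LLM calls, the prompt for each step depends on the model output of the previous step. Keeping the LLM from "making things up", and thereby improving the reliability and stability of its output, becomes a challenging problem. While building an application I came across guidance, an open-source project from Microsoft, which handles this tedious problem well. This article describes it in detail.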

Scenario

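First, the problem I actually ran into. I am building a small tool that corrects the content of popular-science videos. The rough workflow is to extract key concepts from a video and cross-check them against Wikipedia: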

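Parse the subtitles of the science video
Have the LLM analyze whether any segments contain incorrect science
Extract the relevant concepts from the context of the incorrect segments
Query Wikipedia to correct them
Generate a corrective article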

Below is an example of part of my prompt design:

Please act as an encyclopaedic expert covering the fields of physics, mathematics, chemistry and biology. The captioned content of a science video will be provided below. Please ensure that you fully understand the content of the video and then correct any scientific errors in it from a professional point of view. The content of the subtitles of the video to be analysed is as follows:```{context}```
Your return must be in the specified json format, with the special character backslash \ escaped, always make sure that the json format cannot be wrong, and the content must be in English, like the following:
{
  "Misconception 1": "Relevant context error content 1",
  "Misconception 2": "Relevant context error content 2",
  ...
}

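The model's reply is first parsed as JSON. If parsing fails, the request is sent again (up to 3 retries). If parsing succeeds, the result is converted to a dictionary and iterated over, and each erroneous fragment is combined with the Wikipedia search results into a new prompt that asks the LLM to write a corrective article.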
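
The parse-and-retry step described above can be sketched roughly as follows. This is a minimal illustration, not the code of the actual tool: `parse_with_retries` and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real model call so that the example runs on its own.

```python
import json
from typing import Callable, Optional


def parse_with_retries(ask_llm: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> Optional[dict]:
    """Call the model up to max_attempts times until the reply parses as a JSON object."""
    for _ in range(max_attempts):
        reply = ask_llm(prompt)
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed JSON: send the same full prompt again
        if isinstance(parsed, dict):
            return parsed
    return None  # every attempt failed


# Hypothetical stand-in for a real model: fails once, then returns valid JSON.
_replies = iter([
    "Sure! Here is the JSON:",
    '{"Misconception 1": "The Sun orbits the Earth"}',
])


def fake_llm(prompt: str) -> str:
    return next(_replies)


misconceptions = parse_with_retries(fake_llm, "prompt text goes here")
```

Note that every failed attempt resends the full prompt, which is exactly where the token cost of this approach comes from.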

The problem

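Even though the prompt stresses that the model must return JSON, in actual calls about 20% of the replies are still not valid JSON. Retrying is the only workaround, but every retry consumes a large number of tokens again, so it is not a viable solution. I therefore wondered whether I could build a tool that wraps the process above (checking the result and generating an error message) and, when an error occurs, tells the LLM only about the part that failed to parse (saving tokens) before the next generation, repeating until the output meets the requirements. Then I found guidance, which fits this need very well. The rest of this article introduces it in detail.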
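
To make the "send back only the failing part" idea concrete, here is a small sketch that builds a short repair prompt from a `json.JSONDecodeError`. The function name `build_repair_prompt`, the wording of the prompt, and the 40-character window are all assumptions made for illustration; this is not how guidance itself works.

```python
import json


def build_repair_prompt(bad_reply: str, error: json.JSONDecodeError,
                        window: int = 40) -> str:
    """Describe only the region around the parse error instead of resending the whole context."""
    start = max(0, error.pos - window)
    end = min(len(bad_reply), error.pos + window)
    snippet = bad_reply[start:end]
    return (
        "Your previous reply was not valid JSON.\n"
        f"Parser error: {error.msg} (line {error.lineno}, column {error.colno}).\n"
        f"Text near the error: {snippet!r}\n"
        "Return the corrected JSON object only."
    )


# A reply with a missing value after the second key.
bad_reply = '{"Misconception 1": "Heavier objects fall faster", "Misconception 2": }'

try:
    json.loads(bad_reply)
    repair_prompt = None
except json.JSONDecodeError as exc:
    repair_prompt = build_repair_prompt(bad_reply, exc)
```

The repair prompt carries the parser message and a short excerpt rather than the full subtitle text, so a follow-up request would be much smaller than the original one.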

guidance

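guidance is a Python library that lets you control and use large language models (such as GPT and BART) more effectively than plain prompting or chaining. It offers a simple, intuitive syntax based on Handlebars templates and rich output structures, with function calls, conditional logic, and control flow. Its main capabilities and advantages include: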

Simplified output-structure design

Template syntax lets you express a variety of output-structure logic:
```
{{#if}}...{{else}}...{{/if}}
{{#each}}...{{/each}}
```
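Inserting generated text (when the gen keyword is reached, guidance sends a request to the LLM, and once the response arrives it continues parsing the syntax tree):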
```
{{gen "变量名"}}
```

Selecting the best option:

{{#select "变量名"}}选项1{{or}}选项2{{/select}}

Inference acceleration

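Compared with generating everything in a single pass, guidance can automatically cache results that have already been generated, which improves speed.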
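
As a rough illustration of why caching helps, the sketch below memoizes completions by prompt so that a repeated prompt does not trigger a second model call. It is only an analogy written for this article: `CachedLLM` and `fake_complete` are hypothetical names, and guidance's own caching is internal to the library and is not reproduced here.

```python
from typing import Callable, Dict


class CachedLLM:
    """Wrap a text-completion callable and reuse results for prompts seen before."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self._complete = complete
        self._cache: Dict[str, str] = {}
        self.calls = 0  # number of real model calls made

    def __call__(self, prompt: str) -> str:
        if prompt not in self._cache:
            self.calls += 1
            self._cache[prompt] = self._complete(prompt)
        return self._cache[prompt]


# Hypothetical stand-in for a real model call.
def fake_complete(prompt: str) -> str:
    return prompt.upper()


llm = CachedLLM(fake_complete)
first = llm("name a weapon")
second = llm("name a weapon")  # served from the cache, no second model call
```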

Chat dialogue support

{{#user}}...{{/user}}
{{#assistant}}...{{/assistant}}

Guaranteed output syntax

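guidance can use regular expressions to guide the language model so that the generated text is guaranteed to be syntactically correct, for example when generating a JSON object: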
```
{
"name": "{{gen "name"}}",
"age": "{{gen "age"}}"
}
```
Eliminating token boundary effects

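Token boundary effects arise because the prompt is tokenized separately from the text the model goes on to generate, so the last token of the prompt can push the model toward unintended output. guidance removes this effect with a technique called "token healing", which can be enabled with `{{gen token_healing=True}}`.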
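
The toy example below sketches the idea behind token healing with a made-up six-token vocabulary and a greedy longest-match tokenizer, both of which are assumptions for illustration only. A prompt ending in `http:` is tokenized as `http` + `:`, even though a model would normally have seen `://` as a single token. Healing backs up over the last prompt token and restricts the first generated token to those that begin with it.

```python
from typing import List, Tuple

# Made-up vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ["http", ":", "://", "//", "www", "."]


def greedy_tokenize(text: str, vocab: List[str]) -> List[str]:
    """Split text by always taking the longest vocabulary entry that matches."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = next((tok for tok in by_length if text.startswith(tok, i)), None)
        if match is None:
            raise ValueError(f"cannot tokenize text at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens


def heal(prompt_tokens: List[str], vocab: List[str]) -> Tuple[List[str], List[str]]:
    """Drop the last prompt token and list the tokens allowed as the first generated token."""
    *kept, last = prompt_tokens
    allowed = [tok for tok in vocab if tok.startswith(last)]
    return kept, allowed


prompt_tokens = greedy_tokenize("http:", VOCAB)
kept, allowed_first_tokens = heal(prompt_tokens, VOCAB)
```

After healing, the model continues from `http` and may pick `://` as its first token, instead of being forced to continue after a lone `:`.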

Hugging Face Transformers integration

import guidance

# load a local Hugging Face Transformers model and make it the default LLM
guidance.llm = guidance.llms.Transformers("gpt2")

Real-time streaming

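guidance has a well-defined linear execution order that corresponds directly to the order in which the LLM processes tokens. At any point during execution, the LLM can be used to generate text (reaching a `{{gen}}` command triggers an LLM generation) or to make logical control-flow decisions. This allows precise design of the output structure and yields clear, parseable results.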

import guidance
guidance.llm = guidance.llms.OpenAI("text-davinci-003")

program = guidance("""Tweak this proverb to apply to model instructions instead.

{{proverb}}
- {{book}} {{chapter}}:{{verse}}

UPDATED
Where there is no guidance{{gen 'rewrite' stop="\\n-"}}
- GPT {{#select 'chapter'}}9{{or}}10{{or}}11{{/select}}:{{gen 'verse'}}""")

executed_program = program(
    proverb="Where there is no guidance, a people falls,\nbut in an abundance of counselors there is safety.",
    book="Proverbs",
    chapter=11,
    verse=14
)

After the program has run, all generated variables are easy to access:

>>> executed_program["rewrite"]
', a model fails,\nbut in an abundance of instructions there is safety.'

Chat dialogue mode

Through a unified API based on role tags (such as `{{#system}}...{{/system}}`), guidance supports API-based chat models such as GPT-4 as well as open-source chat models such as Vicuna.

gpt4 = guidance.llms.OpenAI("gpt-4")
# vicuna = guidance.llms.transformers.Vicuna("your_path/vicuna_13B", device_map="auto")
experts = guidance('''
{{#system~}}
You are a helpful and terse assistant.
{{~/system}}

{{#user~}}
I want a response to the following question:
{{query}}
Name 3 world-class experts (past or present) who would be great at answering this?
Don't answer the question yet.
{{~/user}}

{{#assistant~}}
{{gen 'expert_names' temperature=0 max_tokens=300}}
{{~/assistant}}

{{#user~}}
Great, now please answer the question as if these experts had collaborated in writing a joint anonymous answer.
{{~/user}}

{{#assistant~}}
{{gen 'answer' temperature=0 max_tokens=500}}
{{~/assistant}}
''', llm=gpt4)

experts(query='How can I be more productive?')

Accelerating inference

# we use LLaMA here, but any GPT-style model will do
llama = guidance.llms.Transformers("your_path/llama-7b", device=0)

# we can pre-define valid option sets
valid_weapons = ["sword", "axe", "mace", "spear", "bow", "crossbow"]

# define the prompt
character_maker = guidance("""The following is a character profile for an RPG game in JSON format.
```json
{
    "id": "{{id}}",
    "description": "{{description}}",
    "name": "{{gen 'name'}}",
    "age": {{gen 'age' pattern='[0-9]+' stop=','}},
    "armor": "{{#select 'armor'}}leather{{or}}chainmail{{or}}plate{{/select}}",
    "weapon": "{{select 'weapon' options=valid_weapons}}",
    "class": "{{gen 'class'}}",
    "mantra": "{{gen 'mantra' temperature=0.7}}",
    "strength": {{gen 'strength' pattern='[0-9]+' stop=','}},
    "items": [{{#geneach 'items' num_iterations=5 join=', '}}"{{gen 'this' temperature=0.7}}"{{/geneach}}]
}```""")

# generate a character
character_maker(
    id="e1f491f7-7ab8-4dac-8c20-c92b5e7d883d",
    description="A quick and nimble fighter.",
    valid_weapons=valid_weapons,
    llm=llama
)

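With my earlier approach, the entire JSON had to be generated by the LLM. guidance's idea is that, since the JSON structure is defined in advance, the field names, braces, and so on do not need to be generated by the LLM at all. In the rendered output of this example, the blue parts are the variables passed in (such as `id` and `description`), and only the green parts (the `gen` and `select` slots) are actually produced by calling the LLM. On the one hand this keeps the structure of the generated JSON under control, with no format errors or missing fields. On the other hand the LLM generates fewer tokens, which saves cost and speeds up inference.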
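
The principle can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not guidance's implementation: `run_template` and `fake_generate` are hypothetical names, only a bare `{{gen 'name'}}` slot is recognized, and `fake_generate` stands in for a real model call. The fixed JSON structure is copied from the template, and only the slot values come from the "model".

```python
import json
import re
from typing import Callable, Dict, Tuple

# Matches a bare slot such as {{gen 'name'}} (no extra arguments supported here).
SLOT = re.compile(r"\{\{gen '(\w+)'\}\}")


def run_template(template: str,
                 generate: Callable[[str, str], str]) -> Tuple[str, Dict[str, str]]:
    """Copy fixed text verbatim and call generate() only at each gen slot."""
    output = ""
    variables: Dict[str, str] = {}
    cursor = 0
    for slot in SLOT.finditer(template):
        output += template[cursor:slot.start()]  # fixed structure, never generated
        name = slot.group(1)
        value = generate(output, name)           # the only place the model is called
        variables[name] = value
        output += json.dumps(value)[1:-1]        # escape quotes and backslashes
        cursor = slot.end()
    output += template[cursor:]
    return output, variables


TEMPLATE = """{
    "name": "{{gen 'name'}}",
    "class": "{{gen 'class'}}"
}"""


# Hypothetical stand-in for a model; it receives the text so far and the slot name.
def fake_generate(prefix: str, name: str) -> str:
    canned = {"name": 'Kara "Swift" Vale', "class": "rogue"}
    return canned[name]


text, variables = run_template(TEMPLATE, fake_generate)
profile = json.loads(text)  # always parses, because the structure comes from the template
```

Even when a generated value contains quotation marks, the result still parses, because the braces, keys, and escaping are handled outside the model.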

Recap

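guidance is essentially a domain-specific language (DSL) for handling interactions with large language models. Like query languages for LLMs, it aims to reduce the cost of LLM interaction. It speeds up inference and ensures that the generated JSON is always valid, which effectively improves the reliability and stability of LLM output.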

More content is available on the WeChat official account LLM 应用全栈开发.