Output parsers in LangChain

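Because generation from a language model is hard to control, its natural-language output is unstructured text. Prompts let users constrain and standardize the format of the text they expect. LangChain's output parsers module relies on exactly this prompt technique to get the model to produce the structured text you want.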

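LangChain provides seven output parsers: List parser, Datetime parser, Enum parser, Pydantic (JSON) parser, Structured output parser, Retry parser, and Auto-fixing parser. The first four handle common format conversions, the Structured output parser handles output with multiple fields, and the last two are repair mechanisms for when format conversion fails.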

List parser

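To implement an output parser, you only need two required methods and one optional method. The two required methods are a format-instructions method and a parsing method. As the names suggest, the format instructions are the prompt text that tells the model which output format is expected, usually as a string; the parsing method converts the string returned by the model into the final output type. The optional method is parse_with_prompt, which takes both the model output and the prompt so they can be sent back to the model to repair the output into the desired structure. It is typically used by the Retry parser.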
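The two required methods and the optional one can be sketched as a small abstract base class. This is a minimal, standard-library-only illustration and not LangChain's actual base class: `SketchOutputParser` and `YesNoParser` are hypothetical names, and the default `parse_with_prompt` here simply ignores the prompt.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")


class SketchOutputParser(ABC, Generic[T]):
    """Hypothetical stand-in for an output parser interface (not LangChain's class)."""

    @abstractmethod
    def get_format_instructions(self) -> str:
        """Return the text that tells the model which output format to use."""

    @abstractmethod
    def parse(self, text: str) -> T:
        """Turn the raw model output into the target type."""

    def parse_with_prompt(self, completion: str, prompt: str) -> T:
        """Optional hook: parse with access to the original prompt.

        This default ignores the prompt; a retry-style parser would override it.
        """
        return self.parse(completion)


class YesNoParser(SketchOutputParser[bool]):
    """Toy parser that maps a yes/no answer to a bool."""

    def get_format_instructions(self) -> str:
        return "Answer with a single word: `yes` or `no`."

    def parse(self, text: str) -> bool:
        cleaned = text.strip().lower()
        if cleaned not in {"yes", "no"}:
            raise ValueError(f"Expected 'yes' or 'no', got {text!r}")
        return cleaned == "yes"


parser = YesNoParser()
print(parser.get_format_instructions())
print(parser.parse(" Yes \n"))
```

Raising an exception from `parse` on malformed output is what gives a repair-style parser something to react to.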

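Take the list parser as an example:

Its format-instructions prompt is shown below. It follows the classic pattern of a format description plus an example.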

from langchain.output_parsers import CommaSeparatedListOutputParser

output_parser = CommaSeparatedListOutputParser()

format_instructions = output_parser.get_format_instructions()

print(format_instructions)

>>>Your response should be a list of comma separated values, eg: `foo, bar, baz`

The parsing method is even simpler:

def parse(self, text: str) -> List[str]:
    """Parse the output of an LLM call."""
    return text.strip().split(", ")
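To see what this split does in practice, here is a small standalone sketch. `parse_strict` mirrors the method above as a plain function; `parse_tolerant` is a hypothetical variant added only for comparison and is not part of LangChain.

```python
from typing import List


def parse_strict(text: str) -> List[str]:
    """Same logic as the parse method above: split on comma + space."""
    return text.strip().split(", ")


def parse_tolerant(text: str) -> List[str]:
    """Hypothetical variant: split on commas, then trim each item."""
    return [item.strip() for item in text.split(",") if item.strip()]


print(parse_strict("foo, bar, baz"))    # ['foo', 'bar', 'baz']
print(parse_strict("foo,bar,baz"))      # ['foo,bar,baz']
print(parse_tolerant("foo,bar , baz"))  # ['foo', 'bar', 'baz']
```

The strict version only works when the model follows the `foo, bar, baz` example exactly, which is why the format instructions include that example.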

Structured output parser

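The Structured output parser can return multiple fields. Unlike the Pydantic (JSON) parser, which describes the injected instructions with a data structure, it builds the instructions from plain-text field descriptions.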

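An example follows. The field descriptions are injected into the prompt to generate the format instructions, and the model is expected to return a standard JSON object.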

from langchain.output_parsers import StructuredOutputParser, ResponseSchema
response_schemas = [
    ResponseSchema(name="answer", description="answer to the user's question"),
    ResponseSchema(name="source", description="source used to answer the user's question, should be a website.")
]

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

>>>The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"answer": string  // answer to the user's question
	"source": string  // source used to answer the user's question, should be a website.
}
```
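How field descriptions can turn into instruction text like the one printed above may be easier to see in a standalone sketch. This is an assumption-laden illustration, not LangChain's implementation: `FieldSpec` and `build_format_instructions` are hypothetical names, and every field is treated as a string.

```python
from typing import List, NamedTuple

FENCE = "`" * 3  # three backticks, built programmatically for readability


class FieldSpec(NamedTuple):
    """Hypothetical stand-in for ResponseSchema: a field name plus description."""

    name: str
    description: str


def build_format_instructions(fields: List[FieldSpec]) -> str:
    """Render one schema line per field and wrap them in a json snippet."""
    lines = [f'\t"{f.name}": string  // {f.description}' for f in fields]
    body = "\n".join(lines)
    return (
        "The output should be a markdown code snippet formatted in the "
        f'following schema, including the leading and trailing "{FENCE}json" '
        f'and "{FENCE}":\n\n'
        f"{FENCE}json\n{{\n{body}\n}}\n{FENCE}"
    )


fields = [
    FieldSpec("answer", "answer to the user's question"),
    FieldSpec("source", "source used to answer the user's question, should be a website."),
]
print(build_format_instructions(fields))
```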

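Its parse method goes the other way: it extracts the JSON from the markdown code snippet in the model output and loads it into a Python dict.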
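The parsing step for this kind of output can be sketched with the standard library alone. The helper name `parse_json_markdown`, the regular expression, and the sample model output are all assumptions made for illustration; LangChain's own parsing code may differ.

```python
import json
import re
from typing import Any, Dict, List

FENCE = "`" * 3  # three backticks, built programmatically for readability


def parse_json_markdown(text: str, expected_keys: List[str]) -> Dict[str, Any]:
    """Extract the JSON object from a fenced json snippet and check its keys."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("No fenced JSON snippet found in the model output")
    data = json.loads(match.group(1))
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"Missing expected keys: {missing}")
    return data


# Made-up model output that follows the format instructions.
model_output = (
    "Here is the result:\n"
    f"{FENCE}json\n"
    '{"answer": "example answer", "source": "https://example.com"}\n'
    f"{FENCE}"
)
result = parse_json_markdown(model_output, ["answer", "source"])
print(result["answer"])
```

Checking for the expected keys means a syntactically valid but incomplete JSON object is also reported as a parsing failure.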

Auto-fixing parser

The parsing described above assumes a fairly ideal result, so LangChain also provides a parser for repairs: when parsing fails, it calls another LLM to fix the output.

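Below is the prompt sent to the LLM for the fix. Here, instructions is the format prompt of the parser in use, which you can get from parser.get_format_instructions(); completion is the output of the first model call (the output that failed to parse); and error is the parsing error message.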

NAIVE_FIX = """Instructions:
--------------
{instructions}
--------------
Completion:
--------------
{completion}
--------------

Above, the Completion did not satisfy the constraints given in the Instructions.
Error:
--------------
{error}
--------------

Please try again. Please only respond with an answer that satisfies the constraints laid out in the Instructions:"""
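The control flow around this prompt can be sketched as follows. This is a simplified illustration under stated assumptions, not LangChain's implementation: `FIX_TEMPLATE` is a condensed version of the prompt above, `parse_int_list` is a toy base parser, and `fake_llm` is a hypothetical stand-in for a real model call.

```python
from typing import Callable, List

# Condensed version of the fix prompt shown above.
FIX_TEMPLATE = (
    "Instructions:\n{instructions}\n"
    "Completion:\n{completion}\n"
    "Error:\n{error}\n"
    "Please try again:"
)

FORMAT_INSTRUCTIONS = "Reply with integers separated by commas, eg: `1, 2, 3`"


def parse_int_list(text: str) -> List[int]:
    """Toy base parser: raises ValueError when an item is not an integer."""
    return [int(item) for item in text.split(",")]


def parse_with_fix(completion: str, llm: Callable[[str], str]) -> List[int]:
    """Try the base parser; on failure ask the LLM once for a corrected output."""
    try:
        return parse_int_list(completion)
    except ValueError as exc:
        fix_prompt = FIX_TEMPLATE.format(
            instructions=FORMAT_INSTRUCTIONS,
            completion=completion,
            error=str(exc),
        )
        # A second failure is not caught here and propagates to the caller.
        return parse_int_list(llm(fix_prompt))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always returns a valid list."""
    return "1, 2, 3"


print(parse_with_fix("one, two, three", fake_llm))  # [1, 2, 3]
```

Note that the fixing LLM only sees the format instructions, the failed completion, and the error, not the original question.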

Retry parser

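The Auto-fixing parser handles cases where only the format of the model output is wrong. When the problem goes beyond format and the content itself is also off, you need the Retry parser, which retries the repair through parse_with_prompt. It comes in two classes, RetryOutputParser and RetryWithErrorOutputParser. Both work from the LLM's output text and the original prompt; the difference is that the latter also passes the error description into the retry prompt.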

The RetryOutputParser prompt:

"""Prompt:
{prompt}
Completion:
{completion}

Above, the Completion did not satisfy the constraints given in the Prompt.
Please try again:"""

The RetryWithErrorOutputParser prompt:

"""Prompt:
{prompt}
Completion:
{completion}

Above, the Completion did not satisfy the constraints given in the Prompt.
Details: {error}
Please try again:"""
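Both retry variants can be sketched in one function. This is an illustrative approximation rather than LangChain's code: `parse_action`, `retry_parse`, `fake_llm`, and the `include_error` and `max_retries` parameters are hypothetical names chosen for this example.

```python
import json
from typing import Any, Callable, Dict

RETRY_TEMPLATE = (
    "Prompt:\n{prompt}\nCompletion:\n{completion}\n\n"
    "Above, the Completion did not satisfy the constraints given in the Prompt.\n"
    "Please try again:"
)
RETRY_WITH_ERROR_TEMPLATE = (
    "Prompt:\n{prompt}\nCompletion:\n{completion}\n\n"
    "Above, the Completion did not satisfy the constraints given in the Prompt.\n"
    "Details: {error}\nPlease try again:"
)


def parse_action(text: str) -> Dict[str, Any]:
    """Toy parser: the output must be JSON with both required keys."""
    data = json.loads(text)
    for key in ("action", "action_input"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    return data


def retry_parse(
    completion: str,
    prompt: str,
    llm: Callable[[str], str],
    include_error: bool = False,
    max_retries: int = 1,
) -> Dict[str, Any]:
    """Parse; on failure, resend prompt + completion (and optionally the error)."""
    for attempt in range(max_retries + 1):
        try:
            return parse_action(completion)
        except ValueError as exc:
            if attempt == max_retries:
                raise  # out of retries, surface the last error
            if include_error:
                retry_prompt = RETRY_WITH_ERROR_TEMPLATE.format(
                    prompt=prompt, completion=completion, error=exc
                )
            else:
                retry_prompt = RETRY_TEMPLATE.format(
                    prompt=prompt, completion=completion
                )
            completion = llm(retry_prompt)
    raise ValueError("max_retries must be >= 0")


def fake_llm(retry_prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return '{"action": "search", "action_input": "weather in Paris"}'


original_prompt = "Reply with a JSON object containing `action` and `action_input`."
bad_completion = '{"action": "search"}'  # well-formed JSON, but content is missing
print(retry_parse(bad_completion, original_prompt, fake_llm, include_error=True))
```

Because the retry prompt carries the original prompt, the model can regenerate missing content instead of only reformatting what it already produced.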

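Summary:

LangChain's strength lies in its abstracted modules, its integrated tool APIs, and its well-designed prompts. Parsers use these carefully designed prompts to make a generative model's output controllable, producing structured data that humans and programs can understand more easily.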