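A LangChain Development Tutorial Written for Myself (Part 2): Formatting Data, Extraction & Classification

I am writing this tutorial as I learn, so it may contain omissions or mistakes. If you read this article and spot a problem, please let me know!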

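1. Formatting Data

In real applications, the front end and back end exchange JSON data with a well-defined set of fields. We often want the LLM to turn unpredictable user input into a fixed structure that we have defined. This is a very common requirement, and we can meet it by combining LangChain with Zod or JSON Schema.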

a. Zod

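Zod is a validation library designed specifically for TypeScript applications.

If we want to format data by hand, the flow might look like this:

First, define a tool function, format_tool.
Then, let the model decide from the user's input whether the tool needs to be called for the conversion.
Finally, parse the model's output and convert it into the final JSON.

The LangChain JS SDK integrates deeply with Zod and provides a built-in withStructuredOutput API that makes it easy to format the input into the data structure we defined.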
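To make the three steps above concrete, here is a minimal sketch in plain TypeScript with no LangChain involved. Everything in it is an assumption made for illustration: the ToolCall shape, the fakeModel stand-in (a regular expression pretending to be a model), and the parseToolCall helper are hypothetical names, not part of any library.

```typescript
// Hypothetical shape of a tool call emitted by a chat model (not a LangChain type).
interface ToolCall {
  name: string
  args: string // JSON-encoded arguments
}

interface Person {
  name: string
  age: number
}

// Step 1: describe the tool so the model knows which structure to produce.
const formatTool = {
  name: 'format_tool',
  description: 'Convert free-form text about a person into structured data'
}

// Step 2: stand-in for the model. A real model decides by itself whether the
// input warrants a tool call; here a regular expression fakes that decision.
function fakeModel(input: string): ToolCall | null {
  const match = /([A-Za-z]+)\D*(\d+)/.exec(input)
  if (match === null) return null
  return {
    name: formatTool.name,
    args: JSON.stringify({ name: match[1], age: Number(match[2]) })
  }
}

// Step 3: parse the tool call and check the fields before trusting them.
function parseToolCall(call: ToolCall): Person {
  if (call.name !== formatTool.name) {
    throw new Error(`Unexpected tool: ${call.name}`)
  }
  const raw: unknown = JSON.parse(call.args)
  if (typeof raw !== 'object' || raw === null) {
    throw new Error('Tool arguments are not an object')
  }
  const { name, age } = raw as Record<string, unknown>
  if (typeof name !== 'string' || typeof age !== 'number') {
    throw new Error('Tool arguments do not match the Person shape')
  }
  return { name, age }
}

const call = fakeModel('他的姓名是tocka，今年26岁了')
const person = call ? parseToolCall(call) : null
console.log(person)
```

The point of the sketch is only to show how much bookkeeping sits between the raw model reply and a typed object; a helper that bundles these steps removes that boilerplate.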

import { ChatOllama } from '@langchain/ollama'
import { z } from 'zod'

const llm = new ChatOllama({
  model: 'qwen3:32b'
})

const personScheme = z.object({
  name: z.string().describe('The name of the person'),
  age: z.number().describe('The age of the person')
})

const structuredLLM = llm.withStructuredOutput(personScheme, {
  // Be sure to pass this option. It tells the model to reply with pure JSON;
  // without it the call may fail with OUTPUT_PARSING_FAILURE.
  method: 'json_mode'
})

const result = await structuredLLM.invoke('他的姓名是tocka，今年26岁了')
console.log(result)
// Output: { age: 26, name: 'tocka' }

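b. JSON Schema

JSON Schema is a format commonly used when the front end and back end need to share the same validation logic. Even without any external library, it is easy to read what a schema expresses. Using JSON Schema to format data in LangChain is slightly more work.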

import { ChatOllama } from '@langchain/ollama'
import { JsonOutputParser } from '@langchain/core/output_parsers'
import { ChatPromptTemplate } from '@langchain/core/prompts'

const llm = new ChatOllama({
  model: 'qwen3:32b'
})

const personScheme = {
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "description": "The name of the person.",
      "minLength": 1
    },
    "age": {
      "type": "number",
      "description": "The age of the person.",
      "minimum": 0
    }
  },
  "required": ["name", "age"]
}
const parser = new JsonOutputParser()

const promptTemplate = ChatPromptTemplate.fromTemplate(`
Extract the person information from the following text and return it as JSON according to this schema:

Schema: {schema}

Text: {text}

Please return only valid JSON that matches the schema.
`)

const prompt = await promptTemplate.invoke({
  // Serialize explicitly so the prompt contains the schema text itself
  schema: JSON.stringify(personScheme),
  text: '他的姓名是tocka，今年26岁了'
})

const output = await llm.invoke(prompt)
const result = await parser.invoke(output)

console.log(result)
// Output: { age: 26, name: 'tocka' }
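One thing worth keeping in mind with this approach: JsonOutputParser parses the reply as JSON, but as far as I can tell it does not check the result against personScheme, so the schema only acts as an instruction inside the prompt. Below is a minimal hand-rolled check, assuming only the keywords used in the schema above (type, required, minLength, minimum). The validate function and the two interface names are hypothetical helpers written for this sketch.

```typescript
// The small subset of JSON Schema used by personScheme (hypothetical types).
interface PropertySchema {
  type: 'string' | 'number'
  minLength?: number
  minimum?: number
}

interface ObjectSchema {
  type: 'object'
  properties: Record<string, PropertySchema>
  required: string[]
}

const personScheme: ObjectSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', minLength: 1 },
    age: { type: 'number', minimum: 0 }
  },
  required: ['name', 'age']
}

// Returns a list of human-readable problems; an empty list means the value passed.
function validate(schema: ObjectSchema, value: unknown): string[] {
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    return ['value is not an object']
  }
  const record = value as Record<string, unknown>
  const errors: string[] = []
  for (const key of schema.required) {
    if (!(key in record)) errors.push(`missing required property "${key}"`)
  }
  for (const [key, rule] of Object.entries(schema.properties)) {
    if (!(key in record)) continue
    const field = record[key]
    if (typeof field !== rule.type) {
      errors.push(`"${key}" should be a ${rule.type}`)
      continue
    }
    if (typeof field === 'string' && rule.minLength !== undefined && field.length < rule.minLength) {
      errors.push(`"${key}" is shorter than ${rule.minLength}`)
    }
    if (typeof field === 'number' && rule.minimum !== undefined && field < rule.minimum) {
      errors.push(`"${key}" is below ${rule.minimum}`)
    }
  }
  return errors
}

console.log(validate(personScheme, { name: 'tocka', age: 26 }))   // no errors
console.log(validate(personScheme, { name: 'tocka', age: '26' })) // reports that "age" should be a number
```

In a real project a dedicated JSON Schema validator would be the better choice; this sketch only shows where such a check would sit, right after parser.invoke.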

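LangChain implements a .pipe() method on every runnable component, which is used to build a chain. When we invoke the chain, it calls each component's invoke method in order and passes the return value on to the next component, so our code can be simplified: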

const chain = promptTemplate.pipe(llm).pipe(parser)
const result = await chain.invoke({
  // Serialize explicitly so the prompt contains the schema text itself
  schema: JSON.stringify(personScheme),
  text: '他的姓名是tocka，今年26岁了'
})

console.log(result)
// Output: { age: 26, name: 'tocka' }
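To illustrate the chaining behavior described above, here is a toy version of the idea. MiniRunnable is a hypothetical class written for this sketch and is not how LangChain implements its runnables; the "model" step simply returns a canned reply so the example runs without Ollama.

```typescript
// Hypothetical miniature of the chaining idea; not LangChain's actual classes.
class MiniRunnable<In, Out> {
  private readonly fn: (input: In) => Out | Promise<Out>

  constructor(fn: (input: In) => Out | Promise<Out>) {
    this.fn = fn
  }

  async invoke(input: In): Promise<Out> {
    return this.fn(input)
  }

  // pipe returns a new runnable whose invoke runs this step first,
  // then feeds the result into the next step.
  pipe<Next>(next: MiniRunnable<Out, Next>): MiniRunnable<In, Next> {
    return new MiniRunnable<In, Next>(async (input) => next.invoke(await this.invoke(input)))
  }
}

// Three stand-in steps mirroring prompt template, model, and parser.
const promptStep = new MiniRunnable<{ text: string }, string>(
  (vars) => `Extract the person from: ${vars.text}`
)
const fakeLlm = new MiniRunnable<string, string>(
  () => '{"name":"tocka","age":26}' // canned reply instead of a real model call
)
const parserStep = new MiniRunnable<string, unknown>((raw) => JSON.parse(raw))

const miniChain = promptStep.pipe(fakeLlm).pipe(parserStep)

miniChain
  .invoke({ text: '他的姓名是tocka，今年26岁了' })
  .then((parsed) => console.log(parsed))
```

Because each pipe call returns another runnable, a chain can itself be piped into further steps, which is what makes this composition style convenient.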

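2. Extraction & Classification

Among the use cases for formatting data, extracting content and classifying it are also very common needs. Imagine we manage a product that has many reviews, and each review is positive, neutral, or negative. That rating might come directly from a score, or it might have to be inferred from the meaning of what the user wrote. Now we need to turn each review, based on its content, into JSON data of the form { overall sentiment, advantages, disadvantages }. Assume the product is a digital watch.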

import { ChatOllama } from '@langchain/ollama'
import { JsonOutputParser } from '@langchain/core/output_parsers'
import { ChatPromptTemplate } from '@langchain/core/prompts'
import { z } from 'zod'

const llm = new ChatOllama({
  model: 'qwen3:32b'
})

// Use nullish so the model can leave a field as null (or omit it) instead of making something up
const schema = z.object({
  sentiment: z.enum(['positive', 'neutral', 'negative']).describe('评论对商品的总体评价倾向'),
  advantage: z.string().nullish().describe('评论提到的具体的优点'),
  disadvantage: z.string().nullish().describe('评论提到的具体的缺点'),
})

const data = [
  "这款手表的电池续航太棒了，充一次电能用好几天。",
  "手表外观很时尚，佩戴起来也很舒适，但心率监测功能似乎不太准。",
  "考虑到价格，这块表还不错，不过打开应用时有点慢。",
  "我喜欢它的设计，屏幕也很清晰。运动追踪功能非常详细。",
  "软件有很多bug，经常卡顿闪退。我对它的性能很失望。"
]

const structuredLLM = llm.withStructuredOutput(schema, {
  method: 'json_mode'
})

const prompt = ChatPromptTemplate.fromTemplate(`
    本商品是一款电子手表，以下输入是用户对它的评价，请从中提取商品的优点和缺点，并判断评价的整体倾向
    input: {commend}
  `)

const chain = prompt.pipe(structuredLLM)

const result = data.map((commend) => chain.invoke({ commend }))

console.log(await Promise.all(result))

// Output:
[
  {
    advantage: '电池续航太棒了，充一次电能用好几天',
    disadvantage: null,
    sentiment: 'positive'
  },
  {
    advantage: '手表外观时尚，佩戴舒适',
    disadvantage: '心率监测功能不太准',
    sentiment: 'neutral'
  },
  {
    advantage: '考虑到价格，这块表还不错',
    disadvantage: '打开应用时有点慢',
    sentiment: 'neutral'
  },
  {
    advantage: '设计、屏幕清晰、运动追踪功能详细',
    disadvantage: null,
    sentiment: 'positive'
  },
  { disadvantage: '软件有很多bug，经常卡顿闪退', sentiment: 'negative' }
]
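Once the reviews are structured, they become easy to aggregate. The sketch below takes the output shown above and tallies it; the Review type and the summarize function are hypothetical names introduced for this example. Note that the last result has no advantage key at all: a nullish field may be null or missing, so the code checks for both with != null.

```typescript
type Sentiment = 'positive' | 'neutral' | 'negative'

// Mirrors the Zod schema: nullish fields may be a string, null, or absent.
interface Review {
  sentiment: Sentiment
  advantage?: string | null
  disadvantage?: string | null
}

// The structured results printed above, copied as plain data.
const reviews: Review[] = [
  { advantage: '电池续航太棒了，充一次电能用好几天', disadvantage: null, sentiment: 'positive' },
  { advantage: '手表外观时尚，佩戴舒适', disadvantage: '心率监测功能不太准', sentiment: 'neutral' },
  { advantage: '考虑到价格，这块表还不错', disadvantage: '打开应用时有点慢', sentiment: 'neutral' },
  { advantage: '设计、屏幕清晰、运动追踪功能详细', disadvantage: null, sentiment: 'positive' },
  { disadvantage: '软件有很多bug，经常卡顿闪退', sentiment: 'negative' }
]

function summarize(items: Review[]) {
  const counts: Record<Sentiment, number> = { positive: 0, neutral: 0, negative: 0 }
  const advantages: string[] = []
  const disadvantages: string[] = []
  for (const item of items) {
    counts[item.sentiment] += 1
    // != null is true only when the field is neither null nor undefined
    if (item.advantage != null) advantages.push(item.advantage)
    if (item.disadvantage != null) disadvantages.push(item.disadvantage)
  }
  return { counts, advantages, disadvantages }
}

const summary = summarize(reviews)
console.log(summary.counts) // { positive: 2, neutral: 2, negative: 1 }
console.log(summary.advantages.length, summary.disadvantages.length) // 4 3
```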

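3. Summary

In this article we learned how to format data with Zod and JSON Schema. This material is fairly basic, but we will use it often when orchestrating flows, for example to build query parameters from a user's input.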