Evaluating Agents with LangSmith

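Evaluations are a quantitative way to measure the performance of large language model (LLM) applications. LLM behavior can be non-deterministic, and even a small change to a prompt, model, or input can significantly affect the results. Evaluations provide a structured way to identify failures, compare versions, and build more reliable AI applications.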

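Running an evaluation in LangSmith requires three key components:

Dataset: a set of test inputs, optionally with expected outputs.
Target function: the part of your application you want to test. This could be a single LLM call with a new prompt, one module, or the entire workflow.
Evaluators: functions that score the target function's outputs.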

This article uses the LangSmith SDK to walk you through a basic evaluation that checks the correctness of LLM responses.

Prerequisites

A LangSmith account: sign up or log in at smith.langchain.com.
A LangSmith API key: follow the guide on creating an API key.
An OpenAI API key: generate one from the OpenAI dashboard.

Install dependencies

pip install -U langsmith openevals openai

Set environment variables

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
export LANGSMITH_WORKSPACE_ID="<your-workspace-id>"
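
A missing variable usually surfaces later as an authentication error, so it can help to check the environment before running anything. The following is a minimal sketch, not part of the LangSmith SDK: `EXPECTED_VARS` and `missing_env_vars` are hypothetical names, and the list simply mirrors the four variables exported above.

```python
import os
from typing import Mapping

# The four variables exported in the shell snippet above.
EXPECTED_VARS = (
    "LANGSMITH_TRACING",
    "LANGSMITH_API_KEY",
    "OPENAI_API_KEY",
    "LANGSMITH_WORKSPACE_ID",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the expected variable names that are unset or blank in `env`."""
    return [name for name in EXPECTED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping, so the check does not depend on the real shell.
print(missing_env_vars({"OPENAI_API_KEY": "sk-test"}))
```

Calling `missing_env_vars()` with no argument checks the current process environment; an empty list means every expected variable is set.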

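Create a dataset

Create a file and add the following code, which does the following:

Imports Client to connect to LangSmith.
Creates a dataset.
Defines example inputs and outputs.
Associates the input/output pairs with the dataset in LangSmith so they can be used in evaluations.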

# dataset.py
from langsmith import Client

def main():
    client = Client()

    # Programmatically create a dataset in LangSmith
    dataset = client.create_dataset(
        dataset_name="Sample dataset",
        description="A sample dataset in LangSmith."
    )

    # Create examples
    examples = [
        {
            "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
            "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
        },
        {
            "inputs": {"question": "What is Earth's lowest point?"},
            "outputs": {"answer": "Earth's lowest point is The Dead Sea."},
        },
    ]

    # Add examples to the dataset
    client.create_examples(dataset_id=dataset.id, examples=examples)
    print("Created dataset:", dataset.name)

if __name__ == "__main__":
    main()

Run the file to create the dataset:

python dataset.py
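
Running dataset.py a second time asks LangSmith to create a dataset whose name is already taken, which is likely to fail with a conflict error. One way around this is to look the dataset up first. The sketch below assumes the `has_dataset` and `read_dataset` methods of `langsmith.Client`; `get_or_create_dataset` is a hypothetical helper, and the client is passed in as a parameter so the logic can be exercised without network access.

```python
from typing import Any


def get_or_create_dataset(client: Any, name: str, description: str = "") -> tuple[Any, bool]:
    """Return (dataset, created).

    `client` is expected to behave like langsmith.Client (assumption):
    it must provide has_dataset, read_dataset and create_dataset.
    """
    if client.has_dataset(dataset_name=name):
        # The dataset already exists: reuse it instead of creating a duplicate.
        return client.read_dataset(dataset_name=name), False
    return client.create_dataset(dataset_name=name, description=description), True


# Usage inside dataset.py (requires network access and LANGSMITH_API_KEY):
#     dataset, created = get_or_create_dataset(client, "Sample dataset")
#     if created:
#         client.create_examples(dataset_id=dataset.id, examples=examples)
```

Adding examples only when `created` is true keeps repeated runs from appending the same two examples again.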

Create a target function

Define a target function that contains what you want to evaluate. In this article, the target function makes a single LLM call to answer a question.

# eval.py
from langsmith import Client, wrappers
from openai import OpenAI

# Wrap the OpenAI client for LangSmith tracing
openai_client = wrappers.wrap_openai(OpenAI())

# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content.strip()}
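
Before spending API credits, it can be useful to check the target function's input and output shape offline. The sketch below is one possible approach rather than anything LangSmith requires: `make_target` and `FakeChatClient` are hypothetical names, and the fake mimics only the attributes the target touches (`chat.completions.create` and `choices[0].message.content`). It also guards against `content` being `None`, which the OpenAI SDK allows.

```python
from types import SimpleNamespace
from typing import Any, Callable

SYSTEM_PROMPT = "Answer the following question accurately"


def make_target(chat_client: Any, model: str = "gpt-5-mini") -> Callable[[dict], dict]:
    """Build a target function bound to `chat_client` (hypothetical helper)."""

    def target(inputs: dict) -> dict:
        response = chat_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": inputs["question"]},
            ],
        )
        # message.content may be None, so fall back to "" before calling strip().
        content = response.choices[0].message.content or ""
        return {"answer": content.strip()}

    return target


class FakeChatClient:
    """Offline stand-in that always returns the same canned reply."""

    def __init__(self, reply):
        self._reply = reply
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    def _create(self, model: str, messages: list) -> SimpleNamespace:
        message = SimpleNamespace(content=self._reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


offline_target = make_target(FakeChatClient("  Tanzania.  "))
print(offline_target({"question": "Which country is Mount Kilimanjaro located in?"}))
# prints {'answer': 'Tanzania.'}
```

For the real run, the wrapped client from the listing above can be passed instead, for example `make_target(openai_client)`.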

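Define an evaluator

In this step, you tell LangSmith how to score the answers your application produces. Import a prebuilt evaluation prompt (CORRECTNESS_PROMPT) from openevals, along with a helper (create_llm_as_judge) that wraps it into an LLM-as-judge evaluator, which scores the application's output. The evaluator compares three pieces of information:

inputs: what is passed to the target function (for example, the question text).
outputs: what the target function returns (for example, the model-generated answer).
reference_outputs: the ground-truth answer you attached to each dataset example when you created the dataset.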

from langsmith import Client, wrappers
from openai import OpenAI
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# Wrap the OpenAI client for LangSmith tracing
openai_client = wrappers.wrap_openai(OpenAI())

# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content.strip()}

def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",
        feedback_key="correctness",
    )
    return evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs
    )

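Run the evaluation

To run the evaluation experiment, call evaluate(...), which does the following:

Pulls the examples from the dataset.
Passes each example's inputs to the target function.
Collects the outputs (the model-generated answers).
Passes the outputs, together with the reference outputs, to the evaluators.
Records all results in LangSmith as an experiment so you can view them in the UI.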
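
The steps above can be pictured as a plain loop. The sketch below is a local, offline illustration of that flow under simplifying assumptions (no concurrency, no tracing, results kept in a list instead of being uploaded). It is not how the SDK is implemented, and `run_local_eval`, `toy_target`, and `contains_evaluator` are hypothetical names.

```python
from typing import Callable


def run_local_eval(
    examples: list[dict],
    target_fn: Callable[[dict], dict],
    evaluators: list[Callable[..., dict]],
) -> list[dict]:
    """Run every example through the target, then score it with each evaluator."""
    rows = []
    for example in examples:                          # 1. pull an example
        outputs = target_fn(example["inputs"])        # 2-3. call the target, collect outputs
        feedback = [
            evaluator(                                # 4. score against the reference
                inputs=example["inputs"],
                outputs=outputs,
                reference_outputs=example["outputs"],
            )
            for evaluator in evaluators
        ]
        rows.append({"inputs": example["inputs"], "outputs": outputs, "feedback": feedback})
    return rows                                       # 5. stand-in for the recorded experiment


def toy_target(inputs: dict) -> dict:
    """Canned stand-in for the LLM call."""
    return {"answer": "Tanzania" if "Kilimanjaro" in inputs["question"] else "unknown"}


def contains_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Score 1.0 when the answer text appears inside the reference answer."""
    hit = outputs["answer"].lower() in reference_outputs["answer"].lower()
    return {"key": "contains", "score": float(hit)}


examples = [
    {
        "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
        "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
    },
    {
        "inputs": {"question": "What is Earth's lowest point?"},
        "outputs": {"answer": "Earth's lowest point is The Dead Sea."},
    },
]

rows = run_local_eval(examples, toy_target, [contains_evaluator])
for row in rows:
    print(row["outputs"]["answer"], row["feedback"][0]["score"])
```

With the canned target, the first example scores 1.0 and the second 0.0, which is the kind of per-example feedback the experiment view shows for each evaluator key.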

from langsmith import Client, wrappers
from openai import OpenAI
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# Wrap the OpenAI client for LangSmith tracing
openai_client = wrappers.wrap_openai(OpenAI())

# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content.strip()}

def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",
        feedback_key="correctness",
    )
    return evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs
    )

# After the evaluation runs, a link is printed so you can view the results in LangSmith
def main():
    client = Client()
    experiment_results = client.evaluate(
        target,
        data="Sample dataset",
        evaluators=[
            correctness_evaluator,
            # can add multiple evaluators here
        ],
        experiment_prefix="first-eval-in-langsmith",
        max_concurrency=2,
    )
    print(experiment_results)

if __name__ == "__main__":
    main()