LLM Evaluation from Beginner to Expert - deepeval: Your First Test, Starting from Hello World

Table of Contents

    • Your First Test - Starting from Hello World
      • 3.1 Scenario: Evaluating a Customer Support Bot
      • 3.2 Basic Test: Answer Relevancy
      • 3.3 Testing with Multiple Metrics
      • 3.4 Batch Evaluation
      • 3.5 Using Pytest Parametrization

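Your First Test - Starting from Hello World

3.1 Scenario: Evaluating a Customer Support Bot

Suppose you have built a customer support bot for an e-commerce store. Let's build up its evaluation step by step.

3.2 Basic Test: Answer Relevancy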

```python
# test_customer_support.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


def test_refund_policy():
    """
    Test whether the support bot's answer about the refund policy is relevant.

    Think of an exam: the teacher asks "What is the refund policy?"
    and the student answers "We offer free shipping". That is irrelevant!
    """
    # The test passes only if the relevancy score is at least 0.7
    answer_relevancy = AnswerRelevancyMetric(threshold=0.7)

    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."]
    )

    # Raises an AssertionError if the metric falls below its threshold
    assert_test(test_case, [answer_relevancy])
```

Run it:

```bash
deepeval test run test_customer_support.py
```

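Key takeaways for beginners 🎯

AnswerRelevancyMetric cares about one thing only: does the answer directly address the question?

It does not check whether the answer is correct, only whether the answer is on topic for the question.

For example:

  • Q: "What is the refund policy?" A: "We offer 30-day refunds" → ✅ relevant

  • Q: "What is the refund policy?" A: "We were founded in 2010" → ❌ irrelevant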
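
Common pitfalls ⚠️

  • Pitfall 1: Assuming AnswerRelevancyMetric checks correctness → it does not; it only checks relevance!

  • Pitfall 2: Supplying retrieval_context but leaving out actual_output → AnswerRelevancyMetric requires input and actual_output; retrieval_context is optional for this metric and only matters for context-based metrics such as FaithfulnessMetric

  • Pitfall 3: Asking questions in Chinese when the evaluation LLM does not handle Chinese well → make sure your evaluation model supports the target language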
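
To make "relevant, not necessarily correct" concrete, here is a toy sketch of the scoring idea. deepeval's documentation describes the answer relevancy score as the number of relevant statements divided by the total number of statements in the answer, with an LLM acting as the judge. The keyword-based `is_relevant` below is a hypothetical stand-in for that LLM judge, and all function names are invented for illustration; this is not deepeval's implementation.

```python
import re


def split_statements(answer: str) -> list[str]:
    """Naively split an answer into sentences (the real metric uses an LLM)."""
    return [s.strip() for s in re.split(r"[.!?]+", answer) if s.strip()]


def is_relevant(statement: str, keywords: set[str]) -> bool:
    """Hypothetical judge: a statement is relevant if it mentions a topic keyword."""
    words = set(re.findall(r"[a-z0-9]+", statement.lower()))
    return bool(words & keywords)


def relevancy_score(answer: str, keywords: set[str]) -> float:
    """Score = relevant statements / total statements."""
    statements = split_statements(answer)
    if not statements:
        return 0.0
    relevant = sum(is_relevant(s, keywords) for s in statements)
    return relevant / len(statements)


refund_keywords = {"refund", "refunds", "return", "returns"}
threshold = 0.7

on_topic = "We offer a 30-day full refund. Returns are free."
off_topic = "We were founded in 2010. We offer a 30-day full refund."
# Factually wrong for this shop, yet still on topic
wrong_but_relevant = "We offer a 365-day full refund."

for answer in (on_topic, off_topic, wrong_but_relevant):
    score = relevancy_score(answer, refund_keywords)
    print(f"{score:.2f} passed={score >= threshold} | {answer}")
```

Note that the factually wrong answer still scores as fully relevant, which is exactly Pitfall 1: catching it takes a different metric.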

3.3 Testing with Multiple Metrics

```python
# test_rag_pipeline.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric
)


def test_rag_response():
    """
    Evaluate the quality of a RAG system's answer from several angles.

    Think of judging a dish in a restaurant:
    - AnswerRelevancy: is this the dish the customer actually ordered?
    - Faithfulness: did the chef stick to the recipe, or make things up?
    - ContextualRelevancy: were the right ingredients pulled from the pantry?
    """
    # Each metric has its own threshold; all of them must pass
    metrics = [
        AnswerRelevancyMetric(threshold=0.7),      # answer vs. question
        FaithfulnessMetric(threshold=0.8),         # answer vs. retrieved context
        ContextualRelevancyMetric(threshold=0.6)   # retrieved context vs. question
    ]

    test_case = LLMTestCase(
        input="What is the return policy for electronics?",
        actual_output="Electronics can be returned within 14 days of purchase with the original receipt. Items must be in original packaging and unused condition.",
        retrieval_context=[
            "Return Policy: Electronics must be returned within 14 days.",
            "Original receipt is required for all returns.",
            "Items must be in original packaging and unused condition."
        ]
    )

    assert_test(test_case, metrics)
```
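
When several metrics are passed to `assert_test`, the test case passes only if every metric reaches its own threshold. The sketch below illustrates that pass/fail rule with made-up numbers. The `MetricOutcome` class, the `failed_metrics` helper, and the scores are all hypothetical; real scores come from the evaluation LLM.

```python
from dataclasses import dataclass


@dataclass
class MetricOutcome:
    """Hypothetical record of one metric's score and its threshold."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def failed_metrics(outcomes: list[MetricOutcome]) -> list[str]:
    """Return the names of all metrics that fell below their threshold."""
    return [o.name for o in outcomes if not o.passed]


# Invented scores, using the same thresholds as the test above
outcomes = [
    MetricOutcome("Answer Relevancy", score=0.92, threshold=0.7),
    MetricOutcome("Faithfulness", score=0.75, threshold=0.8),
    MetricOutcome("Contextual Relevancy", score=0.66, threshold=0.6),
]

failures = failed_metrics(outcomes)
# One failing metric is enough to fail the whole test case
print("PASS" if not failures else f"FAIL: {', '.join(failures)}")
```

With these numbers the case fails on Faithfulness alone, even though the other two metrics clear their thresholds.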

3.4 Batch Evaluation

```python
# test_batch.py
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# A batch of test cases
test_cases = [
    LLMTestCase(
        input="How do I track my order?",
        actual_output="You can track your order by logging into your account..."
    ),
    LLMTestCase(
        input="What payment methods do you accept?",
        actual_output="We accept credit cards, PayPal, and Apple Pay."
    ),
    LLMTestCase(
        input="Do you ship internationally?",
        actual_output="Yes, we ship to over 50 countries worldwide."
    )
]

# Evaluate the whole batch in one call
results = evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric()]
)

# Inspect the results.
# Assumption: a recent deepeval release, where evaluate() returns an
# EvaluationResult whose .test_results list holds one TestResult per case.
# Older releases returned the list directly; adjust if your version differs.
for result in results.test_results:
    print(f"Input: {result.input}")
    print(f"Passed: {result.success}")
    print(f"Score: {result.metrics_data[0].score}")
    print("---")
```
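
Once a batch has been scored, a pass rate and the list of failing inputs are often more useful than per-case prints. Here is a small sketch over plain records. `CaseResult` is a hypothetical stand-in for whatever result objects your deepeval version returns, `summarize` is an invented helper, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case result: the input, its score, and pass/fail."""
    input: str
    score: float
    success: bool


def summarize(results: list[CaseResult]) -> dict:
    """Aggregate a batch into counts, a pass rate, and the failing inputs."""
    total = len(results)
    passed = sum(r.success for r in results)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failed_inputs": [r.input for r in results if not r.success],
    }


# Invented results for the three questions in the batch above
batch = [
    CaseResult("How do I track my order?", 0.9, True),
    CaseResult("What payment methods do you accept?", 0.4, False),
    CaseResult("Do you ship internationally?", 1.0, True),
]

summary = summarize(batch)
print(f"{summary['passed']}/{summary['total']} passed ({summary['pass_rate']:.0%})")
for failed in summary["failed_inputs"]:
    print(f"Needs attention: {failed}")
```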

3.5 Using Pytest Parametrization

```python
# test_parametrized.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Test data: (question, bot answer) pairs
test_data = [
    ("How do I reset my password?", "Go to settings and click 'Reset Password'."),
    ("What are your business hours?", "We're open 9 AM to 6 PM, Monday to Friday."),
    ("Do you have a mobile app?", "Yes, our app is available on iOS and Android."),
]

# pytest generates one independent test per (input_text, actual_output) pair
@pytest.mark.parametrize("input_text,actual_output", test_data)
def test_faq_relevancy(input_text, actual_output):
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(input=input_text, actual_output=actual_output)
    assert_test(test_case, [metric])
```
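
As the FAQ list grows, it can be easier to keep the question/answer pairs in a data file and build the `parametrize` argument list from it. This sketch uses only the standard library. The JSON layout, the `load_faq_cases` helper, and the inline `RAW_CASES` string (standing in for a file on disk) are assumptions for illustration.

```python
import json

# Stand-in for the contents of a JSON file of FAQ cases
RAW_CASES = """
[
  {"input": "How do I reset my password?",
   "actual_output": "Go to settings and click 'Reset Password'."},
  {"input": "What are your business hours?",
   "actual_output": "We're open 9 AM to 6 PM, Monday to Friday."}
]
"""


def load_faq_cases(raw: str) -> list[tuple[str, str]]:
    """Turn JSON records into (input, actual_output) tuples for parametrize."""
    cases = []
    for record in json.loads(raw):
        # Fail early with a clear message if a record is incomplete
        missing = {"input", "actual_output"} - set(record)
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
        cases.append((record["input"], record["actual_output"]))
    return cases


test_data = load_faq_cases(RAW_CASES)
print(f"Loaded {len(test_data)} FAQ cases")
```

The resulting `test_data` list has the same shape as the hand-written one, so it can be passed to `@pytest.mark.parametrize("input_text,actual_output", test_data)` unchanged.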