Table of Contents
- Your First Test: Starting from Hello World
  - 3.1 Scenario: Evaluating a Customer Support Bot
  - 3.2 Basic Test: Answer Relevancy
  - 3.3 Multi-Metric Testing
  - 3.4 Batch Evaluation
  - 3.5 Using Pytest Parametrization

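Your First Test: Starting from Hello World
3.1 Scenario: Evaluating a Customer Support Bot
Suppose you have built a customer support bot for an e-commerce site. We will build up its evaluation step by step.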
3.2 Basic Test: Answer Relevancy
python
# test_customer_support.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


def test_refund_policy():
    """
    Test whether the support bot's answer about the refund policy is relevant.

    Imagine an exam: the teacher asks "What is the refund policy?"
    and the student answers "We offer free shipping". That is irrelevant!
    """
    answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        # Not needed by AnswerRelevancyMetric; kept for context-based metrics
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    )
    assert_test(test_case, [answer_relevancy])
Run it:
bash
deepeval test run test_customer_support.py
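Key points for beginners 🎯
`AnswerRelevancyMetric` cares about one thing only: does the answer directly address the question? It does not check whether the answer is correct, only whether the answer is on topic.
For example:
- Q: "What is the refund policy?" A: "We offer 30-day refunds" → ✅ relevant
- Q: "What is the refund policy?" A: "We were founded in 2010" → ❌ irrelevant

Common pitfalls ⚠️
- Pitfall 1: Assuming AnswerRelevancy checks correctness → it does not; it only checks relevance.
- Pitfall 2: Providing `retrieval_context` but leaving out `actual_output` → this metric needs `input` and `actual_output`; `retrieval_context` is only required by context-based metrics such as Faithfulness.
- Pitfall 3: Asking questions in Chinese when the evaluation LLM does not support Chinese → make sure your evaluation model supports the target language.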
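To make the difference between relevance and correctness concrete, here is a toy sketch. It assumes the score is the fraction of statements in the answer that relate to the question, which is how I understand DeepEval's description of the metric. The keyword-overlap judge `is_relevant` and the helper `toy_relevancy_score` are hypothetical stand-ins for the LLM judge and are not part of DeepEval.

```python
import re

# Words ignored by the toy judge (an arbitrary list chosen for this example)
STOPWORDS = {"what", "is", "the", "a", "we", "our", "in", "were", "of", "and"}


def keywords(text):
    """Lowercase the text and return its non-stopword tokens as a set."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def is_relevant(question, statement):
    """Hypothetical judge: relevant if the statement shares a keyword with the question."""
    return bool(keywords(question) & keywords(statement))


def toy_relevancy_score(question, answer):
    """Fraction of sentences in the answer that the toy judge marks as relevant."""
    statements = [s.strip() for s in re.split(r"[.!?]", answer) if s.strip()]
    if not statements:
        return 0.0
    relevant = sum(is_relevant(question, s) for s in statements)
    return relevant / len(statements)


question = "What is the refund policy?"
on_topic = "We offer a 30-day refund."
off_topic = "We were founded in 2010."
wrong_but_on_topic = "Our refund window is 500 days."  # factually dubious, still on topic

for answer in (on_topic, off_topic, wrong_but_on_topic):
    print(answer, "->", toy_relevancy_score(question, answer))
```

The third answer scores as high as the first even though its content is questionable, which is the point of Pitfall 1: a relevance score says nothing about correctness.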
3.3 Multi-Metric Testing
python
# test_rag_pipeline.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
)


def test_rag_response():
    """
    Evaluate the answer quality of a RAG system across several dimensions.

    Think of it as judging a dish:
    - AnswerRelevancy: does the dish match what the customer asked for?
    - Faithfulness: did the chef make anything up (claims the context does not support)?
    - ContextualRelevancy: were the right ingredients fetched (is the retrieved context relevant)?
    """
    metrics = [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
        ContextualRelevancyMetric(threshold=0.6),
    ]
    test_case = LLMTestCase(
        input="What is the return policy for electronics?",
        actual_output="Electronics can be returned within 14 days of purchase with the original receipt. Items must be in original packaging and unused condition.",
        retrieval_context=[
            "Return Policy: Electronics must be returned within 14 days.",
            "Original receipt is required for all returns.",
            "Items must be in original packaging and unused condition.",
        ],
    )
    # The test fails if any one of the metrics falls below its threshold
    assert_test(test_case, metrics)
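With several metrics, the test passes only when every metric reaches its own threshold. The sketch below illustrates that rule in plain Python. The helper names `failing_metrics` and `passes` are hypothetical, and the scores are made-up numbers for illustration, not output from a real run; only the thresholds come from the test above.

```python
# Thresholds taken from test_rag_response above
THRESHOLDS = {
    "AnswerRelevancy": 0.7,
    "Faithfulness": 0.8,
    "ContextualRelevancy": 0.6,
}


def failing_metrics(scores, thresholds):
    """Return the names of metrics whose score is below its threshold."""
    return [name for name, limit in thresholds.items() if scores[name] < limit]


def passes(scores, thresholds):
    """A test case passes only if no metric falls below its threshold."""
    return not failing_metrics(scores, thresholds)


# Made-up scores for illustration only
all_good = {"AnswerRelevancy": 0.9, "Faithfulness": 0.85, "ContextualRelevancy": 0.7}
one_low = {"AnswerRelevancy": 0.9, "Faithfulness": 0.75, "ContextualRelevancy": 0.7}

print(passes(all_good, THRESHOLDS), failing_metrics(all_good, THRESHOLDS))
print(passes(one_low, THRESHOLDS), failing_metrics(one_low, THRESHOLDS))
```

In the second case a single low Faithfulness score is enough to fail the whole test case, so it pays to tune thresholds per metric rather than reuse one value everywhere.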
3.4 Batch Evaluation
python
# test_batch.py
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# A batch of test cases
test_cases = [
    LLMTestCase(
        input="How do I track my order?",
        actual_output="You can track your order by logging into your account..."
    ),
    LLMTestCase(
        input="What payment methods do you accept?",
        actual_output="We accept credit cards, PayPal, and Apple Pay."
    ),
    LLMTestCase(
        input="Do you ship internationally?",
        actual_output="Yes, we ship to over 50 countries worldwide."
    ),
]
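
# Run the batch evaluation
results = evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric()]
)

# Inspect the results.
# Recent DeepEval releases return an EvaluationResult whose `test_results`
# list holds one entry per test case; adjust if your version differs.
for result in results.test_results:
    print(f"Input: {result.input}")
    print(f"Passed: {result.success}")
    print(f"Score: {result.metrics_data[0].score}")
    print("---")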
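Printing every case is fine for three inputs, but a larger batch usually calls for a summary. The sketch below aggregates a pass rate and a mean score. `CaseResult` and `summarize` are hypothetical names that mirror the three fields printed in the loop above; they are not DeepEval classes, and the sample scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical record with the fields the loop above prints."""
    input: str
    success: bool
    score: float


def summarize(results):
    """Return the total, pass count, pass rate, and mean score for a batch."""
    total = len(results)
    passed = sum(r.success for r in results)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "mean_score": sum(r.score for r in results) / total if total else 0.0,
    }


# Invented scores, not output from a real evaluation
sample = [
    CaseResult("How do I track my order?", True, 1.0),
    CaseResult("What payment methods do you accept?", True, 1.0),
    CaseResult("Do you ship internationally?", False, 0.25),
]

summary = summarize(sample)
print(summary)
```

To use this with real results, you would build one `CaseResult` per evaluated test case from whatever fields your DeepEval version exposes.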
3.5 Using Pytest Parametrization
python
# test_parametrized.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Test data: (question, bot answer) pairs
test_data = [
    ("How do I reset my password?", "Go to settings and click 'Reset Password'."),
    ("What are your business hours?", "We're open 9 AM to 6 PM, Monday to Friday."),
    ("Do you have a mobile app?", "Yes, our app is available on iOS and Android."),
]


# Each tuple becomes its own test case in the pytest report
@pytest.mark.parametrize("input_text,actual_output", test_data)
def test_faq_relevancy(input_text, actual_output):
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(input=input_text, actual_output=actual_output)
    assert_test(test_case, [metric])
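With long question strings as parameters, the default test IDs in the pytest report can be hard to read. One option is to derive short IDs from the questions. The helper `make_test_id` below is a hypothetical name introduced for this sketch; it is plain Python and does not depend on pytest or DeepEval.

```python
import re

# Same FAQ pairs as in test_parametrized.py
test_data = [
    ("How do I reset my password?", "Go to settings and click 'Reset Password'."),
    ("What are your business hours?", "We're open 9 AM to 6 PM, Monday to Friday."),
    ("Do you have a mobile app?", "Yes, our app is available on iOS and Android."),
]


def make_test_id(text, max_len=40):
    """Turn a question into a short snake_case identifier."""
    slug = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    return slug[:max_len].rstrip("_")


test_ids = [make_test_id(question) for question, _ in test_data]
print(test_ids)
```

Passing `ids=test_ids` as an extra argument to `pytest.mark.parametrize` makes each case show up under its readable ID instead of the raw parameter values.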