After you tweak the prompt, swap the model, bolt on RAG, or add tool calling, the hard part is usually not how to make the change, but:
How do you prove the change actually made things better?
This post gives you a minimal evaluation loop you can put into practice directly:
- Offline: an `eval.jsonl` regression set (get 50 items running first)
- Automated scoring: LLM-as-judge (rubric + JSON validation)
- Online: instrumentation on key fields (so you can answer "where is it slow / expensive / wrong")
- Release: gates + canary rollout + rollback
0) Directory layout (feel free to copy as-is)
```text
eval/
  eval.jsonl
  run_eval.py
  judge.py
  aggregate.py
```
1) Eval set format (eval.jsonl)
Don't write out the full gold answer; write key points instead (more stable, easier to score):
```json
{"id":"q001","type":"chat","question":"...","gold_keypoints":["...","..."],"notes":"scoring notes"}
{"id":"q101","type":"rag","question":"...","gold_keypoints":["..."],"gold_evidence_ids":["docA#p3"]}
{"id":"q201","type":"tool","question":"...","gold_keypoints":["..."],"expected_tool_calls":["get_order_status"]}
```
2) A unified LLM call (OpenAI-compatible)
```bash
pip install -U openai
```
```python
# run_eval.py
from openai import OpenAI
import time

def call_llm(base_url: str, api_key: str, model: str, messages: list[dict]) -> dict:
    """One chat completion; returns the text plus latency and token usage for logging."""
    client = OpenAI(api_key=api_key, base_url=base_url)
    t0 = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    latency_ms = int((time.time() - t0) * 1000)
    usage = getattr(resp, "usage", None)
    return {
        "text": resp.choices[0].message.content,
        "latency_ms": latency_ms,
        "usage": usage.model_dump() if usage else None,
    }
```
Note: if you're using an OpenAI-compatible gateway (e.g. the model-aggregation platform 147ai), usually all you change is `base_url=https://147ai.com/v1`; the endpoint is `POST /v1/chat/completions` and auth is `Authorization: Bearer <KEY>` (defer to the console/docs).
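A quick smoke test against such a gateway might look like this (the model name and env-var name are placeholders of ours, not anything the platform prescribes):

```python
import os
from run_eval import call_llm

# Placeholder model/env names; substitute whatever your provider actually serves.
out = call_llm(
    base_url="https://147ai.com/v1",
    api_key=os.environ["LLM_API_KEY"],
    model="your-model-name",
    messages=[{"role": "user", "content": "ping"}],
)
print(out["latency_ms"], "ms:", out["text"])
```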
3) LLM-as-judge (rubric + JSON validation)
The judge must be pinned to a fixed output format, otherwise you can't automate scoring.
```python
# judge.py
import json
from run_eval import call_llm

RUBRIC = """You are an evaluator. Score the answer against the rules below and output JSON only (no extra text):
{
  "keypoint_covered": 0,
  "keypoint_total": 0,
  "critical_error": false,
  "rationale": "one-sentence explanation"
}
Scoring rules:
1) keypoint_covered: how many gold_keypoints the answer covers (a paraphrase counts as covered)
2) critical_error: true if there is a serious factual error or the answer would mislead a decision
Output JSON only."""

def robust_json_parse(s: str) -> dict:
    # Minimal version: grab the first {...} span; in production, validate against a JSON schema.
    l = s.find("{")
    r = s.rfind("}")
    if l == -1 or r == -1 or r <= l:
        raise ValueError("no json object")
    return json.loads(s[l:r+1])

def judge(base_url: str, api_key: str, judge_model: str, question: str, answer: str, gold_keypoints: list[str]) -> dict:
    messages = [
        {"role": "system", "content": RUBRIC},
        {"role": "user", "content": json.dumps({
            "question": question,
            "answer": answer,
            "gold_keypoints": gold_keypoints
        }, ensure_ascii=False)}
    ]
    out = call_llm(base_url, api_key, judge_model, messages)
    parsed = robust_json_parse(out["text"])
    return {
        "parsed": parsed,
        "raw": out["text"],
        "latency_ms": out["latency_ms"],
        "usage": out["usage"],
    }
```
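The comment in `robust_json_parse` points at the production gap. One way to close it: validate the parsed fields against the rubric and retry on failure. A minimal sketch to append to judge.py (`validate_judge_output` and `judge_with_retry` are our own helpers, not library APIs):

```python
# Hypothetical hardening layer; field names and types follow RUBRIC above.
REQUIRED_FIELDS = {"keypoint_covered": int, "keypoint_total": int,
                   "critical_error": bool, "rationale": str}

def validate_judge_output(parsed: dict) -> None:
    for key, typ in REQUIRED_FIELDS.items():
        if not isinstance(parsed.get(key), typ):
            raise ValueError(f"bad or missing judge field: {key}")

def judge_with_retry(*args, retries: int = 2, **kwargs) -> dict:
    last_err = None
    for _ in range(retries + 1):
        try:
            result = judge(*args, **kwargs)
            validate_judge_output(result["parsed"])
            return result
        except ValueError as e:  # covers json.JSONDecodeError too
            last_err = e         # malformed output: ask the judge again
    raise RuntimeError(f"judge kept returning invalid output: {last_err}")
```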
4) Run the regression: eval.jsonl → results.jsonl
```python
# run_eval.py (continued)
import json

def run_eval(eval_path: str, out_path: str, base_url: str, api_key: str,
             model_under_test: str, judge_model: str):
    # Imported inside the function to avoid a circular import:
    # judge.py imports call_llm from this module.
    from judge import judge
    with open(eval_path, "r", encoding="utf-8") as fin:
        rows = [json.loads(line) for line in fin]
    with open(out_path, "w", encoding="utf-8") as f:
        for r in rows:
            q = r["question"]
            msgs = [
                {"role": "system", "content": "You are a rigorous technical assistant. Answer point by point."},
                {"role": "user", "content": q},
            ]
            ans = call_llm(base_url, api_key, model_under_test, msgs)
            j = judge(base_url, api_key, judge_model, q, ans["text"], r.get("gold_keypoints", []))
            f.write(json.dumps({
                "id": r.get("id"),
                "type": r.get("type"),
                "answer": ans,
                "judge": j,
            }, ensure_ascii=False) + "\n")
```
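Wiring it up is a single call (env-var names and model names below are placeholders):

```python
import os
from run_eval import run_eval

run_eval(
    eval_path="eval.jsonl",
    out_path="results.jsonl",
    base_url=os.environ["LLM_BASE_URL"],
    api_key=os.environ["LLM_API_KEY"],
    model_under_test="candidate-model",  # the model/prompt version you just changed
    judge_model="judge-model",           # ideally stronger than the candidate
)
```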
5) Aggregate metrics: turn results into numbers you can gate on
```python
# aggregate.py
import json

def agg(path: str):
    with open(path, "r", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    n = len(rows)
    covered = 0
    total = 0
    critical_err = 0
    latencies = []
    for r in rows:
        j = r["judge"]["parsed"]
        covered += int(j.get("keypoint_covered", 0))
        total += int(j.get("keypoint_total", 0))
        critical_err += 1 if j.get("critical_error") else 0
        latencies.append(int(r["answer"].get("latency_ms", 0)))
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0
    print({
        "n": n,
        "keypoint_coverage": (covered / total) if total else None,
        "critical_error_rate": (critical_err / n) if n else None,
        "latency_p95_ms": p95,
    })

if __name__ == "__main__":
    agg("results.jsonl")
```
Gate recommendation: use `critical_error_rate` as the hard gate first; then look at coverage and P95 latency/cost.
6) Online metrics (minimal field set)
Offline eval tells you whether quality regressed; online telemetry has to tell you why:
- Version fields: `model`, `model_version`, `prompt_version`, `retrieval_onoff`
- Experience/cost: `latency_ms` (P50/P95), `input_tokens` / `output_tokens` / `total_tokens`
- Stability: `error_code`, `retry_count`, `fallback`
- RAG/tools: `retrieval_k`, `hit@k` (if available), `tool_calls`, `tool_success`
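Concretely, one JSON log line per request might look like this (field names follow the list above; all values are made up):

```json
{"ts": "2025-01-01T12:00:00Z", "model": "candidate-model", "model_version": "v7",
 "prompt_version": "p12", "retrieval_onoff": true, "latency_ms": 840,
 "input_tokens": 512, "output_tokens": 230, "total_tokens": 742,
 "error_code": null, "retry_count": 0, "fallback": false,
 "retrieval_k": 5, "tool_calls": 1, "tool_success": true}
```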
7) Release gates (offline + canary + rollback)
A template you can execute on:
- Offline regression: critical error rate must not rise, coverage must not drop, tokens/P95 stay within budget
- Canary: 1% → 5% → 20%, with anomaly alerting at each step
- Rollback: one-click rollback of `model` / `prompt_version` / `retrieval_onoff`
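To make the offline gate executable in CI, a small script over results.jsonl can exit nonzero on failure. A sketch (gate.py and every threshold here are our own; calibrate them against your baseline run):

```python
# gate.py: fail the pipeline if the regression run breaks the gates.
import json
import sys

MAX_CRITICAL_ERROR_RATE = 0.0   # hard gate: no critical errors allowed
MIN_KEYPOINT_COVERAGE = 0.80    # illustrative; set from your baseline
MAX_LATENCY_P95_MS = 3000       # illustrative budget

def gate(path: str = "results.jsonl") -> int:
    with open(path, "r", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    parsed = [r["judge"]["parsed"] for r in rows]
    critical_rate = sum(1 for p in parsed if p.get("critical_error")) / len(rows)
    covered = sum(int(p.get("keypoint_covered", 0)) for p in parsed)
    total = sum(int(p.get("keypoint_total", 0)) for p in parsed) or 1
    lats = sorted(int(r["answer"].get("latency_ms", 0)) for r in rows)
    p95 = lats[int(0.95 * (len(lats) - 1))]
    failures = []
    if critical_rate > MAX_CRITICAL_ERROR_RATE:
        failures.append(f"critical_error_rate={critical_rate:.3f}")
    if covered / total < MIN_KEYPOINT_COVERAGE:
        failures.append(f"keypoint_coverage={covered / total:.3f}")
    if p95 > MAX_LATENCY_P95_MS:
        failures.append(f"latency_p95_ms={p95}")
    print("GATE FAIL:" if failures else "GATE PASS", ", ".join(failures))
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate())
```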
If you want to take this setup further into production, my suggested next steps:
- Persist results to a database (so you can compare across versions)
- Schema-validate judge output + retry + spot-check a sample by hand (the sketch after judge.py is a starting point)
- Add RAG-specific metrics: hit@k / cite_acc (evaluate retrieval first, then generation; see the sketch below)
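Of these, hit@k is small enough to sketch inline (cite_acc depends on your citation format, so it's omitted). This assumes the `gold_evidence_ids` field from section 1 plus the ranked ids your retriever returns:

```python
# Hypothetical hit@k: is any gold evidence chunk among the top-k retrieved ids?
def hit_at_k(retrieved_ids: list[str], gold_evidence_ids: list[str], k: int = 5) -> bool:
    top_k = set(retrieved_ids[:k])
    return any(g in top_k for g in gold_evidence_ids)

# docA#p3 retrieved at rank 2 -> a hit at k=5
assert hit_at_k(["docB#p1", "docA#p3", "docC#p7"], ["docA#p3"], k=5)
```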