A Closed-Loop Evaluation Setup for LLM Applications: eval.jsonl + LLM-as-Judge + Online Metrics (with a Minimal Python Implementation)

After you tweak the prompt, swap the model, add RAG, or add tool calling, the hardest question is usually not "how do I make the change" but:

How do you prove the change actually made things better?

This post gives you a minimal evaluation loop you can actually deploy:

  • Offline: an eval.jsonl regression set (start with 50 items)
  • Automated scoring: LLM-as-judge (rubric + JSON validation)
  • Online: logging of key fields (so you can answer "where is it slow / expensive / wrong")
  • Release: gating + canary rollout + rollback

0) Directory layout (feel free to copy it as-is)

```text
eval/
  eval.jsonl
  run_eval.py
  judge.py
  aggregate.py
```

1) Evaluation set format (eval.jsonl)

Don't write out a full "reference answer"; write key points instead (more stable, and easier to score):

```json
{"id":"q001","type":"chat","question":"...","gold_keypoints":["...","..."],"notes":"判分口径"}
{"id":"q101","type":"rag","question":"...","gold_keypoints":["..."],"gold_evidence_ids":["docA#p3"]}
{"id":"q201","type":"tool","question":"...","gold_keypoints":["..."],"expected_tool_calls":["get_order_status"]}

2) A unified LLM call (OpenAI-compatible)

```bash
pip install -U openai
```

```python
# run_eval.py
from openai import OpenAI
import time

def call_llm(base_url: str, api_key: str, model: str, messages: list[dict]) -> dict:
    client = OpenAI(api_key=api_key, base_url=base_url)
    t0 = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    latency_ms = int((time.time() - t0) * 1000)  # wall-clock latency of the call
    usage = getattr(resp, "usage", None)  # token accounting, if the endpoint returns it
    return {
        "text": resp.choices[0].message.content,
        "latency_ms": latency_ms,
        "usage": usage.model_dump() if usage else None,
    }
```

Note: if you are calling an OpenAI-compatible gateway (e.g., the model aggregation platform 147ai), you usually only change base_url=https://147ai.com/v1; the endpoint is POST /v1/chat/completions and auth is Authorization: Bearer <KEY> (defer to the console/docs).
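
As a usage sketch, pointing call_llm at such a gateway could look like this (the environment variable names and the model id are placeholders, not taken from any provider's docs):

```python
# sketch: pointing call_llm at an OpenAI-compatible gateway
# (env var names and the model id are placeholders, not from any docs)
import os
from run_eval import call_llm

out = call_llm(
    base_url=os.environ.get("LLM_BASE_URL", "https://147ai.com/v1"),
    api_key=os.environ["LLM_API_KEY"],
    model="some-model-id",  # placeholder: use a model your gateway actually exposes
    messages=[{"role": "user", "content": "ping"}],
)
print(out["latency_ms"], "ms:", out["text"])
```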


3) LLM-as-judge (rubric + JSON validation)

The judge must be pinned to a strict output format, otherwise nothing downstream can be automated.

```python
# judge.py
import json
from run_eval import call_llm

RUBRIC = """你是评测员。对答案按规则输出 JSON(不要输出多余文本):
{
  "keypoint_covered": 0,
  "keypoint_total": 0,
  "critical_error": false,
  "rationale": "一句话解释"
}
判定说明:
1) keypoint_covered:答案覆盖了多少 gold_keypoints(只要表达同义即可算覆盖)
2) critical_error:出现严重事实错误/误导决策则为 true
只允许输出 JSON。"""

def robust_json_parse(s: str) -> dict:
    # Minimal version: grab the span from the first '{' to the last '}';
    # in production, add stricter JSON schema validation
    l = s.find("{")
    r = s.rfind("}")
    if l == -1 or r == -1 or r <= l:
        raise ValueError("no json object")
    return json.loads(s[l:r+1])

def judge(base_url: str, api_key: str, judge_model: str, question: str, answer: str, gold_keypoints: list[str]) -> dict:
    messages = [
        {"role": "system", "content": RUBRIC},
        {"role": "user", "content": json.dumps({
            "question": question,
            "answer": answer,
            "gold_keypoints": gold_keypoints
        }, ensure_ascii=False)}
    ]
    out = call_llm(base_url, api_key, judge_model, messages)
    parsed = robust_json_parse(out["text"])
    return {
        "parsed": parsed,
        "raw": out["text"],
        "latency_ms": out["latency_ms"],
        "usage": out["usage"],
    }
```
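
robust_json_parse only guards against syntactically broken JSON. For production, the next-steps list at the end of this post suggests schema validation plus retries; here is a minimal stdlib-only sketch that could live in judge.py (both helpers, validate_verdict and judge_with_retry, are hypothetical):

```python
# sketch: field-level checks + retry around judge (both helpers are hypothetical)
def validate_verdict(v: dict) -> None:
    assert isinstance(v.get("keypoint_covered"), int), "keypoint_covered must be an int"
    assert isinstance(v.get("keypoint_total"), int), "keypoint_total must be an int"
    assert isinstance(v.get("critical_error"), bool), "critical_error must be a bool"
    assert 0 <= v["keypoint_covered"] <= v["keypoint_total"], "covered out of range"

def judge_with_retry(*args, retries: int = 2, **kwargs) -> dict:
    last_err = None
    for _ in range(retries + 1):
        try:
            out = judge(*args, **kwargs)
            validate_verdict(out["parsed"])  # reject structurally wrong verdicts
            return out
        except (ValueError, AssertionError) as e:
            last_err = e  # malformed judge output: retry from scratch
    raise RuntimeError(f"judge failed after {retries + 1} attempts: {last_err}")
```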

4) Run the regression: eval.jsonl → results.jsonl

```python
# run_eval.py (continued)
import json
# note: keep this import below call_llm's definition -- judge.py imports
# call_llm back from this module, so the ordering avoids a circular-import error
from judge import judge

def run_eval(eval_path: str, out_path: str, base_url: str, api_key: str, model_under_test: str, judge_model: str):
    rows = [json.loads(l) for l in open(eval_path, "r", encoding="utf-8")]
    with open(out_path, "w", encoding="utf-8") as f:
        for r in rows:
            q = r["question"]
            msgs = [
                {"role": "system", "content": "你是严谨的技术助手。请按要点回答。"},
                {"role": "user", "content": q},
            ]
            ans = call_llm(base_url, api_key, model_under_test, msgs)
            j = judge(base_url, api_key, judge_model, q, ans["text"], r.get("gold_keypoints", []))
            f.write(json.dumps({
                "id": r.get("id"),
                "type": r.get("type"),
                "answer": ans,
                "judge": j,
            }, ensure_ascii=False) + "\n")
```
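
A small entry point makes the regression runnable from the command line; a sketch (the paths, env var names, and model ids are placeholders):

```python
# sketch: entry point for the regression run (paths / env vars / model ids are placeholders)
import os

if __name__ == "__main__":
    run_eval(
        eval_path="eval/eval.jsonl",
        out_path="results.jsonl",
        base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        api_key=os.environ["LLM_API_KEY"],
        model_under_test="candidate-model",  # placeholder: the version under test
        judge_model="judge-model",           # placeholder: ideally a stronger model
    )
```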

5) Aggregate metrics: turn results into numbers you can gate on

```python
# aggregate.py
import json

def agg(path: str):
    rows = [json.loads(l) for l in open(path, "r", encoding="utf-8")]
    n = len(rows)
    covered = 0
    total = 0
    critical_err = 0
    p95 = 0
    latencies = []
    for r in rows:
        j = r["judge"]["parsed"]
        covered += int(j.get("keypoint_covered", 0))
        total += int(j.get("keypoint_total", 0))
        critical_err += 1 if j.get("critical_error") else 0
        latencies.append(int(r["answer"].get("latency_ms", 0)))
    latencies.sort()
    if latencies:
        p95 = latencies[int(0.95 * (len(latencies) - 1))]  # nearest-rank P95
    metrics = {
        "n": n,
        "keypoint_coverage": (covered / total) if total else None,
        "critical_error_rate": (critical_err / n) if n else None,
        "latency_p95_ms": p95,
    }
    print(metrics)
    return metrics  # returned so a gate script can consume the numbers (see below)

if __name__ == "__main__":
    agg("results.jsonl")

Gating advice: make critical_error_rate the hard gate first; then watch coverage plus P95 latency/cost.
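
To make that gate executable in CI, exit nonzero when a threshold is breached; a sketch building on the metrics dict that agg() returns (the threshold numbers are placeholders to tune against your own baseline):

```python
# sketch: gate.py -- fail CI when metrics breach thresholds (thresholds are made up)
import sys
from aggregate import agg

def gate(metrics: dict) -> None:
    failures = []
    if (metrics.get("critical_error_rate") or 0) > 0.02:
        failures.append("critical_error_rate > 2%")
    if (metrics.get("keypoint_coverage") or 0) < 0.80:
        failures.append("keypoint_coverage < 80%")
    if metrics.get("latency_p95_ms", 0) > 8000:
        failures.append("latency_p95_ms > 8000")
    if failures:
        print("GATE FAILED:", "; ".join(failures))
        sys.exit(1)  # nonzero exit blocks the release in CI
    print("GATE PASSED")

if __name__ == "__main__":
    gate(agg("results.jsonl"))
```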


6) Online metrics (the minimal field set)

Offline evals tell you whether quality regressed; online metrics have to tell you why (see the logging sketch after this list):

  • Version fields: model, model_version, prompt_version, retrieval_onoff
  • Experience/cost: latency_ms (P50/P95), input_tokens / output_tokens / total_tokens
  • Stability: error_code, retry_count, fallback
  • RAG/tools: retrieval_k, (if applicable) hit@k, tool_calls, tool_success
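
Concretely, emitting one JSON log line per request with those fields is enough to slice by version later; a sketch (the log_llm_call helper and all the sample values are made up):

```python
# sketch: one JSON log line per request (log_llm_call and all values are made up)
import json, sys, time

def log_llm_call(**fields) -> None:
    record = {"ts": int(time.time() * 1000), **fields}
    sys.stdout.write(json.dumps(record, ensure_ascii=False) + "\n")

log_llm_call(
    model="some-model", model_version="2025-01", prompt_version="p_v12",
    retrieval_onoff=True, latency_ms=1430,
    input_tokens=812, output_tokens=256, total_tokens=1068,
    error_code=None, retry_count=0, fallback=False,
    retrieval_k=5, tool_calls=1, tool_success=True,
)
```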

7) Release gating (offline + canary + rollback)

An executable template:

  1. Offline regression: critical error rate must not rise, coverage must not drop, tokens/P95 stay within budget
  2. Canary: 1% → 5% → 20%, with anomaly alerts at each step
  3. Rollback: one click to roll back model / prompt_version / retrieval_onoff (see the sketch below)
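
One-click rollback is easiest when all three knobs live in a single versioned config selected by an environment variable; a sketch (the config shape and the RELEASE_TAG variable are assumptions):

```python
# sketch: versioned release config; rollback = repoint RELEASE_TAG (shape is an assumption)
import os

RELEASES = {
    "r41": {"model": "model-a", "prompt_version": "p_v11", "retrieval_onoff": False},
    "r42": {"model": "model-a", "prompt_version": "p_v12", "retrieval_onoff": True},
}

def active_release() -> dict:
    # rollback: set RELEASE_TAG=r41 and restart -- no code change required
    return RELEASES[os.environ.get("RELEASE_TAG", "r42")]
```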

If you want to take this setup further into real engineering, the next steps I'd suggest:

  • Persist results to a database (to support per-version comparison)
  • Add schema validation + retries + human spot checks on judge output
  • Add RAG-specific metrics: hit@k / cite_acc (evaluate retrieval before generation)