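A General-Purpose Load-Testing Report Tool for LLMs

Preface

After deploying a large language model and before taking it to production, we need a load-test report that covers the industry-standard metrics and is emitted in a structured form. The rest of this post builds a general-purpose script that produces such a report.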

Core Load-Testing Metrics

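| Category | Metric | Description |
| --- | --- | --- |
| Throughput | Tokens per second (TPS) | Tokens generated per second; measures overall system processing capacity |
| Latency | Time to First Token (TTFT) | Latency to the first token (ms); drives perceived responsiveness |
| | Inter-token Latency | Average delay between consecutive tokens (ms) |
| | End-to-End Latency (E2E) | Total time for a complete request and response (ms) |
| Concurrency | Max Concurrent Requests | Highest concurrency the system sustains stably |
| Resource utilization | GPU Memory Usage, GPU Util%, CPU%, RAM | Supporting data for locating bottlenecks |
| Error rate | Error Rate (%) | Share of failed requests (timeouts, 5xx, etc.) |
| P99/P95 latency | P95 TTFT, P99 E2E | Tail latency; reflects how stable service quality is |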
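The latency and throughput rows above can be made concrete with a small calculation. The following is a minimal sketch, assuming the client has recorded the send time and an arrival timestamp for each generated token (which in practice requires a streaming response). The function names `request_metrics` and `tail_latency` are hypothetical and not part of the script in this post.

```python
from typing import Dict, List

import numpy as np


def request_metrics(start: float, token_times: List[float]) -> Dict[str, float]:
    """Derive per-request metrics from a send time and token arrival times (seconds)."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    n = len(token_times)
    ttft = token_times[0] - start   # Time to First Token
    e2e = token_times[-1] - start   # End-to-End latency
    # Inter-token latency: mean gap between consecutive tokens after the first
    itl = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {
        "ttft_ms": ttft * 1000,
        "inter_token_ms": itl * 1000,
        "e2e_ms": e2e * 1000,
        "tokens_per_sec": n / e2e if e2e > 0 else 0.0,
    }


def tail_latency(values_ms: List[float]) -> Dict[str, float]:
    """P50/P95/P99 of a list of latencies, using NumPy's default interpolation."""
    arr = np.asarray(values_ms, dtype=float)
    return {f"p{q}": float(np.percentile(arr, q)) for q in (50, 95, 99)}
```

Tail percentiles are only meaningful over a reasonably large sample; with a few dozen requests, P99 is effectively the maximum.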

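Design Principles for the Load-Testing Tool

  • Protocol compatibility: supports the OpenAI API format, which mainstream inference backends are compatible with
  • Asynchronous high concurrency: asyncio + aiohttp for efficient concurrent requests
  • Dynamic load: fixed concurrency, stepped ramp-up, and RPS control (the script in this post implements fixed concurrency)
  • Structured results: JSON + Markdown report output
  • Configurable: model endpoint, prompt, concurrency, and so on are set through YAML/JSON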
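Stepped ramp-up, listed among the principles above, could be sketched as follows. This is an illustration under stated assumptions, not part of the script in this post: `ramp_up`, `StageRunner`, `fake_stage`, and both threshold defaults are hypothetical. Each stage is assumed to return an error rate in percent and a P95 end-to-end latency in milliseconds.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# Hypothetical stage runner: takes a concurrency level and returns summary
# stats for that stage, e.g. {"error_rate": 0.5, "p95_e2e_ms": 1800.0}.
StageRunner = Callable[[int], Awaitable[Dict[str, float]]]


async def ramp_up(
    run_stage: StageRunner,
    levels: List[int],
    max_error_rate: float = 1.0,     # percent; illustrative threshold
    max_p95_e2e_ms: float = 5000.0,  # illustrative threshold
) -> Optional[int]:
    """Return the highest concurrency level that met both thresholds, or None."""
    best = None
    for level in sorted(levels):
        stats = await run_stage(level)
        ok = (stats["error_rate"] <= max_error_rate
              and stats["p95_e2e_ms"] <= max_p95_e2e_ms)
        if not ok:
            break  # stop at the first level that violates a threshold
        best = level
    return best


async def fake_stage(concurrency: int) -> Dict[str, float]:
    # Stand-in for a real benchmark stage: latency grows with concurrency
    # and errors appear above 32 concurrent requests.
    await asyncio.sleep(0)
    return {
        "error_rate": 0.0 if concurrency <= 32 else 5.0,
        "p95_e2e_ms": 100.0 * concurrency,
    }
```

Calling `asyncio.run(ramp_up(fake_stage, [8, 16, 32, 64]))` walks the levels in ascending order. A real stage runner would execute one fixed-concurrency benchmark per level and map its report onto these two numbers, which gives an estimate for the "Max Concurrent Requests" metric.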

Complete Load-Testing Code (Python)

# llm_benchmark.py
import asyncio
import time
import json
import argparse
import numpy as np
import pandas as pd
from typing import List, Dict, Any
import aiohttp
from tqdm.asyncio import tqdm
import yaml

class LLMBenchmark:
    def __init__(self, config_path: str):
        with open(config_path, 'r', encoding='utf-8') as f:
            self.config = yaml.safe_load(f)
        self.base_url = self.config['base_url'].rstrip('/')
        self.model = self.config['model']
        self.headers = {"Content-Type": "application/json"}
        if 'api_key' in self.config:
            self.headers["Authorization"] = f"Bearer {self.config['api_key']}"

        # Benchmark parameters
        self.concurrency = self.config['concurrency']
        self.total_requests = self.config['total_requests']
        self.timeout = self.config.get('timeout', 120)
        self.max_tokens = self.config.get('max_tokens', 512)
        self.temperature = self.config.get('temperature', 0.0)

        # Collected per-request results
        self.results: List[Dict] = []

    async def send_request(self, session: aiohttp.ClientSession, prompt: str):
        payload = {
            "model": self.model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
            "stream": False
        }
        start_time = time.time()
        try:
            async with session.post(
                f"{self.base_url}/v1/chat/completions",
                json=payload,
                headers=self.headers,
                timeout=aiohttp.ClientTimeout(total=self.timeout)
            ) as resp:
                if resp.status != 200:
                    error_text = await resp.text()
                    print(f"Error: {resp.status} - {error_text}")
                    return {
                        "success": False,
                        "error": f"HTTP {resp.status}",
                        "e2e_latency": time.time() - start_time
                    }

                data = await resp.json()
                # Stop the clock only after the whole body has been read;
                # entering the `async with` block only means the headers arrived.
                e2e_latency = time.time() - start_time
                prompt_tokens = data['usage']['prompt_tokens']
                completion_tokens = data['usage']['completion_tokens']
                total_tokens = data['usage']['total_tokens']

                # TTFT cannot be measured without streaming; to get it, set
                # stream=True and parse the SSE chunks.
                ttft = None
                tps = completion_tokens / e2e_latency if e2e_latency > 0 else 0

                return {
                    "success": True,
                    "prompt_tokens": prompt_tokens,
                    "completion_tokens": completion_tokens,
                    "total_tokens": total_tokens,
                    "e2e_latency": e2e_latency,
                    "ttft": ttft,  # can be filled in by a streaming implementation
                    "tps": tps,
                    "error": None
                }

        except Exception as e:
            return {
                "success": False,
                "error": str(e) or type(e).__name__,  # some exceptions (e.g. timeouts) have an empty message
                "e2e_latency": time.time() - start_time
            }

    async def worker(self, session: aiohttp.ClientSession, prompts: List[str]):
        for prompt in prompts:
            result = await self.send_request(session, prompt)
            self.results.append(result)

    async def run(self):
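        # Prepare prompts (these could be loaded from a file or generated)
        prompts = self._generate_prompts()
        requests_per_worker = len(prompts) // self.concurrency
        remainder = len(prompts) % self.concurrency
        tasks = []
        async with aiohttp.ClientSession() as session:
            for i in range(self.concurrency):
                # The first `remainder` workers take one extra prompt, so each
                # slice must start after the extras handed out before it.
                start_idx = i * requests_per_worker + min(i, remainder)
                end_idx = start_idx + requests_per_worker
                if i < remainder:
                    end_idx += 1
                task_prompts = prompts[start_idx:end_idx]
                tasks.append(self.worker(session, task_prompts))
            wall_start = time.time()
            await tqdm.gather(*tasks, desc="Running benchmark")
            # Wall-clock duration of the run, for system-level throughput
            self.wall_time = time.time() - wall_start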

    def _generate_prompts(self) -> List[str]:
        # Replace with prompts read from a file or a real dataset as needed
        base_prompt = "Explain the theory of relativity in simple terms."
        return [base_prompt] * self.total_requests

    def generate_report(self) -> Dict[str, Any]:
        df = pd.DataFrame(self.results)
        total_requests = len(df)
        successful = df[df['success'] == True]
        failed = df[df['success'] == False]

        report = {
            "model": self.model,
            "total_requests": total_requests,
            "successful_requests": len(successful),
            "failed_requests": len(failed),
            "error_rate": len(failed) / total_requests * 100,
            "metrics": {}
        }

        if not successful.empty:
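            e2e_latencies = successful['e2e_latency'].values
            tps_values = successful['tps'].values
            completion_tokens = successful['completion_tokens'].values
            # System throughput is total tokens over wall-clock time. Requests
            # overlap, so summing per-request latencies would overstate the
            # elapsed time by roughly the concurrency factor. If run() did not
            # record a wall time, estimate it as sum(latencies) / concurrency.
            wall_time = getattr(self, 'wall_time', None) or float(np.sum(e2e_latencies)) / self.concurrency

            report["metrics"] = {
                "throughput": {
                    "avg_tokens_per_sec": float(np.sum(completion_tokens) / wall_time),
                    "avg_tps_per_request": float(np.mean(tps_values))
                },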
                "latency_ms": {
                    "avg_e2e": float(np.mean(e2e_latencies) * 1000),
                    "p50_e2e": float(np.percentile(e2e_latencies, 50) * 1000),
                    "p95_e2e": float(np.percentile(e2e_latencies, 95) * 1000),
                    "p99_e2e": float(np.percentile(e2e_latencies, 99) * 1000),
                    "max_e2e": float(np.max(e2e_latencies) * 1000)
                },
                "token_stats": {
                    "avg_prompt_tokens": float(successful['prompt_tokens'].mean()),
                    "avg_completion_tokens": float(successful['completion_tokens'].mean()),
                    "avg_total_tokens": float(successful['total_tokens'].mean())
                }
            }

        # Error breakdown
        if not failed.empty:
            report["errors"] = failed['error'].value_counts().to_dict()

        return report

    def save_report(self, output_path: str):
        report = self.generate_report()
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2)
        self._print_markdown_summary(report)

    def _print_markdown_summary(self, report: Dict):
        print("\n" + "="*60)
        print("📊 LLM Load-Test Report Summary")
        print("="*60)
        print(f"- Model: `{report['model']}`")
        print(f"- Total requests: {report['total_requests']}")
        print(f"- Successful requests: {report['successful_requests']}")
        print(f"- Failed requests: {report['failed_requests']}")
        print(f"- Error rate: {report['error_rate']:.2f}%")
        if 'metrics' in report and report['metrics']:
            m = report['metrics']
            print("\n📈 Throughput:")
            print(f"  - System-wide TPS: {m['throughput']['avg_tokens_per_sec']:.2f} tokens/s")
            print(f"  - Mean per-request TPS: {m['throughput']['avg_tps_per_request']:.2f} tokens/s")
            print("\n⏱️ Latency (ms):")
            lat = m['latency_ms']
            print(f"  - Mean E2E: {lat['avg_e2e']:.2f}")
            print(f"  - P95 E2E: {lat['p95_e2e']:.2f}")
            print(f"  - P99 E2E: {lat['p99_e2e']:.2f}")
            print(f"  - Max E2E: {lat['max_e2e']:.2f}")
            print("\n🔤 Token statistics:")
            tok = m['token_stats']
            print(f"  - Mean input tokens: {tok['avg_prompt_tokens']:.1f}")
            print(f"  - Mean output tokens: {tok['avg_completion_tokens']:.1f}")
        print("="*60)


# Example configuration file: config.yaml
"""
base_url: http://localhost:8000
model: qwen3-30b-a3b-instruct-2507
concurrency: 16
total_requests: 200
max_tokens: 512
temperature: 0.0
# api_key: your-api-key-if-needed
"""

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, default="config.yaml", help="path to the configuration file")
    parser.add_argument("--output", type=str, default="benchmark_report.json", help="path of the output report")
    args = parser.parse_args()

    bench = LLMBenchmark(args.config)
    asyncio.run(bench.run())
    bench.save_report(args.output)
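
The script above leaves `ttft` as `None` because it sends `"stream": False`. A minimal sketch of the streaming variant hinted at in its comment follows, assuming the server emits OpenAI-style SSE lines (`data: {...}` chunks ending with `data: [DONE]`). The helper name `measure_stream` and its `clock` parameter are hypothetical. It counts content-bearing chunks, which need not equal the token count.

```python
import json
import time
from typing import AsyncIterator, Callable, Dict, Optional


async def measure_stream(
    lines: AsyncIterator[bytes],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume OpenAI-style SSE lines; return TTFT and E2E latency in seconds.

    `start` must come from the same clock, read just before the request is sent.
    """
    ttft = None
    chunks = 0
    async for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        if delta.get("content"):
            chunks += 1
            if ttft is None:
                ttft = clock() - start  # first chunk that carries content
    return {"ttft": ttft, "e2e_latency": clock() - start, "content_chunks": chunks}
```

With aiohttp, the response body `resp.content` can be iterated line by line with `async for`, so inside `send_request` one could set `"stream": True`, read `start = time.perf_counter()` before `session.post(...)`, and call `await measure_stream(resp.content, start)`. Non-streaming `usage` fields may not be present in streamed chunks, so token counts would need separate handling.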

Usage

1. Create the configuration file config.yaml
base_url: http://your-llm-server:8000
model: qwen3-30b-a3b-instruct-2507
concurrency: 32
total_requests: 500
max_tokens: 256
temperature: 0.0
2. Run the benchmark
python llm_benchmark.py --config config.yaml --output qwen3_30b_report.json
3. Output
  • qwen3_30b_report.json: the complete structured data, suitable for automated analysis
  • A Markdown-style summary printed to the console, convenient for reporting
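As one way to use the JSON file for automated analysis, the sketch below renders a Markdown table from the report and checks it against pass/fail thresholds. It relies only on keys written by `generate_report()` (`model`, `total_requests`, `error_rate`, `metrics.latency_ms`). The function names and the threshold defaults are hypothetical.

```python
import json
from typing import Any, Dict, List


def load_report(path: str) -> Dict[str, Any]:
    """Read a report written by save_report()."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def report_to_markdown(report: Dict[str, Any]) -> str:
    """Render the headline numbers of a report as a Markdown table."""
    rows = [
        ("Model", report["model"]),
        ("Total requests", report["total_requests"]),
        ("Error rate (%)", f"{report['error_rate']:.2f}"),
    ]
    latency = report.get("metrics", {}).get("latency_ms", {})
    for key in ("avg_e2e", "p95_e2e", "p99_e2e"):
        if key in latency:
            rows.append((f"{key} (ms)", f"{latency[key]:.2f}"))
    lines = ["| Metric | Value |", "| --- | --- |"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    return "\n".join(lines)


def check_thresholds(
    report: Dict[str, Any],
    max_error_rate: float = 1.0,        # percent; illustrative default
    max_p99_e2e_ms: float = 10_000.0,   # illustrative default
) -> List[str]:
    """Return a description of each violated threshold (empty list means pass)."""
    violations = []
    if report["error_rate"] > max_error_rate:
        violations.append(f"error_rate {report['error_rate']:.2f}% > {max_error_rate:.2f}%")
    p99 = report.get("metrics", {}).get("latency_ms", {}).get("p99_e2e")
    if p99 is not None and p99 > max_p99_e2e_ms:
        violations.append(f"p99_e2e {p99:.2f} ms > {max_p99_e2e_ms:.2f} ms")
    return violations
```

For example, `print(report_to_markdown(load_report("qwen3_30b_report.json")))` produces a table that can be pasted into a write-up, and a CI job could fail when `check_thresholds(...)` returns a non-empty list.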