Key Benchmarks for Evaluating the Capabilities of Large Language Models (LLMs)

文军的烹饪实验室, 2025-06-13 14:13

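These benchmarks span multiple domains (such as language understanding, mathematics, reasoning, programming, and medicine) and multiple capability dimensions (such as fact retrieval, computation, code generation, chain-of-thought reasoning, and multilingual processing). They are commonly used for side-by-side comparisons when a model is released, for example in the papers or technical reports for models such as GPT-4, Claude, Gemini, and Mistral.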

Benchmark	Description	Purpose	Links	License
MMLU	Massive Multitask Language Understanding	Tests model performance on exam-style questions across many subjects (e.g. history, law, medicine)	https://arxiv.org/abs/2009.03300, https://github.com/hendrycks/test	MIT License
MATH	Mathematical Problem Solving	Tests the ability to solve competition-level mathematics problems	https://arxiv.org/abs/2103.03874, https://github.com/hendrycks/math	MIT License
GPQA	Graduate-level, Google-proof Q&A	Hard, expert-written questions in biology, physics, and chemistry that cannot be answered with a search engine	https://arxiv.org/abs/2311.12022, https://github.com/idavidrein/gpqa/	MIT License
DROP	Discrete Reasoning Over Paragraphs	Reading comprehension test that emphasizes numerical operations, reasoning, and combining information	https://arxiv.org/abs/1903.00161, https://allenai.org/data/drop	Apache 2.0
MGSM	Multilingual Grade School Math	Grade-school math problems in multiple languages; tests chain-of-thought reasoning	https://arxiv.org/abs/2210.03057, https://github.com/google-research/url-nlp	CC-BY 4.0
HumanEval	Code Generation and Evaluation	Tests code generation and functional correctness on Python programming problems	https://arxiv.org/abs/2107.03374, https://github.com/openai/human-eval	MIT License
SimpleQA	Short-form Factuality Benchmark	Tests accuracy on simple factual questions (e.g. "How far is the Earth from the Sun?")	https://openai.com/index/introducing-simpleqa	MIT License
BrowseComp	Web-based Browsing Agent Task	Tests how well agents that can browse the web perform in task scenarios	https://openai.com/index/browsecomp	MIT License
HealthBench	Health-related LLM Evaluation	Evaluates model capabilities in medical and health scenarios, with emphasis on factual accuracy and safety	https://openai.com/index/healthbench	MIT License
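
Most of the benchmarks above report a simple accuracy, but HumanEval results are usually reported as pass@k: the probability that at least one of k sampled solutions passes the problem's unit tests. The HumanEval paper (arXiv:2107.03374) estimates this from n samples per problem, of which c pass, as 1 - C(n-c, k) / C(n, k). The snippet below is a minimal sketch of that estimator, not the reference implementation; the function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input format are assumptions made for illustration.

```python
import math
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that passed the unit tests
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - (ways to draw k failing samples) / (ways to draw any k samples)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given hypothetical (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return sum(scores) / len(scores)
```

For instance, `pass_at_k(10, 5, 1)` returns 0.5: with half of ten samples passing, a single draw succeeds half the time. Averaging the per-problem estimates, as `mean_pass_at_k` does, gives the benchmark-level score.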