Data Processing Analysis: LIC 2025 Language and Intelligence Technology Competition, BAAI Track 2

I. Competition Overview

1. Dataset Overview

Dataset link: 2025 LIC BAAI Track 2 sample dataset (Datasets, PaddlePaddle AI Studio Galaxy Community)

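This track designates the OpenSeek dataset, built and open-sourced by the Beijing Academy of Artificial Intelligence (BAAI), as its raw data. The dataset aims to activate and improve the complex reasoning abilities of large language models by synthesizing high-quality reasoning data. Its core idea is to automatically extract and generate data containing explicit reasoning processes from massive raw corpora.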

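Core domains: the dataset mainly covers mathematics, code, and general knowledge, all areas that demand complex reasoning.

Data structure: through its synthesis pipeline, OpenSeek turns each original document (`raw`) into a structured sample. Each sample mainly contains:
- instruction: the core question distilled from the original document.
- Chain-of-thought: the reasoning chain formed by analyzing, summarizing, and breaking the original document down step by step. This is the main reference for the data transformation stage of the competition.
- text: the final synthetic data used for model training, which combines the question, the chain of thought, and the answer.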
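To make the field layout concrete, here is a minimal sketch of how one sample could be represented and how `text` might be assembled from the other parts. The `build_text` helper is hypothetical, and the section labels are assumptions modeled on the example shown in this post; this is not OpenSeek's actual pipeline code.

```python
def build_text(instruction: str, chain_of_thought: str, answer: str) -> str:
    """Join the three parts into one training string (assumed layout)."""
    return (
        f"Instruction: {instruction}\n"
        f"Chain-of-Thought: {chain_of_thought}\n"
        f"Detailed Answer: {answer}"
    )


# Placeholder values; a real sample carries the full document and reasoning.
sample = {
    "id": "0",
    "raw": "Original source document ...",
    "instruction": "Generate a passage that ...",
    "Chain-of-thought": "Step 1: ... Step 2: ...",
}
sample["text"] = build_text(
    sample["instruction"], sample["Chain-of-thought"], "Final passage ..."
)
```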

2. Dataset Example

The following concrete example shows what each field in the dataset contains:

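| Field | Example content |
| --- | --- |
| id | `33442613` |
| raw | A detailed article on weight gain caused by psychiatric medications. It cites the views of several doctors and discusses the severity of the problem, prevention strategies, risk stratification, and the specific effects of different drugs. |
| instruction | `Generate a passage that discusses the prevention and mitigation of weight gain associated with psychiatric medications, ensuring effective treatment without compromising patients' psychological and physical health...` |
| Chain-of-thought | `Step 1: Introduce the Problem... Step 2: Discuss the Occurrence of Weight Gain... Step 3: Prevention Strategies...` (A detailed step-by-step breakdown that guides how to build the final answer text. It has 7 steps, including introducing the problem, discussing how the weight gain occurs, prevention strategies, mitigation methods, and risk stratification.) |
| text | `Instruction: Generate a passage that... Chain-of-Thought: Step 1... Detailed Answer: Weight Gain and Psychiatric Medications: A Preventable Concern...` (This field combines `instruction` and `Chain-of-thought` and ends with a clearly structured training sample whose content closely matches the `raw` text.) |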

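II. Processing Approach

According to the first-place team's write-up, the first step is data cleaning: pick out the useful, high-value data instead of relying on brute-force scale. After cleaning, only 14.3% of the data was kept as genuinely usable.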

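| Key statistic | Value |
| --- | --- |
| Raw samples | 1,475,601 |
| Samples after cleaning | 211,490 |
| Base model | ERNIE-4.5-300B-A47B-PT |
| Training script | ERNIE-4.5-21B-A3B-SFT |
| Inference hardware | 4 nodes × 8 × A100 80 GB |
| Inference framework | vLLM 0.10.0 |
| Evaluation script | Qwen2.5-Math-Evaluation (23 tasks) |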
1. Processing Pipeline

2. Filtering Genuine Math Problems

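ERNIE-4.5 is used as the base model for data generation. The vLLM REST API is called from 50 concurrent threads, and every candidate problem goes through a preliminary check plus four progressively stricter checks, stopping at the first failure:

- Step 0: Is this purely a math problem?
- Step 1: Are the spelling, grammar, and LaTeX valid?
- Step 2: Does any minimal condition violate common sense?
- Step 3: Do the conditions contradict one another?
- Step 4: Are the given conditions sufficient to derive the answer?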
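The 50-way concurrency can be sketched with a standard thread pool. This is a minimal illustration, not the team's code: `judge_all` and `fake_judge` are hypothetical names, and `fake_judge` is a local stand-in for the HTTP request that would actually be sent to the vLLM server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def judge_all(
    questions: Iterable[str],
    judge_one: Callable[[str], dict],
    max_workers: int = 50,
) -> List[dict]:
    """Run judge_one over all questions in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_one, questions))


def fake_judge(question: str) -> dict:
    # Stand-in for the real model call: treats any text with a digit as math.
    ok = any(ch.isdigit() for ch in question)
    return {"judgement_test": ok, "error_type": None if ok else "not a math problem"}


results = judge_all(["What is 2 + 3?", "Rewrite this sentence."], fake_judge)
```

Threads suit this workload because each worker spends most of its time waiting on the network rather than computing.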

2.1 Prompt

```python
# `question` holds the problem text to evaluate and is supplied by the caller.
prompt = f"""You are given a mathematical problem. Follow these four steps in order and stop at the first failure:
        0. Firstly check if it is only a math problem, if it has other instruction confused the model such as "rewrite" or has answer or other strange instruction, then judged as failure. If it is not a math problem, then the judgement_test is false.
        1. Check only for spelling, grammar, and LaTeX formatting correctness. Do not interpret semantic meaning.
        2. For each minimal condition stated in the problem (that cannot be further decomposed), check if it violates the mathematical domain or objective facts (for example, 'half a person' is incorrect). Note: Magical operations are acceptable if the necessary assumption is explicitly stated. Average values (e.g., 15.5 items per minute) are acceptable.
        3. Check whether the problem-solving process contains any contradictions. This includes any two minimal conditions contradicting each other or if the final solution would be unreasonable (including unsolvable).
        4. If the steps above pass, check if there are enough conditions provided in the problem to answer the target question. Redundant conditions that do not affect the problem - solving process are considered reasonable. Both analytical and numerical solutions are considered valid unless otherwise specified.

        After performing these steps in sequence, output your final judgment in JSON format with exactly the following keys:
        {{
            "judgement_test": true/false,
            "error_type": "<error description or null>"
        }}
        You may include your chain-of-thought, but the final answer must be the JSON object above.

        Here is the problem to evaluate:
        -------------------------------
        {question}
        -------------------------------
        """
```

2.2 Code

```python
# Excerpt from inside the pipeline class; `llm_serving` is the serving object passed in.
self.question_filter_step3 = QuestionFilter(
    system_prompt="You are an expert in evaluating mathematical problems. Follow the user's instructions strictly and output your final judgment in the required JSON format.",
    llm_serving=llm_serving
)
```
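Because the prompt allows chain-of-thought before the final JSON object, the raw model output has to be parsed defensively. The sketch below assumes the response arrives as plain text. `extract_judgement` is a hypothetical helper, not part of the original pipeline or of any library used there.

```python
import json
from typing import Optional


def extract_judgement(response: str) -> Optional[dict]:
    """Return the last JSON object in the response that has the expected key."""
    decoder = json.JSONDecoder()
    found = None
    for start, ch in enumerate(response):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start`.
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            continue  # a brace that does not open valid JSON, e.g. LaTeX
        if isinstance(obj, dict) and "judgement_test" in obj:
            found = obj
    return found


reply = 'Step 0 passes. The set {x} is fine.\n{"judgement_test": true, "error_type": null}'
verdict = extract_judgement(reply)
```

Scanning for the last matching object, rather than the first brace, keeps stray braces in the reasoning text from being mistaken for the verdict.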

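3. Scoring Problem Difficulty from 0 to 10

The prompt asks the model to act "like a seasoned math teacher" and assign a score from 0 to 10, so that later sampling or curriculum learning can pick problems directly by difficulty tier.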
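Picking problems by tier could look like the sketch below. The cut points (3 and 7), the tier names, and the `difficulty` field name are assumptions for illustration; the source does not say how the tiers were defined.

```python
from collections import defaultdict


def bucket_by_difficulty(items, edges=(3, 7)):
    """Group items into easy / medium / hard tiers using assumed cut points."""
    buckets = defaultdict(list)
    for item in items:
        score = item["difficulty"]
        if score <= edges[0]:
            buckets["easy"].append(item)
        elif score <= edges[1]:
            buckets["medium"].append(item)
        else:
            buckets["hard"].append(item)
    return buckets


problems = [
    {"q": "olympiad inequality", "difficulty": 9},
    {"q": "1 + 1", "difficulty": 1},
    {"q": "definite integral", "difficulty": 5},
]
buckets = bucket_by_difficulty(problems)
# A simple curriculum: present easier tiers before harder ones.
curriculum = buckets["easy"] + buckets["medium"] + buckets["hard"]
```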

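4. Seven Math Category Labels

The same model makes a second pass to classify each problem as Algebra / Geometry / NumberTheory / Combinatorics / Probability / Calculus / Others. With F1 ≈ 0.94, the labels are reliable enough to support fine-grained data-mixing experiments.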
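A data-mixing experiment of this kind amounts to drawing samples per label according to target ratios. This is a minimal sketch; the `sample_by_ratio` helper, the `label` field name, and the ratios are all illustrative assumptions rather than the team's settings.

```python
import random


def sample_by_ratio(items, ratios, total, seed=0):
    """Draw about `total` items so each label gets its share from `ratios`."""
    rng = random.Random(seed)
    by_label = {}
    for item in items:
        by_label.setdefault(item["label"], []).append(item)
    picked = []
    for label, share in ratios.items():
        pool = by_label.get(label, [])
        # Never ask for more items than the label actually has.
        k = min(len(pool), round(total * share))
        picked.extend(rng.sample(pool, k))
    return picked


pool = [{"label": "Algebra", "id": i} for i in range(10)]
pool += [{"label": "Geometry", "id": i} for i in range(10)]
mix = sample_by_ratio(pool, {"Algebra": 0.75, "Geometry": 0.25}, total=8)
```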

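5. Answer Generation and CoT

Average length increases by up to 4x.

With a reference answer → AnswerGenerator. The model is asked to answer the same question several times, and majority voting picks the most frequent answer as the pseudo-answer. This produces high-quality chains of thought averaging 800+ tokens in a single pass, roughly four times the average of just over 200 tokens in the original OpenSeek Math dataset.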
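Majority voting itself is simple to express. The sketch below assumes the final answers have already been extracted from each response as strings. `majority_vote` is a hypothetical helper, and the whitespace stripping is a simplifying assumption about normalization.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the share of votes it received."""
    normalized = [a.strip() for a in answers if a and a.strip()]
    if not normalized:
        return None, 0.0
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


pseudo_answer, agreement = majority_vote(["42", "42 ", "41", "42"])
```

The agreement share can double as a confidence signal, for instance to drop questions where no answer wins a clear majority.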

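6. LaTeX Format Unification, Length Filtering, and Deduplication

A single regular expression, r'\\boxed\{.*\}', acts as the gatekeeper for LaTeX formatting (the backslash and braces are escaped here, because an unescaped \b is a word-boundary assertion in Python regex):

- \boxed must appear exactly once;
- an answer in which \boxed shows up anywhere else is sent straight back for rewriting.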
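The "exactly once" rule can be checked by counting opening `\boxed{` markers. This is a simplified sketch rather than the author's exact regex: counting only the opening marker sidesteps nested braces such as `\boxed{\frac{1}{2}}`. `has_single_boxed` is a hypothetical helper.

```python
import re

# Matches the literal text "\boxed{"; the backslash and brace are escaped.
BOXED_OPEN = re.compile(r"\\boxed\{")


def has_single_boxed(answer: str) -> bool:
    """True when the answer contains exactly one boxed marker."""
    return len(BOXED_OPEN.findall(answer)) == 1


good = has_single_boxed(r"So the result is \boxed{\frac{1}{2}}.")
bad = has_single_boxed(r"Either \boxed{1} or \boxed{2}.")
```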

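Length filtering: only samples of 8 to 8192 tokens are kept.

Deduplication: 5-gram cosine similarity is computed with a threshold range of 0.1 to 1.0, and among template answers that are semantically near-identical only one representative is kept. At this point, 211,490 "small but refined" math problems are ready for training.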
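The similarity measure can be sketched as cosine similarity over 5-gram count vectors. Several details here are assumptions: word-level n-grams from whitespace tokenization, a single illustrative cutoff of 0.9, and a greedy pairwise comparison that would be too slow for the full corpus without an index.

```python
import math
from collections import Counter


def ngram_counts(text, n=5):
    """Count word-level n-grams in the text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedup(texts, threshold=0.9):
    """Keep a text only if it is not too similar to any text already kept."""
    kept, kept_vecs = [], []
    for text in texts:
        vec = ngram_counts(text)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept


a = "the answer is obtained by adding the two numbers together"
c = "a triangle has three sides and three interior angles in total"
unique = dedup([a, a, c])
```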

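III. Training with ErnieKit

ErnieKit is PaddlePaddle's training framework built specifically for ERNIE. We reuse the official script baidu/ernie/ERNIE/examples/scripts/ERNIE-4.5-21B-A3B/sft/run_sft_8k.sh directly. Its contents are as follows: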

```bash
cd baidu/ernie/ERNIE  # make sure this path is correct
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 erniekit train examples/configs/ERNIE-4.5-21B-A3B/sft/run_sft_8k.yaml
```

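We use full-parameter fine-tuning. ERNIE-4.5-21B-A3B has 20 attention heads, so we fine-tune on 4 A100 80 GB GPUs (presumably because the head count has to divide evenly across the GPUs, and 20 is divisible by 4 but not by 8). To ensure fairness, only two training settings were changed: the path to the training data and the path to the model.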

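IV. Considerations for Dataset Processing

A good dataset is the foundation of a correct model. It is a first-principles matter and deserves serious attention.
For any given dataset, think it through carefully and make sure its quality is high.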