[From Scratch] 14. Data Scoring and Filtering

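Picking up where we left off: in the previous chapter we finished data augmentation for the "non-Q&A" data, which added 2 million records to the store. The next step would normally be "data distillation", extracting vertical-domain data from commercial large models, but that would clearly go against the "zero cost" premise.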

So what about deploying a high-parameter open-source model locally?

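If I had a high-end GPU such as a 4090 or 5090 from the start, that would indeed be a good option. Unfortunately I don't have that kind of hardware, so data distillation is on hold for now; if the data turns out to be insufficient later, I will fill it in batch by batch.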

OK, the store now holds roughly 2.5 million+ records. The next job is to "purify" this batch of data.

Purification is done in 3 passes:

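The first pass analyzes and refines the semantic clarity of each Q&A pair (whether the wording fails to convey the meaning, whether it is unintelligible, whether it is ambiguous), evaluating completeness, accuracy and logic;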

The second pass refines the Q&A pairs for professionalism (the target is a vertical domain, so professionalism has to be evaluated);

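It is worth noting that public datasets and augmented data are not necessarily all related to the vertical domain, and some of the content touches on legal and licensing matters; all of this has to be dealt with. For example, this time I am building an NLP model for Chinese herbal medicine, so under the relevant national regulations the data processing must steer clear of "medical treatment" content (if someone took the wrong medicine, or did the wrong thing, because of a hint in the model's output and died, that is a responsibility I could not bear). Next, pharmacology, ingredients and similar topics need careful review, removing data about toxic and narcotic drugs and categories. Furthermore, formula combinations taken from classical texts call for caution: not every ancient formula has been scientifically validated, and people's constitutions differ, so the effects differ too. This has to be weighed case by case and is not purely a data-level matter.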

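So "professional purification" is no simple job. In a company it is typically a "top-leader project": a senior leader heads a team of people with the relevant professional qualifications, and it takes multiple rounds of manual review to complete. Right now I am a one-person team, so all I can say is that I will do as much of this pass as I can, and when the model produces output it will carry the notice "For learning, sharing and research only; not to be taken as any diagnosis or treatment advice".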

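The third pass refines the Q&A pairs for safety. It takes place "after training, before release": an adversarial script puts multiple rounds of questions to the model, the outputs are collected, and a high-parameter model performs a "content diagnosis" on them. If problems are found, the training dataset is adjusted and the model is retrained, and so on until the "qualified" standard is met. This "content diagnosis" differs from judging a trained model by its loss: it looks directly at whether the output content meets the requirements. To be clear, "safety" here means whether the content could mislead users or contains hazardous procedures (such as the steps for processing medicinal materials).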
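The third-pass loop can be sketched roughly as follows. This is only an illustration under assumptions: `ask_model`, `diagnose` and `retrain` are hypothetical callables standing in for the fine-tuned model, the high-parameter judge model and the "adjust the dataset and retrain" step. None of them come from the original project, and `max_rounds` is an assumed safeguard against looping forever.

```python
from typing import Callable, List, Tuple


def safety_review_loop(
    prompts: List[str],
    ask_model: Callable[[str], str],        # hypothetical: queries the trained model
    diagnose: Callable[[str, str], bool],   # hypothetical: judge model, True = output acceptable
    retrain: Callable[[List[Tuple[str, str]]], None],  # hypothetical: adjust data and retrain
    max_rounds: int = 3,                    # assumed cap on review rounds
) -> bool:
    """Return True once a full round of adversarial prompts passes diagnosis."""
    for _ in range(max_rounds):
        # Collect every (prompt, output) pair that the judge flags as problematic
        failures = []
        for prompt in prompts:
            output = ask_model(prompt)
            if not diagnose(prompt, output):
                failures.append((prompt, output))
        if not failures:
            return True  # "qualified": nothing was flagged in this round
        retrain(failures)  # fix the training set, retrain, then test again
    return False
```

Passing the three steps in as callables keeps the loop itself independent of whichever model API and training pipeline end up being used.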

So the question is: how do we evaluate the data? What is the standard? And can I do it even though I am not a professional?

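Personally I think the most direct form of evaluation is a quantitative score. As for the standard, how would I know, not being a professional? But it doesn't matter that I don't know, as long as the large model does. By the same token, the scoring standard is simply the prompt I write.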

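So in the end it comes down to this: write an automation script that iterates over all Q&A pairs, has 3 large models score each pair, and takes the mean as the final score (each model works differently, so scoring with a single model would be somewhat biased; using 3 models at once should reduce that bias).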

This step mainly involves two Python scripts; the pseudocode is as follows:


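import math
import random
import re
import threading
import time

# logger, CU, self.elastic and self.api are the project's own helpers; their definitions are elided here


class ScoreAndFilterData: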
    
    ...
    
    def search_for_score(self):
        # Query elasticsearch for records whose process_status field is 0
        search_not_ready = {
            "size": 1,
            "query": {
                "term": {
                    "process_status": {
                        "value": 0
                    }
                }
            }
        }

        # Process in batches
        results = self.elastic.find_by_body(name=CU.TMP_ES_INDEX, body=search_not_ready)
        batch_count = 1
        while len(results) >0:
            start_time = time.time()

            # Call the free model API from "SiliconFlow" (硅基流动) for scoring
            update_entity = self._thread_to_silicon_get_score(results[0])

            # _id of the Q&A pair's record in elasticsearch; read it before branching
            # so that the failure branch can also log and update it
            es_id = update_entity["id"]
            if "qwen" in update_entity and "glm" in update_entity and "ds" in update_entity:

                # Scores from all three models ("qwen", "glm" and "ds") are present, so update_entity can be parsed
                logger.info(f"batch {batch_count} scored use time: {math.ceil(time.time() - start_time)}s.")

                # Compute the average score
                avg_score = round((int(update_entity["qwen"]) + int(update_entity["glm"]) + int(update_entity["ds"]))/3, 2)
                
                # Assemble the update JSON body
                update_data = {
                    "doc": {
                        "process_status": 1,
                        "qwen_score": update_entity["qwen"],
                        "glm_score": update_entity["glm"],
                        "ds_score": update_entity["ds"],
                        "avg_score": avg_score
                    }
                }
                logger.info(f"{es_id} update success...use time: {math.ceil(time.time() - start_time)}s.")
            else:
                # Records that could not be scored get a new status "2" (so the loop never gets stuck on the same record; they can be revisited later)
                update_data = {
                    "doc": {
                        "process_status": 2
                    }
                }
                logger.info(f"{es_id} update fail...use time: {math.ceil(time.time() - start_time)}s.")   
            
            # Update the index record
            self.elastic.update(name=CU.TMP_ES_INDEX, data=update_data, id=es_id)
            batch_count += 1 
            
            # Query again to prepare for the next iteration
            results = self.elastic.find_by_body(name=CU.TMP_ES_INDEX, body=search_not_ready)
            
    def _thread_to_silicon_get_score(self, result):

        # Get the id and the Q&A pair string
        id = result["_id"]
        gather_text = result["_source"]["gather_text"]
        
        start_time = time.time()
        
        resp_array,resp_threads = [],[]

        # Use multiple threads to send scoring requests to the "qwen", "glm" and "ds" models at the same time
        for key, params in self.content_scores.items():
            _thread = threading.Thread(
                target=self._get_and_check_digit_return,
                args=(gather_text, key, params, resp_array),
                daemon=True
                )
            resp_threads.append(_thread)
            _thread.start()
        # Wait for all scores for this record to come back
        for _thread in resp_threads:
            _thread.join()
        
        logger.info(f"{id}(scored),use time:{math.ceil(time.time() - start_time)}s.")    
        time.sleep(random.randint(2,5))    

        # Collect the replies and assemble the JSON to return
        resp_entity = {"id": id}
        if resp_array:
            for resp in resp_array:
                resp_entity.update(resp)
        return resp_entity
    
    def _get_and_check_digit_return(self, gather_text, key, params, resp_array):
        
        # Retry counter for failures
        digit_batch_count = 0
        
        # Allow up to 3 attempts; if all 3 fail, return without a score
        while digit_batch_count < 3:
            try:
                start_time = time.time()
                score = self.api.chat_with_sync(params, CU.get_score_prompts(gather_text))
                
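                # Strip whitespace, punctuation and symbols from the reply
                # (letters and CJK characters are kept, so a non-numeric reply fails the check below)
                score = re.sub(r'[^\w\u4e00-\u9fff]', '', score)
                
                # Check that what remains is a number within the 0-10 range the prompt asks for
                if score.isdigit() and 0 <= int(score) <= 10: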
                    logger.info(f"LLM({key}) score:{score},use time:{math.ceil(time.time() - start_time)}s.")
                    resp_array.append({key: int(score)})
                    time.sleep(1)
                    break
                else:
                    logger.info(f"Not digit detected,Next round to fix it.")
                    time.sleep(random.randint(5,10))
                    digit_batch_count += 1
            except Exception:
                logger.info(f"Exception detected,Next round to fix it.")
                time.sleep(random.randint(5,10))
                digit_batch_count += 1
        
if __name__ == "__main__":
    s = ScoreAndFilterData()
    s.search_for_score()

The code above uses the get_score_prompts function to assemble the model prompt. The prompt is as follows:


def get_score_prompts(qa_content):
    return f"""
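        I will provide a "Q&A pair" (a question and an answer) from the field of traditional Chinese medicine.
        Your task:
        1. Assess quality based only on the completeness, accuracy, logic and professionalism of the Q&A pair.
        2. Give a **score between 0 and 10** (10 means extremely high quality, 0 means extremely low quality).
        3. Return only a single Arabic numeral; do not output any explanation or other content.

        The Q&A pair is as follows:
        
        {qa_content}

        Output a single integer from 0 to 10 directly, with no explanation or symbols.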
    """

The prompt is a bit simplistic, but it gets the job done.
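Models do not always obey "return only a number", so the reply needs validating before it is stored. The snippet below is a minimal sketch of a stricter check than the one in the scoring script; `parse_score` is a hypothetical helper, not part of the original project. It accepts a reply only when it is a bare integer in the 0 to 10 range that the prompt asks for.

```python
import re
from typing import Optional


def parse_score(reply: str) -> Optional[int]:
    """Return the score as an int in [0, 10], or None if the reply is not usable."""
    # Accept only a reply that is a single integer, optionally wrapped in whitespace
    match = re.fullmatch(r"\s*(\d{1,2})\s*", reply)
    if match is None:
        return None
    score = int(match.group(1))
    # Reject values outside the range the prompt asks for
    return score if 0 <= score <= 10 else None
```

A `None` result would be treated like any other failed attempt, that is, counted against the retry limit.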

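OK, the main program is done. It still needs to be launched under a timer-style watchdog (there is no guarantee that 2.5 million+ records can be finished in a single run). Hence the following code: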


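import argparse
import subprocess
import sys
import time
from pathlib import Path

# logger is the project's own logging helper; its setup is elided here


class CheckByFileModify: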
    
    ...

    def start_script(self):
        # Before launching the python script, check for a previous process and kill it if there is one
        if self.process:
            self.kill_process()

        logger.info(f"Starting script: {self.script_path}")
        try:
            # Use sys.executable to launch the specified python script
            self.process = subprocess.Popen([
                sys.executable,
                str(self.script_path)
            ])
            logger.info(f"Process started, PID: {self.process.pid}")

            # The log file is monitored for updates, so first check that it exists
            if self.file_path.exists():
                
                # Record the log file's latest modification time
                self.last_modified_time = self.file_path.stat().st_mtime

        except Exception as e:
            logger.error(f"Failed to start script: {e}")
            self.process = None

    def kill_process(self):

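        # Only act if a process exists and is still running (poll() returns None)
        if self.process and self.process.poll() is None:
            logger.info(f"Terminating process PID: {self.process.pid}")
            try:
                
                # Send the terminate signal (the graceful attempt)
                self.process.terminate()
                
                # Wait for the process to exit
                self.process.wait(timeout=5)
                
            except subprocess.TimeoutExpired:
                
                # If it still has not exited after 5 seconds, kill it outright
                logger.error("Process not responding, force killing")
                self.process.kill()
                self.process.wait()
            except Exception as e:
                # Log the error raised while terminating
                logger.error(f"Error while terminating process: {e}")
                
        # Reset the process variable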
        self.process = None

    def check_file_activity(self):
        
        try:

            # First check whether the log file exists; if not, return immediately
            if not self.file_path.exists():
                logger.info(f"File does not exist: {self.file_path}")
                return False

            # Get the file's latest modification time
            current_modified_time = self.file_path.stat().st_mtime
            
            # Compare it with the modification time recorded in memory to decide whether the process is running normally
            has_activity = current_modified_time > self.last_modified_time
            if has_activity:
                logger.info(f"File update detected: {self.file_path}")
                # The process is running normally, so record the new modification time
                self.last_modified_time = current_modified_time
            return has_activity
        except Exception as e:
            logger.error(f"Error while checking file: {e}")
            return False

    def monitor(self):

        # Run once when the program starts
        self.start_script()
        try:
            # Loop and keep monitoring
            while True:
                time.sleep(self.check_interval)

                if self.process and self.process.poll() is not None:
                    logger.info(f"Process has exited (exit code: {self.process.returncode})")
                    self.start_script()
                    continue

                if not self.check_file_activity():
                    logger.info("No output for a long time, restarting process...")
                    self.start_script()

        except KeyboardInterrupt:
            logger.error("\nInterrupt received, cleaning up...")
            self.kill_process()

if __name__ == "__main__":

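    # Arguments are passed on the command line so the script is easy to launch from a shell script
    # The rest needs no explanation; it should be self-explanatory
    parser = argparse.ArgumentParser(description='Monitor a file for output and restart the process automatically')
    parser.add_argument('file_path', help='Path of the file to monitor')
    parser.add_argument('script_path', help='Path of the Python script to restart')
    parser.add_argument('check_interval', type=int, help='Check interval (seconds)')

    args = parser.parse_args()

    if not Path(args.script_path).exists():
        logger.error(f"Error: script file does not exist - {args.script_path}")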
        sys.exit(1)

    monitor = CheckByFileModify(
        file_path=args.file_path,
        check_interval=args.check_interval,
        script_path=args.script_path
    )

    monitor.monitor()

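Because my code is bound to write log output whenever it is running normally, the log file is bound to be updated; and if the file is being updated, the script must be running. That is why I wrote this kind of monitor to keep it going 7 x 24.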
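The "log file modification time as heartbeat" idea can be tried in isolation. The following is a small sketch on a throwaway file; `has_new_output` is a hypothetical stand-alone function, and `os.utime` is used only to simulate a later log write without actually waiting.

```python
import os
import tempfile
from pathlib import Path


def has_new_output(log_path: Path, last_mtime: float) -> bool:
    """True if the log file was modified after the previously recorded mtime."""
    return log_path.exists() and log_path.stat().st_mtime > last_mtime


# Demo on a temporary file: record the mtime, then simulate a later log write
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "demo.log"
    log.write_text("batch 1 scored\n")
    recorded = log.stat().st_mtime
    idle = has_new_output(log, recorded)            # nothing written since the recording
    os.utime(log, (recorded + 60, recorded + 60))   # pretend a write happened 60s later
    active = has_new_output(log, recorded)
```

One consequence of this design is that the check interval has to be longer than the slowest gap between two log lines, otherwise a healthy but slow run would be restarted.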

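As for the deletion based on the average-score field, I won't paste that code here; if you are interested, you can look it up in my gitee or github repository (it simply aggregates the "average score" field of all records into one overall average, and records below the cut-off line are deleted outright; there is not much more to say about it).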
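For readers who do not want to open the repository, the filtering step might look roughly like the sketch below. It is not the author's code: `overall_average` and `delete_below_body` are hypothetical helpers, and using the overall mean itself as the cut-off line is an assumption based on the description above. The second function only builds a request body in the shape accepted by Elasticsearch's delete-by-query API; sending it is left to whatever client wrapper is in use.

```python
from statistics import mean
from typing import Dict, List


def overall_average(avg_scores: List[float]) -> float:
    """Mean of every record's avg_score, used here as the cut-off line (assumption)."""
    return round(mean(avg_scores), 2)


def delete_below_body(cutoff: float) -> Dict:
    """Request body for a delete-by-query: scored records under the cut-off."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"process_status": 1}},            # only records that were scored
                    {"range": {"avg_score": {"lt": cutoff}}},   # strictly below the line
                ]
            }
        }
    }


# Illustrative values only, not real scores from the dataset
cutoff = overall_average([8.0, 6.33, 9.0, 4.67])
body = delete_below_body(cutoff)
```

With millions of records, the mean would more sensibly come from an `avg` aggregation on `avg_score` inside Elasticsearch than from loading every value into Python.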

gitee: gitee.com/yzh0623/bra...

github: github.com/yzh0623/bra...

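I have been wondering whether the "professionalism" purification is worth sharing with you. It is much the same thing after all: calling a high-parameter large model to score, filter, delete and rewrite content. Besides, every industry is different, so I don't think it is necessary. In the next chapter we will start on fine-tuning, so stay tuned.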

(To be continued...)