[From Scratch] 12. Back to Square One

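Hello to friends old and new, it's been a while.

It has been almost a year since my last update. A lot happened in that time, and at one point it pushed me into a real low. That is all behind me now, and my original intention still stands: I love sharing, and I want us to encourage each other. Let's start everything "from scratch". That's not a bad thing.

Back to business. Due to force majeure, the "side story" series ends here. Next, with "zero cost" as the goal, I'll work with you to hand-build a simple NLP model for the Chinese herbal medicine domain. What I couldn't share before will be released together with the new optimizations.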

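Also, my GTX 1060 graphics card has been commandeered by my family, so this series takes a different route: the entire pipeline uses the CPU as the inference device. Please keep that in mind.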

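Setting Up the Environment

Refactor the brain-mix project and use it as the starting point for setting up the environment. miniconda manages the environments; create one named brain_mix with Python 3.10.15.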

bash

(base) yuanzhenhui@MacBook-Pro ~ % conda create -n brain_mix python==3.10.15
Retrieving notices: ...working... done
Channels:
 - defaults
Platform: osx-64
Collecting package metadata (repodata.json): done
...

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
...

After that, install the packages the environment needs, as follows:

bash

(base) yuanzhenhui@MacBook-Pro ~ % conda activate brain_mix
(brain_mix) yuanzhenhui@MacBook-Pro ~ % python -m pip install --upgrade pip
Looking in indexes: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple, https://mirrors.aliyun.com/pypi/simple/
Requirement already satisfied: pip in ./Documents/anaconda3/envs/brain_mix/lib/python3.10/site-packages (25.1)
Collecting pip
  Downloading https://mirrors.aliyun.com/pypi/packages/b7/3f/945ef7ab14dc4f9d7f40288d2df998d1837ee0888ec3659c813487572faa/pip-25.2-py3-none-any.whl (1.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 4.8 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 25.1
    Uninstalling pip-25.1:
      Successfully uninstalled pip-25.1
Successfully installed pip-25.2

(brain_mix) yuanzhenhui@MacBook-Pro ~ % pip install 'openvino-genai==2025.2.0'
Looking in indexes: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple, https://mirrors.aliyun.com/pypi/simple/
Collecting openvino-genai==2025.2.0
  Downloading https://mirrors.aliyun.com/pypi/packages/28/59/ccc191f98aea661f11e62b98923e08ac29e654464cec9c5a67a347327b94/openvino_genai-2025.2.0.0-cp310-cp310-macosx_10_15_x86_64.whl (3.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 4.2 MB/s  0:00:00
Collecting openvino_tokenizers~=2025.2.0.0.dev (from openvino-genai==2025.2.0)
  Downloading https://mirrors.aliyun.com/pypi/packages/3e/71/9670f4aad6840121851077a6461592e9abd6aa197b75ad6852f328973aba/openvino_tokenizers-2025.2.0.1-py3-none-macosx_10_15_x86_64.whl (13.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.9/13.9 MB 2.4 MB/s  0:00:05
Collecting openvino~=2025.2.0.dev (from openvino_tokenizers~=2025.2.0.0.dev->openvino-genai==2025.2.0)
  Downloading https://mirrors.aliyun.com/pypi/packages/0a/79/ed97d848e951c574535768a099c00d283aa4ac2dac652ae29d03591a8ae5/openvino-2025.2.0-19140-cp310-cp310-macosx_10_15_x86_64.whl (38.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.4/38.4 MB 2.3 MB/s  0:00:16
Collecting numpy<2.3.0,>=1.16.6 (from openvino~=2025.2.0.dev->openvino_tokenizers~=2025.2.0.0.dev->openvino-genai==2025.2.0)
  Downloading https://mirrors.aliyun.com/pypi/packages/7a/4f/1cb5fdc353a5f5cc7feb692db9b8ec2c3d6405453f982435efc52561df58/numpy-2.2.6-cp310-cp310-macosx_14_0_x86_64.whl (6.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.9/6.9 MB 2.7 MB/s  0:00:02
Collecting openvino-telemetry>=2023.2.1 (from openvino~=2025.2.0.dev->openvino_tokenizers~=2025.2.0.0.dev->openvino-genai==2025.2.0)
  Downloading https://mirrors.aliyun.com/pypi/packages/3b/ac/5ab0ca0aa269ad3c73f7bfc3801b10e5f56f75a31bf68c1ae8bd51cf70a4/openvino_telemetry-2025.2.0-py3-none-any.whl (25 kB)
Collecting packaging (from openvino~=2025.2.0.dev->openvino_tokenizers~=2025.2.0.0.dev->openvino-genai==2025.2.0)
  Downloading https://mirrors.aliyun.com/pypi/packages/20/12/38679034af332785aac8774540895e234f4d07f7545804097de4b666afd8/packaging-25.0-py3-none-any.whl (66 kB)
Installing collected packages: openvino-telemetry, packaging, numpy, openvino, openvino_tokenizers, openvino-genai
Successfully installed numpy-2.2.6 openvino-2025.2.0 openvino-genai-2025.2.0.0 openvino-telemetry-2025.2.0 openvino_tokenizers-2025.2.0.1 packaging-25.0

(brain_mix) yuanzhenhui@MacBook-Pro ~ % pip install --upgrade --upgrade-strategy eager "optimum[openvino]"

As shown above, because inference runs on the CPU, OpenVINO is used as the inference backend. (OpenVINO was covered in the AIGC series of articles; see the link below.)

[AIGC] Mac Intel Local LLM Deployment Notes (CPU Only) - CSDN Blog

Getting the Datasets

For training data, I'll use open-source datasets from ModelScope, as follows:

Traditional Chinese Medicine SFT dataset (huangxp/hwtcm-sft-v1)

bash

(brain_mix) yuanzhenhui@MacBook-Pro ~ % modelscope download --dataset huangxp/hwtcm-sft-v1 --local_dir /Users/yuanzhenhui/Documents/modelscope/Datasets/hwtcm-sft-v1 
/Users/yuanzhenhui/Documents/anaconda3/envs/brain_mix/lib/python3.12/site-packages/modelscope/utils/plugins.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
...
Successfully Downloaded from dataset huangxp/hwtcm-sft-v1.

ShenNong LLM TCM dialogue data (xiaofengalg/ShenNong_TCM_Dataset)

bash

(brain_mix) yuanzhenhui@MacBook-Pro ~ % modelscope download --dataset xiaofengalg/ShenNong_TCM_Dataset --local_dir /Users/yuanzhenhui/Documents/modelscope/Datasets/shennong-tcm                           
/Users/yuanzhenhui/Documents/anaconda3/envs/brain_mix/lib/python3.12/site-packages/modelscope/utils/plugins.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
...
Downloading [ChatMed_TCM-v0.2.json]: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 105M/105M [00:53<00:00, 2.27MB/s]
Successfully Downloaded from dataset xiaofengalg/ShenNong_TCM_Dataset.

Traditional Chinese Medicine SFT dataset distilled from DeepSeek (huangxp/hwtcm-deepseek-r1-distill-data)

bash

(brain_mix) yuanzhenhui@MacBook-Pro ~ % modelscope download --dataset huangxp/hwtcm-deepseek-r1-distill-data --local_dir /Users/yuanzhenhui/Documents/modelscope/Datasets/hwtcm-deepseek
/Users/yuanzhenhui/Documents/anaconda3/envs/brain_mix/lib/python3.12/site-packages/modelscope/utils/plugins.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
...
Successfully Downloaded from dataset huangxp/hwtcm-deepseek-r1-distill-data.

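Note: when obtaining datasets, check their open-source licenses, and pay attention to copyright if you plan any commercial use (all of the datasets above are released under Apache License 2.0).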

Loading and Filtering the Datasets

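Open-source datasets can't be used straight after download. They contain a lot of unreviewed content, and the granularity of the data is most likely not aligned, so they need to be cleaned and filtered first (for the same reason you wash vegetables from the market before cooking them). To make the later cleaning easier, the datasets are first loaded into Elasticsearch. This relies on the vector field support in Elasticsearch 8.x: later on, the vector field is used to judge how similar records are, so that near-duplicates can be excluded.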
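To make the vector field idea concrete, here is a minimal sketch of what the temporary index mapping might look like. The real mapping lives in the project's nlp_cnf.yml and is not shown in this article, so every field type below is an assumption inferred from the field names used later (question, answer, data_source, gather_text, gather_vector_1024, process_status). The name TMP_INDEX_MAPPING is hypothetical.

```python
# Hypothetical mapping for the temporary index; the actual one is defined in
# nlp_cnf.yml in the brain-mix project and may differ.
TMP_INDEX_MAPPING = {
    "properties": {
        # Assumed keyword: exact-match term queries and SQL GROUP BY are used later.
        "question": {"type": "keyword"},
        "answer": {"type": "keyword"},
        "data_source": {"type": "keyword"},
        # Combined question + answer text that gets embedded.
        "gather_text": {"type": "text"},
        # 1024-dimensional embedding, indexed for cosine similarity search.
        "gather_vector_1024": {
            "type": "dense_vector",
            "dims": 1024,
            "index": True,
            "similarity": "cosine",
        },
        # 0 means "not yet processed" in the loading script.
        "process_status": {"type": "integer"},
    }
}

vector_field = TMP_INDEX_MAPPING["properties"]["gather_vector_1024"]
print(vector_field["type"], vector_field["dims"])
```

The dims value has to match the output size of the embedding model, which is why the field name carries the 1024 suffix.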

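For installing Elasticsearch, see the following article.

[Docker] Installing and Setting Up Elasticsearch 8.12 - CSDN Blog

Once Elasticsearch is installed, you can write a Python script to insert the data. The pseudocode is as follows: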

python

import json
import os
from pathlib import Path

from tqdm import tqdm

# YamlUtil, ElasticUtil, DataEmbedding, project_dir and TMP_ES_INDEX come from
# the brain-mix project; their imports are omitted here.


class LoadAndSaveToEs:

    def __init__(self):

        # Read the configuration
        nlp_cnf = os.path.join(project_dir, 'resources', 'config', 'nlp_cnf.yml')
        nlp_yaml = YamlUtil(nlp_cnf)
        base_path = nlp_yaml.get_value('datasets.base_path')
        dirs_name = nlp_yaml.get_value('datasets.dir_name')

        # Load the dataset file paths into memory as a dict keyed by folder name
        self.datasets_path = {}
        for dir_name in dirs_name:
            folder_path = os.path.join(base_path, dir_name)
            self.datasets_path[dir_name] = self._get_unique_file_paths(folder_path)

        # Initialize the Elasticsearch client and the local bge-large-zh-v1.5 embedding interface
        self.elastic = ElasticUtil()
        self.embedding = DataEmbedding()

        # Read the field mapping of the temporary index and create the index
        es_gather_qa_mapping = nlp_yaml.get_value('datasets.tmp_gather_db')
        self.elastic.create_index(name=TMP_ES_INDEX, mapping=es_gather_qa_mapping[TMP_ES_INDEX])
    
    def save_hwtcm_deepseek_data(self):
        """
        Save the hwtcm-deepseek dataset to Elasticsearch.
        """
        # Load every JSON record from the dataset file
        dataset_array = self._load_json_file(self.datasets_path["hwtcm-deepseek"][0])
        # Adapt each record to the common question/answer structure
        qa_array = []
        for dataset in dataset_array:
            question = dataset["instruction"]
            # Keep only the text after the reasoning block; if a record has no
            # "</think>" tag, keep the whole text instead of raising IndexError
            answer = dataset["think"].split("</think>", 1)[-1]
            answer = answer.strip().replace("\n", " ")
            qa_array.append({
                "question": question,
                "answer": answer,
                "data_source": "hwtcm-deepseek"
            })
        # Bulk-save to Elasticsearch
        self._save_to_es(qa_array)
        
    def save_hwtcm_sft_data(self):
        """
        Save the hwtcm-sft-v1 dataset to Elasticsearch.
        """
        # Similar to save_hwtcm_deepseek_data; omitted for brevity
        
    def save_shennong_tcm_data(self):
        """
        Save the shennong-tcm dataset to Elasticsearch.
        """
        # Similar to save_hwtcm_deepseek_data; omitted for brevity
    
    def _save_to_es(self, qa_array):
        """
        Bulk-save records to Elasticsearch.
        """
        if qa_array:
            for qa_json in tqdm(qa_array, desc="Now changing to vectors..."):
                # Assemble the combined long-text field
                qa_json["gather_text"] = f"【问题】{qa_json['question']}【答案】{qa_json['answer']}"
                # Embed the combined field (1024 dimensions)
                qa_json["gather_vector_1024"] = self.embedding.array_to_embedding([qa_json["gather_text"]])[0]
                # Mark the record as not yet processed
                qa_json["process_status"] = 0
            self.elastic.batch_insert(TMP_ES_INDEX, qa_array)
    
    def _get_unique_file_paths(self, directory):
        """
        Collect the JSON file paths under a directory. Each dataset folder
        holds only one JSON file, so no further handling is needed here.
        """
        unique_files = set()
        for path in Path(directory).rglob('*'):
            if path.is_file() and path.suffix == '.json':
                unique_files.add(str(path))
        # Sort so the order is deterministic if a folder ever holds several files
        return sorted(unique_files)
    
    def _load_json_file(self, file_path):
        """
        Load JSON data from a file.
        """
        with open(file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)
        return data

if __name__ == "__main__":
    laste = LoadAndSaveToEs()
    laste.save_hwtcm_deepseek_data()
    laste.save_hwtcm_sft_data()
    laste.save_shennong_tcm_data()

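The functions above do the "rough processing" of the local data. But because everything runs on the CPU, converting the text to 1024-dimensional vectors was very slow, at one point reaching 1.18 s/it: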

bash

Now changing to vectors...:   4%|████▊    | 2184/56283 [52:53<17:43:18,  1.18s/it]
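
To put that rate in perspective, the numbers on the progress bar can be turned into a rough time estimate. This is simple arithmetic on the values shown above (56283 records, 2184 done, 1.18 s per record); tqdm rounds the displayed rate, so its own estimate differs by a minute or so.

```python
total_items = 56283        # total records, from the progress bar
done_items = 2184          # records already embedded
seconds_per_item = 1.18    # displayed rate, rounded by tqdm

remaining_hours = (total_items - done_items) * seconds_per_item / 3600
total_hours = total_items * seconds_per_item / 3600

print(f"remaining ~ {remaining_hours:.1f} h, total ~ {total_hours:.1f} h")
# prints "remaining ~ 17.7 h, total ~ 18.4 h"
```

In other words, a single pass over the data on the CPU alone would take most of a day.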

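So it was time to call in outside help: SiliconFlow.

Thanks to SiliconFlow for offering free access to its embedding model API. With a few changes to the code above, the final version looks like this: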

python

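import tiktoken


class LoadAndSaveToEs:

    def __init__(self):

        ...

        # Use tiktoken to count tokens (the SiliconFlow endpoint only accepts
        # inputs of up to 512 tokens). cl100k_base is not the embedding model's
        # own tokenizer, so the count is only an approximation.
        self.enc = tiktoken.get_encoding("cl100k_base")

        # A utility class written specifically for calling the SiliconFlow API
        self.api = ApiUtil()
        utils_cnf = os.path.join(project_dir, 'resources', 'config', 'utils_cnf.yml')
        self.embedding_model = YamlUtil(utils_cnf).get_value('silicon.agent.content_embedding.model')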

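    def _save_to_es(self, qa_array):

        if qa_array:

            # With the extra compute available, split the data into 5 parts;
            # 4 of them are handled through SiliconFlow
            qa_split_array = CommonUtil.split_array(qa_array, 5)
            ...
            # Regroup the split data into two shares: the first part, and the other four merged
            qa_split_result = [
                qa_split_array[0],
                [item for sublist in qa_split_array[1:] for item in sublist]
            ]

            ...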

    def _thread_to_get_vectors(self, qa_split, qa_batch_array, flag):

        for qa_json in tqdm(qa_split, desc="Now changing to vectors..."):
            ...

            if flag == 0:
                qa_json["gather_vector_1024"] = CommonUtil.request_embedding(qa_json["gather_text"])
            else:

                # Send inputs shorter than 512 tokens to SiliconFlow; embed the rest locally
                content = qa_json["gather_text"]
                if len(self.enc.encode(content)) < 512:
                    qa_json["gather_vector_1024"] = self.api.embedding_with_sync(self.embedding_model, [content])[0]
                else:
                    qa_json["gather_vector_1024"] = self.embedding.array_to_embedding([content])[0]

            ...

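The speed finally picked up.

After 2 x 24 hours of hard slog, all the data was finally loaded, and the first cleaning pass could begin. The pseudocode is as follows: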
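
Since the listing above elides the splitting and threading details, here is a minimal, self-contained sketch of the same idea: split the records into 5 parts, keep 1 part on one path and send the other 4 through the remote path, with a local fallback for inputs at or above the token limit. Everything here is an assumption for illustration. split_array and embed_records are hypothetical stand-ins for CommonUtil.split_array and _thread_to_get_vectors, the fake_* functions replace the tokenizer and both embedding backends, and treating the first share as "local only" is my reading of the flag, not something the source confirms.

```python
import threading


def split_array(items, parts):
    """Split items into `parts` consecutive chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def embed_records(records, use_remote, count_tokens, remote_embed, local_embed, limit=512):
    """Fill in the vector of each record, falling back to local for long inputs."""
    for record in records:
        text = record["gather_text"]
        if use_remote and count_tokens(text) < limit:
            record["gather_vector_1024"] = remote_embed(text)
        else:
            record["gather_vector_1024"] = local_embed(text)


# Stand-ins so the sketch runs without a tokenizer, a model, or network access.
def fake_count_tokens(text):
    return len(text)


def fake_remote_embed(text):
    return ["remote"]


def fake_local_embed(text):
    return ["local"]


records = [{"gather_text": "x" * n} for n in (10, 600, 20, 30, 40, 50, 700, 60, 70, 80)]
chunks = split_array(records, 5)
local_share = chunks[0]                                              # 1 of 5 parts
remote_share = [record for chunk in chunks[1:] for record in chunk]  # other 4 parts

shared_args = (fake_count_tokens, fake_remote_embed, fake_local_embed)
workers = [
    threading.Thread(target=embed_records, args=(local_share, False) + shared_args),
    threading.Thread(target=embed_records, args=(remote_share, True) + shared_args),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

remote_count = sum(r["gather_vector_1024"] == ["remote"] for r in records)
print(len(local_share), len(remote_share), remote_count)
```

Threads suit this workload because the remote share spends most of its time waiting on HTTP responses, so the local model can keep the CPU busy in the meantime.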

python

import time

import schedule
from tqdm import tqdm


class DeleteLowQualityData:

    ...

    def delete_dulpicate_data(self):
        """
        Delete duplicate records (repeated imports are hard to avoid while loading data).
        """
        # Elasticsearch supports SQL queries, so a GROUP BY statement finds the duplicated pairs directly
        search_sql = f"select question,answer from {TMP_ES_INDEX} group by question,answer having count(1) > 1"
        results = self.elastic.find_by_sql(sql=search_sql)
        if results:
            response_array = results.body["rows"]

            # Iterate over the duplicated (question, answer) pairs
            for response in tqdm(response_array, desc="Now deleting duplicate data..."):

                # Query again with DSL to get the "_id" of every matching document
                search_single_body = {
                    "query": {
                        "bool": {
                            "must": [
                                {"term": {"question": {"value": response[0]}}},
                                {"term": {"answer": {"value": response[1]}}}
                            ]
                        }
                    }
                }

                dsl_results = self.elastic.find_by_body_nopaging(name=TMP_ES_INDEX, body=search_single_body)

                # Delete all matches except the first, so exactly one record is kept
                for idx, dsl_result in enumerate(dsl_results):
                    if idx > 0:
                        self.elastic.delete_by_id(name=TMP_ES_INDEX, id=dsl_result["_id"])

    def delete_similar_data(self):
        """
        Delete similar records (judged by the vector field).
        """
        self.dh.find_and_remove_duplicate_vectors(
            self.delete_conn,
            TMP_ES_INDEX,
            vector_field="gather_vector_1024",
            text_field="gather_text",
            similarity_threshold=0.95
        )

# Run the checks on a timer, every 30 and every 90 minutes respectively (can be dropped once the data is stable)
dlqd = DeleteLowQualityData()        
schedule.every(30).minutes.do(dlqd.delete_dulpicate_data)
schedule.every(90).minutes.do(dlqd.delete_similar_data)

if __name__ == "__main__":

    dlqd.delete_dulpicate_data()
    dlqd.delete_similar_data()

    while True:
        schedule.run_pending()
        time.sleep(1)

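As the code above shows, I wrapped the removal of similar records in a function named "find_and_remove_duplicate_vectors". What is inside it? The function mainly uses the DBSCAN clustering algorithm to identify similar vectors (cosine similarity above the threshold), keeps the best document in each cluster (chosen by timestamp or text length), and marks the rest as duplicates and deletes them. The detailed code is in clean_util.py in the brain-mix project.

All of the code above has been published in the brain-mix project. Feedback and suggestions are welcome.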
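
The core idea can be sketched in a few lines of plain Python. This is not the project's implementation (see clean_util.py for that); it is a simplified stand-in under stated assumptions. It assumes DBSCAN is run with min_samples=2, in which case its clusters coincide with the connected components of the "cosine similarity >= threshold" graph, so a union-find over all pairs gives the same grouping. The function name find_duplicate_ids and the keys id, text and vector are hypothetical, and "best" is taken to mean the longest text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate_ids(docs, threshold=0.95):
    """Return the ids to delete, keeping the longest text in each similarity cluster."""
    parent = list(range(len(docs)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity reaches the threshold
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine_similarity(docs[i]["vector"], docs[j]["vector"]) >= threshold:
                parent[find(i)] = find(j)

    # Group documents by cluster root
    clusters = {}
    for i, doc in enumerate(docs):
        clusters.setdefault(find(i), []).append(doc)

    # Keep one document per cluster and mark the rest for deletion
    to_delete = []
    for members in clusters.values():
        keeper = max(members, key=lambda d: len(d["text"]))
        to_delete.extend(d["id"] for d in members if d is not keeper)
    return to_delete


docs = [
    {"id": "a", "text": "short", "vector": [1.0, 0.0]},
    {"id": "b", "text": "a longer text", "vector": [0.99, 0.05]},
    {"id": "c", "text": "unrelated", "vector": [0.0, 1.0]},
]
duplicate_ids = find_duplicate_ids(docs)
print(duplicate_ids)  # prints ['a']
```

Note that comparing all pairs is quadratic in the number of documents, so a sketch like this only suits small batches. Note also that clusters are formed by chaining: two documents can end up in the same cluster through intermediate neighbours even when their direct similarity is below the threshold.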

gitee: gitee.com/yzh0623/bra...

github: github.com/yzh0623/bra...

The next chapter continues with data augmentation. Stay tuned.

(To be continued...)