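Fine-Tuning a Model with Spark AI (DataWhale AI Summer Camp)

Preface

Hello everyone, I'm GISer Liu😁, a GIS developer who loves AI technology. This article is my entry for the LLM fine-tuning track of the fourth session of the 2024 DataWhale AI Summer Camp. I hope it helps you 😲

Introduction

In this article I explain in detail how to build a dataset of Gaokao (college entrance exam) multiple-choice questions for Chinese and English from scratch, fine-tune an LLM on the iFLYTEK Open Platform, and finally test the model by calling its API. The work is split into the following steps:

Dataset preparation: reading the data, preprocessing it, and extracting the questions and answers.
Model training: using an existing language model for customized training.
Local testing: testing the trained model locally, including how to interact with it and verify its accuracy.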

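1. Dataset Preparation

Before training the model, we need to prepare a high-quality dataset. The dataset here has two parts: Gaokao multiple-choice questions for Chinese and for English.

1.1 Reading and Preprocessing the Data

First, we load the raw Excel file into memory and replace some characters so that the data format is consistent.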

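# !pip install pandas openpyxl  # uncomment to install if these packages are missing
import pandas as pd
import re

# Read the Excel file
df = pd.read_excel('训练集-语文.xlsx')
df = df.replace('．', '.', regex=True)  # replace the full-width period with a half-width one
df = df.replace('（', '(', regex=True)  # replace the full-width left parenthesis with a half-width one

# Read the '选项' (options) column of the row at index 2, i.e. the third data row
second_row_option_content = df.loc[2, '选项']
# Print that cell's content
print(second_row_option_content)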

This prints one cell so that we can inspect the raw data.
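
To see what the normalization does without the Excel file, here is a minimal sketch on a plain string. It uses str.translate instead of DataFrame.replace, and both the sample text and the mapping (including the right parenthesis, which the code above does not replace) are assumptions for illustration.

```python
# Map full-width punctuation to half-width equivalents (assumed set; extend as needed)
FULLWIDTH_MAP = str.maketrans({'．': '.', '（': '(', '）': ')'})

def normalize_punctuation(text: str) -> str:
    """Return text with the mapped full-width characters replaced."""
    return text.translate(FULLWIDTH_MAP)

# Invented sample in the layout of an options cell
sample = '1．下列理解正确的一项是（ ）\nA．甲\nB．乙'
print(normalize_punctuation(sample))
```

After this step every question number and option letter is followed by an ASCII period, which is what the regular expressions in the next section rely on.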

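1.2 Extracting the Multiple-Choice Questions

To extract the questions and options, we use regular expressions that match the question and option formats. The chinese_multiple_choice_questions function implements this.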

def chinese_multiple_choice_questions(questions_with_answers):
    # Input question text
    text = questions_with_answers

    # Regular expression patterns for questions and options
    question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
    choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

    # Find all questions
    questions = question_pattern.findall(text)

    # Initialize the multiple-choice and short-answer lists
    multiple_choice_questions = []
    short_answer_questions = []

    # Process each question
    for id, question in enumerate(questions):
        # Check whether it is a multiple-choice question
        if re.search(r'[A-D]', question):
            choices = choice_pattern.findall(question)  # extract the options
            question_text = re.split(r'\n', question.split('(')[0])[0]  # extract the question text

            # Store the question and its options as a dict
            multiple_choice_questions.append({
                'question': f"{id+1}.{question_text.strip()}",
                'choices': choices
            })
        else:
            short_answer_questions.append(question.strip())  # collect short-answer questions
    return multiple_choice_questions

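This function splits the input text into individual questions and extracts each option with its content, returning a list of dicts that each hold a question and its options.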
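
The splitting step can be tried in isolation. The sketch below applies the same question pattern to an invented two-question sample (the sample text is an assumption, not taken from the dataset) and prints the first line of each block.

```python
import re

# Same question pattern as in chinese_multiple_choice_questions
question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)

# Invented sample with two questions
sample = (
    "1.下列理解正确的一项是( )\nA.甲\nB.乙\nC.丙\nD.丁\n"
    "2.下列分析不正确的一项是( )\nA.子\nB.丑\nC.寅\nD.卯"
)

# Each match runs from one "number." up to the next "number." or the end of the text
blocks = question_pattern.findall(sample)
for block in blocks:
    print(repr(block.splitlines()[0]))
```

Because the pattern treats every "digits followed by a period" as the start of a new question, a number such as 3.14 inside a question would also split the block, so it is worth spot-checking a few rows by eye.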

Next, we extract the questions from the first three rows:

questions_list = []
for data_id in range(len(df[:3])):
    second_row_option_content = df.loc[data_id, '选项']
    questions_list.append(chinese_multiple_choice_questions(second_row_option_content))

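1.3 Extracting the Answers

To extract the correct answers from the data, we define the chinese_multiple_choice_answers function, which uses regular expressions to match the answer to each question in the text.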

def chinese_multiple_choice_answers(questions_with_answers):
    questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")

    # Regular expressions that match the answers
    choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
    short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')

    # Find all matching answers
    choice_matches = choice_pattern.findall(questions_with_answers)
    short_matches = short_pattern.findall(questions_with_answers)

    # Convert the matches to dicts keyed by question number
    choice_answers = {int(index): answer for index, answer in choice_matches}
    short_answers = {int(index): answer for index, answer in short_matches}

    # Sort by question number
    sorted_choice_answers = sorted(choice_answers.items())
    sorted_short_answers = sorted(short_answers.items())

    # Renumber the multiple-choice answers from 1
    answers = []
    for id in range(len(sorted_choice_answers)):
        answers.append(f"{id+1}. {sorted_choice_answers[id][1]}")
    return answers

Here we test the answer extraction:

# Read the '答案' (answer) column of the row at index 60
second_row_option_content = df.loc[60, '答案']
# Print the raw answer text
print(second_row_option_content)
chinese_multiple_choice_answers(second_row_option_content)
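
For readers without the Excel file, the core of the answer matching can be sketched on an invented answer string. The "number. letter" layout of the sample is an assumption about how the answer cells look.

```python
import re

# Invented answer string in the assumed "number. letter" layout
raw = "1. B\n2. D\n3. A"

# Remove spaces and newlines first, as the function above does
compact = raw.replace(" ", "").replace("\n", "")

# Each match is a (question number, answer letters) pair
pairs = re.findall(r'(\d+)\.([A-Z]+)', compact)
answers = {int(num): letters for num, letters in pairs}
print(answers)
```

Keying the dict by question number means a repeated number keeps only the last answer found for it.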

Build the processed answer column:

df['答案_processed'] = df['答案'].map(chinese_multiple_choice_answers)
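
One thing to watch when mapping over a column: pandas reads empty Excel cells as NaN (a float), and a float has no .replace method, so the parser would raise on such a cell. Below is a minimal guard, assuming an empty list is an acceptable result for empty cells. safe_parse is a hypothetical helper, not part of the original code.

```python
def safe_parse(cell, parser):
    """Apply parser to cell only if it is a non-empty string; otherwise return []."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    return parser(cell)

# Stand-in parser for the demo; in the article it would be chinese_multiple_choice_answers
print(safe_parse("1.A 2.B", str.split))
print(safe_parse(float("nan"), str.split))
```

With the DataFrame above, this could be used as df['答案'].map(lambda c: safe_parse(c, chinese_multiple_choice_answers)).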

1.4 Building the Prompt-Wrapping Function

def get_prompt_cn(text):
    prompt = f'''
    你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：
    
    ### 回答要求
    (1)理解文中重要概念的含义
    (2)理解文中重要句子的含意
    (3)分析论点、论据和论证方法
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt
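
Note that the indentation inside the triple-quoted string is part of the prompt, so every line the model sees starts with four spaces. If that is not wanted, one option is to dedent the template before filling in the text. The sketch below uses a shortened, illustrative template, and build_prompt is a hypothetical name.

```python
import textwrap

# Shortened, illustrative template; the full prompt text is in get_prompt_cn
TEMPLATE = textwrap.dedent('''\
    ### 回答要求
    (1)理解文中重要概念的含义

    ### 阅读文本
    {text}
''')

def build_prompt(text):
    # Dedent happens once on the template, so multi-line reading text is left untouched
    return TEMPLATE.format(text=text)

print(build_prompt('第一行\n第二行'))
```

Dedenting the template rather than the finished prompt matters because a multi-line reading text has no common indentation with the template lines.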

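1.5 Building the Chinese Dataset

By calling the functions above, we can build the final training dataset. In this step, all questions and answers are formatted into the required input and output form, and a prompt suitable for model training is generated.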

def process_cn(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        
        data_options = chinese_multiple_choice_questions(data_options)
        data_answers = chinese_multiple_choice_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        
        if len(data_answers) == len(data_options):
            res = ''
            for id_, question in enumerate(data_options):
                res += f"{question['question']}?\n"
                for choice in question['choices']:
                    res += f"{choice[0]}. {choice[1]}\n"
                res += f"答案: {data_answers[id_].split('.')[-1]}\n"
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input, res_output

cn_input, cn_output = process_cn(df)

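In this way, we extract the questions and answers from each row and build the input and output the model needs. Rows where the number of extracted answers differs from the number of extracted questions are skipped.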
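
To make the target format concrete, here is a small sketch that formats one invented question the way the loop in process_cn does. The question, choices, and answer are made up, and the answer letter is additionally stripped of whitespace, which the original loop does not do.

```python
# Invented parsed question and answer, in the shapes the functions above return
question = {
    'question': '1.下列说法正确的一项是',
    'choices': [('A', '甲'), ('B', '乙'), ('C', '丙'), ('D', '丁')],
}
answer = '1. C'

# One line for the question, one per option, and one for the answer
lines = [f"{question['question']}?"]
lines += [f"{letter}. {text}" for letter, text in question['choices']]
lines.append(f"答案: {answer.split('.')[-1].strip()}")

output = "\n".join(lines) + "\n"
print(output)
```

Each training sample's output is the concatenation of such blocks for all questions of one reading text.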

1.6 Building the English Dataset

The English dataset is built in the same way with similar logic. The complete code is as follows:

import pandas as pd
import re

# Read the Excel file and preprocess the data
df = pd.read_excel('训练集-英语.xlsx')

# Normalize look-alike characters. With regex=True the pattern is a regular
# expression, so the dot is escaped to match a literal period only.
df = df.replace('．', '.', regex=True) \
       .replace('\u0410\\.', 'A.', regex=True) \
       .replace('\u0412\\.', 'B.', regex=True) \
       .replace('\u0421\\.', 'C.', regex=True)  # Cyrillic А, В, С -> Latin A, B, C

def remove_whitespace_and_newlines(input_string):
    # Remove spaces, newlines, and periods with str.replace()
    result = input_string.replace(" ", "").replace("\n", "").replace(".", "")
    return result

# Extract the answer letters from the answer column
def get_answers(text):
    # Remove spaces, newlines, and periods
    text = remove_whitespace_and_newlines(text)
    
    # Pattern: one digit (the last digit of the question number) followed by an option letter
    pattern = re.compile(r'(\d)\s*([A-D])')

    # Find all matches
    matches = pattern.findall(text)
    res = []
    
    # Collect the answer letter of each match
    for match in matches:
        number_dot, first_letter = match
        res.append(first_letter)
    return res

# Sample input to test get_answers
input_string = "28. A. It is simple and plain. 29. D. Influential. 30. D.33%. 31. B. Male chefs on TV programmes."
res = get_answers(input_string)
print(res)  # prints ['A', 'D', 'D', 'B']

# Extract the questions and options from the question column
def get_questions(text):
    # Replace each newline with two spaces and append two spaces at the end
    text = text.replace('\n', '  ')+'  '
    
    # Pattern matching a question followed by its four options
    pattern = re.compile(r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})', re.DOTALL)

    # Find all matches
    matches = pattern.findall(text)

    # List of result dicts
    questions_dict_list = []

    # Extract the question and options from each match
    for match in matches:
        question, option1, option2, option3, option4 = match
        
        # Extract the question text
        pattern_question = re.compile(r'(\d+)\.(.*)')
        question_text = pattern_question.findall(question.strip())[0][1]
        
        # Map each option letter to its full text
        options = {option1[0]: option1, option2[0]: option2, option3[0]: option3, option4[0]: option4}
        
        # Store the question and options in a dict
        question_dict = {
            'question': question_text,
            'options': {
                'A': options.get('A', '').strip(),
                'B': options.get('B', '').strip(),
                'C': options.get('C', '').strip(),
                'D': options.get('D', '').strip()
            }
        }
        questions_dict_list.append(question_dict)
    
    return questions_dict_list

# Run get_questions on the first row of the '选项' column and print the result
questions = get_questions(df.loc[0, '选项'])
for q in questions:
    print(q)  # print each extracted question with its options

# Build the prompt text used for model training
def get_prompt_en(text):
    prompt = f'''
    你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:
    
    ### 回答要求
    (1)Understand the main idea of the text.
    (2)Understand the specific information in the text.
    (3)Infer the meaning of words and phrases from the context.
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt

# Process the whole dataset
def process_en(df): 
    res_input = []
    res_output = []
    
    # Iterate over every row of the dataset
    for id in range(len(df)):
        # Read the options, answers, and reading text
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        
        # Parse the options and answers with the functions defined above
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_en(data_prompt)
        
        # Keep the row only if the numbers of answers and questions match
        if(len(data_answers) == len(data_options)):
            res = ''
            # Format every question into the final output text
            for id, question in enumerate(data_options):
                res += f'''
                {id+1}.{question['question']}
                {question['options']['A']}
                {question['options']['B']}
                {question['options']['C']}
                {question['options']['D']}
                answer:{data_answers[id]}
                '''+'\n'
            res_output.append(res)
            res_input.append(data_prompt)
    
    return res_input, res_output

# Process the dataset
en_input, en_output = process_en(df)
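
The question pattern is the densest part of this code, so here is a small sketch that runs the same regular expression on an invented two-question sample. The sample text is an assumption about the layout of the '选项' cells: one question line followed by four option lines.

```python
import re

# Same pattern as in get_questions: a question stem followed by options A to D,
# each terminated by two whitespace characters
PATTERN = re.compile(
    r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})',
    re.DOTALL,
)

# Invented sample
raw = (
    "28. What is the text about?\nA. Art.\nB. Food.\nC. Music.\nD. Sport.\n"
    "29. Who wrote it?\nA. Tom.\nB. Ann.\nC. Joe.\nD. Sue."
)

# Same preparation as get_questions: every newline becomes a two-space separator
prepared = raw.replace('\n', '  ') + '  '

matches = PATTERN.findall(prepared)
for stem, *options in matches:
    print(stem.strip(), [o.strip() for o in options])
```

The two-space separator is what ends each option, so a stem that itself contains "A." (an abbreviation, for example) or an option with a double space inside it would shift the groups. Rows like that usually end up with a question count that differs from the answer count and are then dropped by the length check in process_en.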

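1.7 Merging the Datasets

We merge the Chinese and English datasets for the export step that follows. The first 30 Chinese samples and the first 20 English samples are appended a second time:

# Combine the input and output lists into one DataFrame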
df_new = pd.DataFrame({'input': cn_input+cn_input[:30]+en_input+en_input[:20], 'output': cn_output+cn_output[:30]+en_output+en_output[:20]})

df_new

The merged DataFrame is displayed, so we can confirm that the data was combined successfully!
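
When samples are repeated like this, the input and output lists must be sliced in exactly the same way or the pairs drift apart. A minimal sketch of one way to keep them aligned is to slice (input, output) pairs instead of two separate lists. The placeholder samples and their counts below are invented.

```python
# Placeholder samples; in the article these come from process_cn and process_en
cn_pairs = [(f"cn_in_{i}", f"cn_out_{i}") for i in range(40)]
en_pairs = [(f"en_in_{i}", f"en_out_{i}") for i in range(25)]

# Repeat the first 30 Chinese and first 20 English pairs, as in the merge above
merged = cn_pairs + cn_pairs[:30] + en_pairs + en_pairs[:20]

inputs = [pair[0] for pair in merged]
outputs = [pair[1] for pair in merged]
print(len(inputs), len(outputs))
```

The two lists can then be passed to pd.DataFrame as the 'input' and 'output' columns.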

The complete code is as follows:

import pandas as pd
import re
import json

# Shared helper: remove spaces, newlines, and periods
def remove_whitespace_and_newlines(input_string):
    result = input_string.replace(" ", "").replace("\n", "").replace(".", "")
    return result

# Shared helper: extract the answers
def get_answers(text):
    text = remove_whitespace_and_newlines(text)
    pattern = re.compile(r'(\d)\s*([A-D])')
    matches = pattern.findall(text)
    res = []
    for match in matches:
        number_dot, first_letter = match
        res.append(first_letter)
    return res

# Shared helper: extract the questions and options
def get_questions(text):
    text = text.replace('\n', '  ')+'  '
    pattern = re.compile(r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})', re.DOTALL)
    matches = pattern.findall(text)
    questions_dict_list = []

    for match in matches:
        question, option1, option2, option3, option4 = match
        pattern_question = re.compile(r'(\d+)\.(.*)')
        question_text = pattern_question.findall(question.strip())[0][1]
        options = {option1[0]: option1, option2[0]: option2, option3[0]: option3, option4[0]: option4}
        
        question_dict = {
            'question': question_text,
            'options': {
                'A': options.get('A', '').strip(),
                'B': options.get('B', '').strip(),
                'C': options.get('C', '').strip(),
                'D': options.get('D', '').strip()
            }
        }
        questions_dict_list.append(question_dict)
    
    return questions_dict_list

# Build the English prompt text
def get_prompt_en(text):
    prompt = f'''
    你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:
    
    ### 回答要求
    (1)Understand the main idea of the text.
    (2)Understand the specific information in the text.
    (3)Infer the meaning of words and phrases from the context.
    
    ### 阅读文本
    {text}
    '''
    return prompt

# Process the English dataset
def process_en(df):
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id, '答案']
        data_prompt = df.loc[id, '阅读文本']
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_en(data_prompt)
        
        if len(data_answers) == len(data_options):
            res = ''
            for id, question in enumerate(data_options):
                res += f'''
                {id+1}.{question['question']}
                {question['options']['A']}
                {question['options']['B']}
                {question['options']['C']}
                {question['options']['D']}
                answer:{data_answers[id]}
                '''+'\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input, res_output

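# Read and process the English dataset
df_en = pd.read_excel('训练集-英语.xlsx')
# With regex=True the pattern is a regular expression, so the dot is escaped to match a literal period only
df_en = df_en.replace('．', '.', regex=True) \
             .replace('\u0410\\.', 'A.', regex=True) \
             .replace('\u0412\\.', 'B.', regex=True) \
             .replace('\u0421\\.', 'C.', regex=True)  # Cyrillic А, В, С -> Latin A, B, C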

en_input, en_output = process_en(df_en)


# Build the Chinese prompt text
def get_prompt_cn(text):
    prompt = f'''
    你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in Chinese. The questions and answers you raised need to be completed in Chinese for at least the following points:
    
    ### 回答要求
    (1)理解文章的主要意思。
    (2)理解文章中的具体信息。
    (3)根据上下文推断词语和短语的含义。
    
    ### 阅读文本
    {text}
    '''
    return prompt

# Process the Chinese dataset
def process_cn(df):
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id, '答案']
        data_prompt = df.loc[id, '阅读文本']
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        
        if len(data_answers) == len(data_options):
            res = ''
            for id, question in enumerate(data_options):
                res += f'''
                {id+1}.{question['question']}
                {question['options']['A']}
                {question['options']['B']}
                {question['options']['C']}
                {question['options']['D']}
                answer:{data_answers[id]}
                '''+'\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input, res_output

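# Read and process the Chinese dataset (same file and normalization as in section 1.1)
df_cn = pd.read_excel('训练集-语文.xlsx')
df_cn = df_cn.replace('．', '.', regex=True).replace('（', '(', regex=True)
cn_input, cn_output = process_cn(df_cn)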


# Merge the datasets
df_new = pd.DataFrame({'input': cn_input+cn_input[:30]+en_input+en_input[:20], 'output': cn_output+cn_output[:30]+en_output+en_output[:20]})

# Convert the dataset and export it
# Open a file for writing JSONL with UTF-8 encoding
with open('output.jsonl', 'w', encoding='utf-8') as f:
    # Convert every row to JSON
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False)
        # Write the JSON string to the file, followed by a newline
        f.write(row_json + '\n')

# Print a confirmation message
print("JSONL file generated")

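2. Model Training

With the data prepared, we can use it to fine-tune the model. Here we use the pre-trained Spark A-13B model.

2.1 Converting the Data Format

First, we convert the prepared dataset to JSONL format so that it can be used for model training.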

import json

# Save the dataset in JSONL format
with open('output.jsonl', 'w', encoding='utf-8') as f:
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False)
        f.write(row_json + '\n')

# Print a confirmation message
print("JSONL file generated")
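
Before uploading, it can be worth reading the file back to confirm that every line parses on its own and carries both fields. The sketch below writes a placeholder record to a temporary directory and validates it. The record content and the temporary path are assumptions for the demo; in practice you would open your own output.jsonl.

```python
import json
import os
import tempfile

# Placeholder record; real rows come from df_new
rows = [{'input': 'prompt 1', 'output': '1.question?\nA. a\nB. b\nC. c\nD. d\n答案: A\n'}]

path = os.path.join(tempfile.mkdtemp(), 'output.jsonl')
with open(path, 'w', encoding='utf-8') as f:
    for row in rows:
        # json.dumps escapes the newlines inside the strings, so each record stays on one line
        f.write(json.dumps(row, ensure_ascii=False) + '\n')

# Read the file back: every non-empty line must be a JSON object with exactly these two keys
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

assert all(set(record) == {'input', 'output'} for record in loaded)
print(f"{len(loaded)} valid records")
```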

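2.2 Uploading the Dataset

First, open the iFLYTEK Open Platform website and click to create a new dataset:

Here we fill in the dataset's basic information.

Next, we upload the dataset we built earlier and select the correct question and answer fields.

Wait for the upload to finish, then start training.

On the training configuration page, we set the model name, pre-trained model, learning rate, dataset, and other options.

Wait for training to finish. This takes at least 30 minutes, so it is a good time for a cup of coffee!

If you do not have an application yet, go to https://console.xfyun.cn/app/myapp and click Create to make one.

Click Publish. After a short wait, the model is published successfully:

This page shows the parameters of the published model. Save the following parameters for the test later: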

serviceId:---------
resourceId:-----------
APPID:------
APIKey:---------
APISecret:------------
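
Since these values are secrets, one option is to keep them out of the source file and read them from environment variables. The sketch below is only a suggestion: the environment variable names and the load_credentials helper are hypothetical and not something the platform prescribes.

```python
import os

# Hypothetical environment variable names; choose your own
REQUIRED = [
    'SPARKAI_APP_ID', 'SPARKAI_API_KEY', 'SPARKAI_API_SECRET',
    'SPARK_SERVICE_ID', 'SPARK_RESOURCE_ID',
]

def load_credentials(env=os.environ):
    """Return the five values as a dict, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED}

# Demo with a fake mapping instead of the real environment
fake_env = {name: 'xxxxxxx' for name in REQUIRED}
print(sorted(load_credentials(fake_env)))
```

Failing early with the list of missing names is easier to debug than an authentication error from the API.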

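That completes the model training part!

3. Local Testing

After training, we test the model locally to make sure the questions it generates meet expectations.

3.1 Test Code

The following code runs the local test. By giving the model a prompt, we can inspect the questions and answers it generates.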

from sparkai.llm.llm import ChatSparkLLM, ChunkPrintHandler
from sparkai.core.messages import ChatMessage

SPARKAI_URL = 'wss://xingchen-api.cn-huabei-1.xf-yun.com/v1.1/chat'
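# Credentials for calling the Spark model; see the Feishu document and look them up in the iFLYTEK fine-tuning console (https://training.xfyun.cn/modelService)
SPARKAI_APP_ID = 'xxxxxxx'
SPARKAI_API_SECRET = 'xxxxxxx'
SPARKAI_API_KEY = 'xxxxxxxxxxxxxxxxxxx'
serviceId = 'xxxxxxxxx'
resourceId = 'xxxxxxxxx'

# Placeholder: replace with a prompt built like the training inputs (see get_prompt_cn / get_prompt_en)
prompt = 'xxxxxxx'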

if __name__ == '__main__':
    spark = ChatSparkLLM(
        spark_api_url=SPARKAI_URL,
        spark_app_id=SPARKAI_APP_ID,
        spark_api_key=SPARKAI_API_KEY,
        spark_api_secret=SPARKAI_API_SECRET,
        spark_llm_domain=serviceId,
        model_kwargs={"patch_id": resourceId},
        streaming=False,
    )
    messages = [ChatMessage(
        role="user",
        content=prompt
    )]
    handler = ChunkPrintHandler()
    a = spark.generate([messages], callbacks=[handler])
    print(a.generations[0][0].text)

The result of the run is shown below:

The output looks as expected!
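
Checking the output by eye works for one prompt. For several prompts, a small format check can help. The sketch below counts answer lines in the generated text, assuming the model reproduces the training format where each question ends with a line such as "答案: B" or "answer:B". check_generated_questions is a hypothetical helper and the sample text is invented.

```python
import re

# An answer line: the label, a colon (half- or full-width), and one letter A-D
ANSWER_LINE = re.compile(r'^\s*(?:答案|answer)\s*[:：]\s*([A-D])\s*$', re.MULTILINE)

def check_generated_questions(text, expected=4):
    """Return (ok, answers); ok is True when exactly `expected` answer lines are found."""
    answers = ANSWER_LINE.findall(text)
    return len(answers) == expected, answers

# Invented model output with two questions
sample = (
    "1.Question one?\nA. a\nB. b\nC. c\nD. d\n答案: B\n"
    "2.Question two?\nA. a\nB. b\nC. c\nD. d\n答案: D\n"
)
print(check_generated_questions(sample, expected=2))
```

This only verifies the shape of the output, not whether the questions and answers are correct, which still needs a human read.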

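Summary

This article walked through the complete workflow from data preparation and model training to local testing, focusing on the code that builds the fine-tuning dataset, and used the iFLYTEK Open Platform to build a Gaokao multiple-choice question generation model based on the Spark 13B language model.

Finally, we tested the model locally by calling the published service API.

I hope this post helps readers who are building similar systems.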

If this article helped you, a like, a favorite, and a follow are the best encouragement for my writing! A star🌟 works too😂.