详解大模型DPO训练数据格式

1. 引入

DPO(Direct Preference Optimization,直接偏好优化) 是 2023 年由斯坦福与谷歌提出的、替代 RLHF 的 LLM 人类对齐算法;核心是无需奖励模型 、无需 PPO,用偏好对数据 结合 特殊损失函数,直接把 SFT 模型微调成符合人类偏好的模型,训练更快、更稳、更省显存。

区别RLHF:

  • RLHF:必须单独训练奖励模型
  • DPO:无需奖励模型,直接用偏好对进行训练

只要按如下格式构造数据(参考1),使用LlamaFactory就能对LLM做DPO训练。

2. DPO数据集整体结构

DPO数据集的核心是"偏好对",即针对同一个"用户输入/对话场景",提供两个候选回答------chosen(选中/偏好回答)和rejected(拒绝/非偏好回答)。

参考1的数据集是一个JSON数组,数组中的每一个元素(JSON对象),就是一条完整的DPO训练样本。每条样本都包含3个核心字段,结构如下(简化版):

json 复制代码
{
  "conversations": [/* 对话上下文,含用户/系统输入 */],
  "chosen": {/* 人类偏好的模型回答 */},
  "rejected": {/* 人类排斥的模型回答 */}
}

核心逻辑含义:让模型学习"在conversations描述的场景下,输出chosen而非rejected"。不同于SFT(监督微调)的"单输入-单输出",DPO样本的核心是"对比 "------通过同一上下文下的好坏回答对比,让模型捕捉人类偏好的差异(比如:更详细更贴合需求更准确)。

"conversations"字段是一个数组,存储当前对话的完整上下文,包括用户输入(human)、系统提示(system,可选),甚至是多轮对话中的历史交互(如第5条样本的多轮问答)。它的作用是给模型提供"场景背景",明确"在什么情况下,需要区分chosen和rejected"。

3. 多条真实数据样例

按照上面介绍的"conversations,chosen,rejected"的三个字段含义,从参考1中,选出几个不同格式的样例。

3.1 基础格式

json 复制代码
  {
    "conversations": [
      {
        "from": "human",
        "value": "Hi! I'd like to create a new language game simulating the first person perspective of a character named Angela."
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "That sounds like a fun and engaging idea! Here are some tips to help you create the game:\n1. Start with the character's name and background: Before creating the game, it's important to have a clear idea of who Angela is and where she comes from. What is her age, gender, occupation, and interests? Knowing these details will help you create a relatable and realistic character that players can connect with.\n2. Decide on the objective of the game: What is the main objective of the game? Is it to help Angela navigate through a challenging environment, solve puzzles, or interact with other characters? Once you have a clear objective, you can start designing the game mechanics.\n3. Design the game mechanics: The game mechanics are the rules and controls that players use to interact with the game environment. Consider the camera angle, movement, interaction with objects, and any other features that will enhance the player's experience.\n4. Create a storyline: A storyline can add depth to the game and make it more engaging. Consider the obstacles that Angela will face, the challenges she will encounter, and the rewards she will receive.\n5. Add sound effects and music: Sound effects and music can help create atmosphere and enhance the player's experience. Consider using appropriate sound effects for each environment and adding music that fits the storyline.\n6. Test and refine: Once you have created the game, it's important to test it and get feedback from players. Use this feedback to refine the game and make it better.\nWith these tips, you should be well on your way to creating a fun and engaging language game simulating the first person perspective of Angela. Good luck!"
    },
    "rejected": {
      "from": "gpt",
      "value": "Hello! I'd be happy to help you create a language game simulating the first-person perspective of a character named Angela! What kind of game are you imagining? Do you have any specific details or themes in mind for the game, such as a setting, plot, or style? The more information you can provide, the better I can assist you. Additionally, please let me know if you have any specific preferences for the language or tone of the game. I'll do my best to suggest engaging and accurate responses as your helpful, respectful, and honest assistant! 😊"
    }
  },

这个数据中,human提供了具体的任务,from固定为gpt,表示这是模型输出。chosen的内容逼格rejected的要详细。

3.2 基础格式2

json 复制代码
  {
    "conversations": [
      {
        "from": "system",
        "value": "You are a teacher. Given a task, you explain in simple steps what the task is asking, any guidelines it provides and how to use those guidelines to find the answer."
      },
      {
        "from": "human",
        "value": "Q: In this task, you need to provide the parts-of-speech tag of a word present in a sentence specified within curly braces  ( '{{ ... }}' ). The parts-of-speech tags are coarse labels that represent a category of words with similar grammatical properties. The list of part-of-speech tags i.e tagset of this corpus is - \n '.': Period symbol is used for symbols denoting Punctuations/Separations such as comma, period, backticks etc., \n 'ADJ': Adjectives are words that typically modify nouns and specify their properties or attributes, \n 'ADP': Adposition is a cover term for prepositions and postpositions, \n 'ADV': Adverbs are words that typically modify verbs for such categories as time, place, direction or manner, \n 'CONJ': A word used to connect clauses or sentences or to coordinate words in the same clause, \n 'DET': Determiners are words that modify nouns or noun phrases and express the reference of the noun phrase in context, \n 'NOUN': Nouns are a part of speech typically denoting a person, place, thing, animal or idea, \n 'NUM': A numeral is a word, functioning most typically as a determiner, adjective or pronoun, that expresses a number and a relation to the number, such as quantity, sequence, frequency or fraction, \n 'PRT': Particles are function words that must be associated with another word or phrase to impart meaning and that do not satisfy definitions of other universal parts of speech, \n 'PRON': Pronouns are words that substitute for nouns or noun phrases, whose meaning is recoverable from the linguistic or extralinguistic context, \n 'PROPN': A proper noun is a noun (or nominal content word) that is the name (or part of the name) of a specific individual, place, or object, \n 'VERB': A verb is a member of the syntactic class of words that typically signal events and actions, can constitute a minimal predicate in a clause, and govern the number and types of other constituents which may occur in the clause, \n 'X': The tag X is used for words that for some reason cannot be assigned a real part-of-speech category.\nSentence: `` I'm not worried about it '' , Phil said {{ . }} \nWord: .\nA:"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "In this task, you are asked to identify the part-of-speech (POS) tag for a word in a given sentence. The word is specified within curly braces ({{...}}). The corpus follows a predefined tagset that indicates different parts-of-speech categories.\n\nTo complete this task:\n\n1. Carefully read the list of part-of-speech tags provided.\n2. Analyze the sentence given, paying attention to the word within the curly braces: \"{{ . }}\"\n3. Look through the tagset and determine which tag best fits the word within the curly braces.\n4. Provide the appropriate part-of-speech tag for the word.\n\nFor the given sentence: \"I'm not worried about it\", Phil said {{ . }}\nWord: .\n\nUsing the guidelines, we can see that the word '.' is a punctuation symbol. According to the tagset:\n'.': Period symbol is used for symbols denoting Punctuations/Separations such as comma, period, backticks, etc.\n\nSo, the part-of-speech tag for the word '.' is: '.'"
    },
    "rejected": {
      "from": "gpt",
      "value": "Sure, I'd be happy to help! Here are the steps to find the part of speech of the word in the given sentence:\n\nStep 1: Read the sentence and identify the word you want to find the part of speech for. In this case, the word is \"it\".\n\nStep 2: Look at the list of part-of-speech tags provided in the task. In this case, we have the following tags:\n\n'.': Period symbol (punctuation)\nADJ: Adjective\nADP: Adposition (preposition or postposition)\nADV: Adverb\nCONJ: Conjunction\nDET: Determiner\nNOUN: Noun\nNUM: Numeral\nPRT: Particle\nPRON: Pronoun\nPROPN: Proper noun\nVERB: Verb\nX: Unclassified word (for words that cannot be assigned a real part-of-speech category)\n\nStep 3: Look at the word \"it\" in the sentence and determine which category it best fits into. In this case, \"it\" is a pronoun, so the part of speech is PRON.\n\nTherefore, the part of speech of the word \"it\" in the sentence \"I'm not worried about it\", Phil said is PRON."
    }
  }

这条数据中,system表示系统提示词,用于定义模型角色,这里将角色定义为老师。其他字段与3.1一致。这里chosen的内容,回答准确,符合角色设定。rejected的内容,偏离用户需求,误将目标词 "." 换成 "it",导致后续词性判断完全错误(将 "it" 标注为 PRON,与用户要求的 "标注'.'的词性" 无关)。

3.3 多轮对话格式

json 复制代码
    "conversations": [
      {
        "from": "human",
        "value": "Who wrote the classic novel \"Pride and Prejudice\", published in 1813, that explores issues of manners, upbringing, morality, education, and marriage in the society of the landed gentry of early 19th-century England?"
      },
      {
        "from": "gpt",
        "value": "Jane Austen"
      },
      {
        "from": "human",
        "value": "Elaborate on the societal norms and values depicted in Jane Austen's \"Pride and Prejudice\"."
      },
      {
        "from": "gpt",
        "value": "\"Pride and Prejudice\" provides a vivid portrayal of early 19th-century English society, particularly the social class of the landed gentry, which is a class of people who were wealthy from owning land but did not have noble titles. \n\nThe novel explores the importance of marriage as a social and economic institution. It was often a woman's only means of securing a comfortable life, and marriages were frequently arranged based on wealth and social status rather than love or compatibility. This is seen in the character of Mrs. Bennet, who is preoccupied with marrying off her daughters to wealthy men.\n\nThe novel also illustrates the significance of reputation and decorum. The characters are constantly aware of the need to behave in ways that are socially acceptable and that maintain their social standing. Any scandal or impropriety could have severe consequences for a family's reputation and marriage prospects, as seen in the incident with Lydia Bennet.\n\nEducation is another theme in the novel. Women of this period were expected to be accomplished, but their education often focused on superficial accomplishments like music, drawing, and dancing, rather than intellectual pursuits. \n\nLastly, the novel explores the issues of pride and prejudice. Many characters, like Mr. Darcy and Elizabeth Bennet, have to overcome their initial judgments and prejudices about each other. The novel suggests that a balance of pride and humility is necessary for a successful relationship.\n\nIn all these ways, \"Pride and Prejudice\" offers a detailed and nuanced exploration of the societal norms and values of its time."
      },
      {
        "from": "human",
        "value": "Drawing from the societal norms and values discussed, how might the characters' decisions and actions in \"Pride and Prejudice\" be viewed differently if the novel was set in the present day?"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "If \"Pride and Prejudice\" was set in the present day, many of the characters' decisions and actions would be viewed differently based on modern societal norms and values. Some key differences could include:\n\n1. Marriage: In the current era, marriage is no longer primarily seen as a means of economic security or to elevate one's social status. Instead, individuals place a greater emphasis on personal happiness, emotional compatibility, and shared values in choosing a life partner. While some might still consider wealth and social status when estimating compatibility, these are no longer the primary focus for most people.\n\n2. Reputation and decorum: Today, society places more importance on genuine self-expression, individuality, and open communication. While maintaining a good reputation is still important, it is not as heavily reliant on adherence to strict societal norms and decorum. There is a greater tolerance for diverse behaviors and differing perspectives than in the early 19th century.\n\n3. Education: Today, a well-rounded education typically emphasizes both superficial accomplishments and intellectual pursuits. Men and women alike are encouraged to pursue a broad range of educational opportunities as a means to prepare them for diverse careers and personal growth.\n\n4. Proudness and humility: In modern times, a balance of pride and humility is, like in the past, essential for healthy relationships and personal growth. However, there is arguably greater emphasis on openness, self-reflection, and emotional intelligence, which often entail acknowledging and addressing one's weaknesses and mistakes in relationships and personal life.\n\n5. Social class: Today, social mobility and the blurring of societal boundaries have diluted the rigid class system that dominated early 19th-century England. Class inequality and privilege still exist, but they are challenged and discussed more openly. Many would view the entrenched social barriers in \"Pride and Prejudice\" more critically in the contemporary context.\n\nIn summary, while some themes and values remain relevant, the specifics of societal norms and values would make the characters' decisions and actions in \"Pride and Prejudice\" significantly different if the novel were set in the present day."
    },
    "rejected": {
      "from": "gpt",
      "value": "In a contemporary setting, the characters' decisions and actions in \"Pride and Prejudice\" might be viewed quite differently, given the significant societal changes that have occurred since the early 19th century.\n\nThe emphasis on marriage as a social and economic institution, for instance, might seem outdated or overly restrictive. Today, women have more options for economic independence and are not as reliant on marriage for financial security. Mrs. Bennet's preoccupation with marrying off her daughters to wealthy men might be seen as materialistic and controlling.\n\nThe strict adherence to social decorum and the importance of reputation might also seem excessive in today's more relaxed social climate. The scandal involving Lydia Bennet, for example, would likely not have the same severe consequences in today's society.\n\nThe limited education offered to women might be seen as sexist and unfair. In today's world, women have equal access to education and are encouraged to pursue intellectual pursuits.\n\nThe prejudices and initial judgments of characters like Mr. Darcy and Elizabeth Bennet might be seen as close-minded. In today's society, there is a greater emphasis on understanding and accepting people's differences.\n\nHowever, the themes of pride, prejudice, and the need for a balance of pride and humility in relationships are timeless and would likely still resonate with contemporary readers."
    }
  }

该数据中conversations字段包括了多轮对话的问题和答案,多轮对话的历史记录,为最后一个问题提供了上下文。其他字段含义与前面一致。

4. 医疗循证DPO数据集

参考2中,有一个医疗循证DPO数据集,格式示例如下:

json 复制代码
{
  "prompt": "在阿尔茨海默病与溃疡性结肠炎患者中,PPARG 和 NOS2 作为共同基因,是否通过调控巨噬细胞和小胶质细胞极化参与疾病的发生发展?",
  "chosen": "从目前的人类与动物实验证据来看,PPARG 和 NOS2 很有可能作为共同炎症枢纽基因,通过调控巨噬细胞/小胶质细胞的极化状态参与阿尔茨海默病和溃疡性结肠炎的发生发展...",
  "rejected": "这是一个非常具体且专业的问题,涉及到两种疾病的共同机制。根据现有的生物医学研究,我们可以进行一个基于科学逻辑的推理和分析..."
}

这个数据格式,本质上也和第2节提到的一致:

  • prompt:表示医学问题或查询
  • chosen:高质量回答(作为偏好目标)
  • rejected: 低质量回答(作为负样本)

5. 参考

  1. https://github.com/hiyouga/LlamaFactory/blob/main/data/dpo_en_demo.json
  2. https://www.modelscope.cn/datasets/modelzhang/medical_evidence_DPO
相关推荐
深小乐2 小时前
AI 时代,心力即修为
人工智能
你好潘先生2 小时前
让 AI 任务不丢进度:YeeroAI 后台续跑与全局快捷操作实践
前端·人工智能·后端
Together_CZ2 小时前
DTSemNet :Vanilla Gradient Descent for Oblique Decision Trees——用于倾斜决策树的普通梯度下降
算法·决策树·机器学习·vanilla·gradient·dtsemnet·用于倾斜决策树的普通梯度
Zenith_of_Serenity2 小时前
关于一个新手小白靠claude帮助下的全栈留言板项目开发
人工智能
@蔓蔓喜欢你2 小时前
Git最佳实践:团队协作的基石
人工智能·ai
dayuOK63072 小时前
内容创作者的“第二大脑”:AI如何重塑从灵感到发布的效率链?
人工智能·职场和发展·自动化·新媒体运营·媒体
Henry-SAP2 小时前
BOM层级传递与MRP计划生成实战解析
人工智能·sap·erp
2601_957888562 小时前
2026年GEO生成式引擎优化:当AI成为信息入口,品牌如何拿到“答案资格“?
大数据·人工智能
文歌子2 小时前
认识 Prithvi:NASA × IBM 的遥感基础模型
人工智能·深度学习