Fine-tuning a large language model mainly involves the following stages:
- Supervised fine-tuning (SFT).
- Reward / preference modeling (RM).
- Reinforcement learning from human feedback (RLHF).
The accompanying code is available on GitHub: github.com/night-is-yo...
This series provides implementations for four models:
- baichuan
- chatglm3
- qwen
- yi
This post focuses on the second stage: reward / preference modeling.
The official reward-training example: github.com/huggingface...
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, HfArgumentParser
from trl import (
    ModelConfig,
    RewardConfig,
    RewardTrainer,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)

parser = HfArgumentParser((RewardConfig, ModelConfig))
reward_config, model_config = parser.parse_args_into_dataclasses()
reward_config.gradient_checkpointing_kwargs = dict(use_reentrant=False)

################
# Model & Tokenizer
################
torch_dtype = (
    model_config.torch_dtype
    if model_config.torch_dtype in ["auto", None]
    else getattr(torch, model_config.torch_dtype)
)
quantization_config = get_quantization_config(model_config)
model_kwargs = dict(
    revision=model_config.model_revision,
    trust_remote_code=model_config.trust_remote_code,
    device_map=get_kbit_device_map() if quantization_config is not None else None,
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
tokenizer = AutoTokenizer.from_pretrained(model_config.model_name_or_path, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_config.model_name_or_path, num_labels=1, **model_kwargs
)

################
# Dataset
################
raw_datasets = load_dataset("Anthropic/hh-rlhf")

# Tokenize chosen/rejected pairs of inputs
# Adapt this section to your needs for custom datasets
def preprocess_function(examples):
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        tokenized_chosen = tokenizer(chosen)
        tokenized_rejected = tokenizer(rejected)

        new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
        new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])

    return new_examples

# Preprocess the dataset and filter out examples that are longer than args.max_length
raw_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    num_proc=4,
)
raw_datasets = raw_datasets.filter(
    lambda x: len(x["input_ids_chosen"]) <= reward_config.max_length
    and len(x["input_ids_rejected"]) <= reward_config.max_length
)
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]

################
# Training
################
trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=reward_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=get_peft_config(model_config),
)
trainer.train()
trainer.save_model(reward_config.output_dir)
```
Note that the official Hugging Face example only takes a model name and a dataset name as arguments:
- the model is initialized automatically when setting up RewardTrainer,
- the data is loaded automatically with load_dataset.
In practice, it is recommended to write a custom Dataset to load your own data and to initialize the model manually, as sketched below.
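A minimal sketch of that approach, assuming the preference data is already a list of chosen/rejected text pairs. The class name PairwiseRewardDataset and the checkpoint path are illustrative, and whether AutoModelForSequenceClassification loads a given custom model directly depends on that model's remote code:

```python
# A sketch only: names, paths and hyper-parameters here are illustrative.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class PairwiseRewardDataset(Dataset):
    """Yields the four fields that RewardTrainer's collator expects."""

    def __init__(self, pairs, tokenizer, max_length=1024):
        self.pairs = pairs              # list of dicts: {"chosen": str, "rejected": str}
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        pair = self.pairs[idx]
        chosen = self.tokenizer(pair["chosen"], truncation=True, max_length=self.max_length)
        rejected = self.tokenizer(pair["rejected"], truncation=True, max_length=self.max_length)
        return {
            "input_ids_chosen": chosen["input_ids"],
            "attention_mask_chosen": chosen["attention_mask"],
            "input_ids_rejected": rejected["input_ids"],
            "attention_mask_rejected": rejected["attention_mask"],
        }


# Manual model initialization: a sequence-classification head with a single score output.
model_name_or_path = "path/to/your/base/model"  # e.g. a local baichuan / chatglm3 / qwen / yi checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path, num_labels=1, trust_remote_code=True, torch_dtype=torch.bfloat16
)
```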
To summarize, reward training consists of the following steps (a condensed sketch follows this list):
- load the model with AutoModelForSequenceClassification
- load the data
- initialize RewardTrainer
- train
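Wiring those steps together with the custom dataset sketched above (again only an illustrative sketch, not the official script; the RewardConfig values and train_pairs are placeholders):

```python
from trl import RewardConfig, RewardTrainer

reward_config = RewardConfig(
    output_dir="reward_model",
    per_device_train_batch_size=2,
    max_length=1024,
)

train_pairs = [{"chosen": "a helpful reply", "rejected": "an unhelpful reply"}]  # toy data
trainer = RewardTrainer(
    model=model,                      # the manually initialized model from the sketch above
    tokenizer=tokenizer,
    args=reward_config,
    train_dataset=PairwiseRewardDataset(train_pairs, tokenizer),
)
trainer.train()
trainer.save_model(reward_config.output_dir)
```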
RewardTrainer source code walkthrough
Loss computation
```python
def compute_loss(
    self,
    model: Union[PreTrainedModel, nn.Module],
    inputs: Dict[str, Union[torch.Tensor, Any]],
    return_outputs=False,
) -> Union[torch.Tensor, Tuple[torch.Tensor, Dict[str, torch.Tensor]]]:
    # Score the chosen and rejected responses with the same reward model.
    rewards_chosen = model(
        input_ids=inputs["input_ids_chosen"],
        attention_mask=inputs["attention_mask_chosen"],
        return_dict=True,
    )["logits"]
    rewards_rejected = model(
        input_ids=inputs["input_ids_rejected"],
        attention_mask=inputs["attention_mask_rejected"],
        return_dict=True,
    )["logits"]
    # Pairwise ranking loss (simplified excerpt: margin and return_outputs handling omitted).
    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
    return loss
```
Some code has been trimmed from the excerpt above. The loss itself is simple: take the difference between the chosen and rejected scores, pass it through a sigmoid, take the logarithm, negate it, and average over the batch.
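As a quick sanity check (toy scores, not from the source), the per-pair loss is -log(sigmoid(r_chosen - r_rejected)), which shrinks toward 0 as the chosen score pulls ahead of the rejected one:

```python
import torch
import torch.nn.functional as F

rewards_chosen = torch.tensor([1.5])    # toy score for the preferred response
rewards_rejected = torch.tensor([0.3])  # toy score for the rejected response

# -log(sigmoid(r_chosen - r_rejected)): the pairwise (Bradley-Terry style) negative log-likelihood
loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss.item())  # ≈ 0.26; a larger margin drives the loss toward 0
```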
Data collation: RewardDataCollatorWithPadding source walkthrough
RewardTrainer collates batches with RewardDataCollatorWithPadding; its source is analyzed below.
```python
@dataclass
class RewardDataCollatorWithPadding:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"
```
When preparing your own data for training, each example must therefore contain the following fields:
- input_ids_chosen
- input_ids_rejected
- attention_mask_chosen
- attention_mask_rejected
```python
def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
    features_chosen = []
    features_rejected = []
    margin = []
    # check if we have a margin. If we do, we need to batch it as well
    has_margin = "margin" in features[0]
    for feature in features:
        # check if the keys are named as expected
        if (
            "input_ids_chosen" not in feature
            or "input_ids_rejected" not in feature
            or "attention_mask_chosen" not in feature
            or "attention_mask_rejected" not in feature
        ):
            raise ValueError(
                "The features should include `input_ids_chosen`, `attention_mask_chosen`, `input_ids_rejected` and `attention_mask_rejected`"
            )

        features_chosen.append(
            {
                "input_ids": feature["input_ids_chosen"],
                "attention_mask": feature["attention_mask_chosen"],
            }
        )
        features_rejected.append(
            {
                "input_ids": feature["input_ids_rejected"],
                "attention_mask": feature["attention_mask_rejected"],
            }
        )
        if has_margin:
            margin.append(feature["margin"])
```
Finally, the chosen and rejected features are padded separately and assembled into a single batch.
```python
batch_chosen = self.tokenizer.pad(
    features_chosen,
    padding=self.padding,
    max_length=self.max_length,
    pad_to_multiple_of=self.pad_to_multiple_of,
    return_tensors=self.return_tensors,
)
batch_rejected = self.tokenizer.pad(
    features_rejected,
    padding=self.padding,
    max_length=self.max_length,
    pad_to_multiple_of=self.pad_to_multiple_of,
    return_tensors=self.return_tensors,
)
batch = {
    "input_ids_chosen": batch_chosen["input_ids"],
    "attention_mask_chosen": batch_chosen["attention_mask"],
    "input_ids_rejected": batch_rejected["input_ids"],
    "attention_mask_rejected": batch_rejected["attention_mask"],
    "return_loss": True,
}
if has_margin:
    margin = torch.tensor(margin, dtype=torch.float)
    batch["margin"] = margin
return batch
```
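To make the collator's contract concrete, here is a small usage sketch. It assumes RewardDataCollatorWithPadding can be imported from trl.trainer.utils (the exact import path may differ across trl versions) and uses the gpt2 tokenizer only because it is small enough for a quick demo:

```python
from transformers import AutoTokenizer
from trl.trainer.utils import RewardDataCollatorWithPadding  # import path may vary by trl version

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

collator = RewardDataCollatorWithPadding(tokenizer=tokenizer)

def to_feature(chosen: str, rejected: str) -> dict:
    """Build one example with the four required keys (normally done in preprocess_function)."""
    tok_chosen, tok_rejected = tokenizer(chosen), tokenizer(rejected)
    return {
        "input_ids_chosen": tok_chosen["input_ids"],
        "attention_mask_chosen": tok_chosen["attention_mask"],
        "input_ids_rejected": tok_rejected["input_ids"],
        "attention_mask_rejected": tok_rejected["attention_mask"],
    }

features = [
    to_feature("Sure, here is a step-by-step answer.", "I don't know."),
    to_feature("That depends; could you clarify the question?", "No."),
]

batch = collator(features)
print(batch["input_ids_chosen"].shape)    # (2, longest chosen sequence); each side is padded separately
print(batch["input_ids_rejected"].shape)  # (2, longest rejected sequence)
print(batch["return_loss"])               # True
```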