Kaggle - LLM Science Exam (Part 1): Competition Overview, Data Collection, BERT Baseline

Contents

- I. Competition Overview
  - 1.1 OpenBookQA Dataset
  - 1.2 Competition Background
  - 1.3 Evaluation and Code Requirements
  - 1.4 Competition Dataset
  - 1.5 Notable Notebooks
- II. BERT Baseline
  - 2.1 Data Preprocessing
  - 2.2 Defining the data_collator
  - 2.3 Loading the Model, Configuring the Trainer, and Training
  - 2.4 Predicting and Submitting
  - 2.5 Related Improvements

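Preface: I stayed home over the National Day holiday, reinstalled Windows 10 along with my conda and Python environments, and then spent the time working through the best notebooks from the Kaggle - LLM Science Exam competition, hoping to learn something from them.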

I. Competition Overview

1.1 OpenBookQA Dataset

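The OpenBookQA Dataset is a question-answering benchmark released by the Allen Institute for AI. It uses multiple-choice exams to test and evaluate how well AI systems answer questions. More detail follows.

Background

Many earlier reading-comprehension datasets are extractive: the answer only has to be pulled out of the given context, with no deeper reasoning required. OpenBookQA instead requires the model to draw on background knowledge and carry out more complex reasoning.

Dataset composition

OpenBookQA contains 5,957 four-way multiple-choice questions on science common knowledge (4,957 train, 500 dev, 500 test). The questions are to be answered with the help of a small "book" of 1,326 science facts. The questions are sampled from Wikipedia pages.

Model performance

Answering OpenBookQA questions takes not only the science facts in the provided knowledge base but also additional broad common knowledge. The questions cannot be answered correctly by a retrieval algorithm or by a word co-occurrence algorithm. Strong neural baselines reach only about 50% accuracy on OpenBookQA, well below the 92% achieved by humans.

Additional data

The dataset also provides 5,167 crowd-sourced common-knowledge facts, plus extended train, dev, and test sets in which each question is annotated with the core science fact it probes, human accuracy, a clarity score, and similar information.

Significance

OpenBookQA pushed machine reading comprehension from extraction toward reasoning, and it evaluates a model's deeper understanding and reasoning over open-domain knowledge.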

1.2 Competition Background

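Competition page: Kaggle - LLM Science Exam

LLM capabilities: As the capabilities of large language models keep expanding, a growing area of research uses LLMs to characterize themselves. Because many existing NLP benchmarks are already trivial for state-of-the-art models, there has been interesting work on using LLMs to create more challenging tasks for testing more powerful models.
Data generation: The questions were generated with gpt3.5. The model was given text snippets on a range of scientific topics pulled from Wikipedia and asked to write a multiple-choice question (with a known answer); easy questions were then filtered out.
Limited resources: This is a code competition, so both GPU and run time are limited.
The challenge: Although techniques such as quantization and knowledge distillation can effectively shrink language models to run on more modest hardware, the competition remains challenging. Currently, the largest models run on Kaggle have around 10 billion parameters, whereas gpt3.5 has 175 billion. If a question-answering model can ace a test written by a model more than 10 times its size, that would be a genuinely interesting result. On the other hand, if the larger model can effectively stump the smaller one, this has compelling implications for the ability of LLMs to benchmark and test themselves.
The competition asks whether question-answering models more than 10 times smaller than gpt3.5 can effectively answer questions written by gpt3.5. The results will shed light on the benchmarking and self-testing abilities of LLMs.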

1.3 Evaluation and Code Requirements

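Submissions are evaluated on Mean Average Precision @ 3 (MAP@3):

$$\text{MAP@3} = \frac{1}{U} \sum_{u=1}^{U} \sum_{k=1}^{\min(n,3)} P(k) \times rel(k)$$

where $U$ is the number of questions in the test set, $P(k)$ is the precision at cutoff $k$, $n$ is the number of predictions per question, and $rel(k)$ is an indicator function equal to 1 if the item at rank $k$ is the relevant (correct) label and 0 otherwise.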

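In addition, once a question's correct label has been scored, further predictions of that label are skipped, which prevents inflating the score by repeating it. For example, suppose a test set has 3 questions whose correct answer is A in each case. If a model gives the following predictions, every one of them scores an average precision of 1.0:

[A, B, C, D, E] # prediction for question 1
[A, A, A, A, A] # prediction for question 2
[A, B, A, C, A] # prediction for question 3

In other words, once the correct answer (A) has been found, the predictions after it no longer affect the average precision.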
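
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not Kaggle's official scorer; the function names `average_precision_at_3` and `map_at_3` are made up for this example, and it assumes exactly one correct label per question, so the score for a question is simply 1/k for the first rank k (within the top 3) that holds the correct label.

```python
def average_precision_at_3(predictions, correct):
    """AP@3 for one question: 1/k for the first rank k (k <= 3) holding the correct label."""
    for k, label in enumerate(predictions[:3], start=1):
        if label == correct:
            # Later repeats of the correct label are skipped, so we stop here
            return 1.0 / k
    return 0.0


def map_at_3(all_predictions, all_correct):
    """Mean of the per-question AP@3 values."""
    scores = [average_precision_at_3(p, c) for p, c in zip(all_predictions, all_correct)]
    return sum(scores) / len(scores)


# The three predictions from the example above, each with correct answer A
example_predictions = [
    ["A", "B", "C", "D", "E"],
    ["A", "A", "A", "A", "A"],
    ["A", "B", "A", "C", "A"],
]
example_correct = ["A", "A", "A"]
example_score = map_at_3(example_predictions, example_correct)
print(example_score)
```

A prediction such as `["B", "A", "C"]` for a question whose answer is A would score 0.5 under this sketch, and a prediction that misses A in its first three slots would score 0.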

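Submissions must be made through a notebook, and the notebook must finish in under 9 hours of run time, whether on CPU or GPU. Internet access is disabled, but publicly available external data is allowed, including pre-trained models. The submission file must be named submission.csv.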

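1.4 Competition Dataset

The task is to answer a test set of 4,000 multiple-choice questions generated by gpt3.5. The test set is hidden: the real test data is only used for scoring after the notebook has been submitted.

train.csv: 200 samples with questions and answers, provided to show the data format and give a rough idea of the question types in the test set.
test.csv: the test set; it contains only the questions, with the answers omitted.
sample_submission.csv: a sample submission file in the correct format.

The training set has the following format: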

# Let's import the public training set and take a look
import pandas as pd

train_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/train.csv')
train_df.head()

For each id in the test set, you may predict up to 3 labels. The submission.csv file should contain a header and have the following format:

id,prediction
0,A B C
1,B C A
2,C A B
etc.
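
As a small illustration of that layout, the sketch below writes a few rows with Python's standard `csv` module into an in-memory buffer. The helper name `write_submission` and the three example rows are assumptions made for this example; a real submission would write to a file named submission.csv instead of a `StringIO` buffer.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (id, ranked_labels) rows in the `id,prediction` layout."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "prediction"])
    for question_id, ranked_labels in rows:
        if len(ranked_labels) > 3:
            raise ValueError("at most 3 labels may be predicted per question")
        # Labels are separated by single spaces inside the prediction field
        writer.writerow([question_id, " ".join(ranked_labels)])


buffer = io.StringIO()
write_submission(
    [(0, ["A", "B", "C"]), (1, ["B", "C", "A"]), (2, ["C", "A", "B"])],
    buffer,
)
submission_text = buffer.getvalue()
print(submission_text)
```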

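1.5 Notable Notebooks

1. "Starter Notebook: Ranked Predictions with BERT": the BERT baseline. It trains bert-base-cased on the 200 training samples provided by the competition. Public Score=0.545.
2. "[EDA, Data gathering] LLM-SE ~ Wiki STEM | 1k DS": the 200 provided samples are too few, so the author, LEONID KULYK, first analyzed the competition dataset and then, also using gpt3.5, collected 1,000 Wikipedia samples, released as Wikipedia STEM 1k.
3. "LLM-SE ~ deberta-v3-large -i | 1k Wiki": LEONID KULYK trains deberta-v3-large on his own 1,000 Wikipedia samples together with the competition training set. The notebook includes the final model weights, so inference can be run directly. LB=0.709.
4. "New dataset + DEBERTA v3 large training!": 0.723→0.759
   - Radek builds on LEONID KULYK's work and trains DEBERTA v3 large with 500 additional examples he generated himself. Public Score=0.723.
   - The author later generated another 6,000 examples, merged everything into a 6.5K dataset, and trained on it to obtain three sets of model weights, uploaded as Science Exam Trained Model Weights. In "Inference using 3 trained Deberta v3 models", averaging the probabilities predicted by the three models gives Public Score=0.737, while a Voting Ensemble gives Public Score=0.759.
   - The author finally uploaded 15k high-quality train examples.
5. "Open Book LLM Science Exam": jjinho was the first to propose the Open Book approach.
6. "Open Book LLM Science Exam - Reduced RAM usage": quangbk improves the memory efficiency of jjinho's approach.
7. "OpenBook DeBERTaV3-Large Baseline (Single Model)": Anil combines quangbk's Open Book approach with Radek's DEBERTA v3 large training. Public Score=0.771.
8. "Sharing my trained-with-context model": Mgoksu fine-tunes the DeBerta large model from ANIL's approach on a dataset he built himself. Top public LB=0.807.
9. "How To Train Open Book Model - Part 1" and "How To Train Open Book Model - Part 2": CHRIS DEOTTE builds on mgoksu's work, adds his own 60k dataset for training, and sets NUM_TITLES_INCLUDE = 5 and NUM_SENTENCES_INCLUDE = 20. Public Score=0.819.
10. "LLM Science Exam Optimise Ensemble Weights": mainly based on CHRIS DEOTTE's work and using the model weights he trained. To add diversity, the author also blends in several deberta-v3-large models that do not use Open Book. Public Score=0.837. The same author also wrote "Using DeepSpeed with HF🤗 Trainer", among others.
11. "LLM-SciEx Optimise Ensemble Weights(better models)": model ensembling. Public Score=0.846.
12. "with only 270K articles": the author builds a 270K Wikipedia dataset and trains a LongFormer model on it. Public Score=0.862.
13. "Platypus2-70B with Wikipedia RAG": SIMJEG combines approaches 8 and 12 above, for a final Public Score=0.872. ALI gives a detailed walkthrough of SIMJEG's notebook in "Explained Platypus2-70B + Wikipedia RAG".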

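II. BERT Baseline

This part follows "Starter Notebook: Ranked Predictions with BERT". The author trains bert-base-cased directly on the 200 training samples for 3 epochs and then runs inference. Most of the code is adapted from the official HF documentation page "Multiple choice".

2.1 Data Preprocessing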

import pandas as pd
from datasets import Dataset

train_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/train.csv')
train_ds = Dataset.from_pandas(train_df)
train_df.head()

from transformers import AutoTokenizer

model_dir = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_dir)

options = 'ABCDE'
indices = list(range(5))
option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}

def preprocess(example):
    # AutoModelForMultipleChoice expects question/answer pairs, so the question is repeated 5 times
    first_sentence = [example['prompt']] * 5
    second_sentence = []
    # Loop over the options (A to E) and append each one to second_sentence
    for option in options:
        second_sentence.append(example[option])

    tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)
    # Map the answer letter to its index and store it in tokenized_example as the label
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

# Apply the preprocessing function to the training set with map, dropping the columns that are no longer needed
tokenized_train_ds = train_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
print(tokenized_train_ds[0])

{'id': 1, 'input_ids': [[101, 5979, ...], [101, 5979, ...], [101, 5979, ...], [101, 5979, ...], [101, 5979, ...]], 'token_type_ids': [[0, 0, ...], [0, 0, ...],[0, 0, ...],[0, 0, ...],[0, 0, ...]], 'attention_mask': [[1, 1,...],[1, 1,...],[1, 1,...],[1, 1,...],[1, 1,...]], 'label': 0}

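As shown, each sample's question is repeated 5 times and paired with the 5 options. In the tokenized result, input_ids, token_type_ids, and attention_mask are each nested lists with 5 elements, so one sample is effectively split into 5 sequences.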
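
To make the pairing concrete without downloading a tokenizer, here is a minimal sketch. `build_pairs` and `toy_encode` are hypothetical helpers written for this illustration: `toy_encode` only imitates the `[CLS] question [SEP] option [SEP]` layout of a BERT-style pair encoding by splitting on whitespace, and the example row is invented rather than taken from train.csv.

```python
OPTIONS = "ABCDE"


def build_pairs(example):
    """Pair the prompt with each option, mirroring what preprocess() passes to the tokenizer."""
    first_sentences = [example["prompt"]] * len(OPTIONS)
    second_sentences = [example[option] for option in OPTIONS]
    return list(zip(first_sentences, second_sentences))


def toy_encode(first, second):
    """Whitespace stand-in for a BERT-style pair encoding (not the real tokenizer)."""
    first_tokens = ["[CLS]"] + first.split() + ["[SEP]"]
    second_tokens = second.split() + ["[SEP]"]
    return {
        "tokens": first_tokens + second_tokens,
        # Segment 0 covers the question, segment 1 covers the option
        "token_type_ids": [0] * len(first_tokens) + [1] * len(second_tokens),
        "attention_mask": [1] * (len(first_tokens) + len(second_tokens)),
    }


example = {  # hypothetical row, for illustration only
    "prompt": "Which planet is largest?",
    "A": "Jupiter", "B": "Mars", "C": "Venus", "D": "Mercury", "E": "The Moon",
    "answer": "A",
}
encoded = [toy_encode(first, second) for first, second in build_pairs(example)]
label = OPTIONS.index(example["answer"])

for item in encoded:
    print(item["tokens"])
print("label:", label)
```

All five sequences share the same question prefix and differ only in the option segment, which is why the five inner lists of input_ids in the printed sample all begin with the same ids.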

For details on padding and truncation, see the official documentation page "Padding and truncation".

2.2 Defining the data_collator

# The data collator comes from https://huggingface.co/docs/transformers/tasks/multiple_choice
# Question/answer pairs are padded dynamically within each batch, so they need not all be padded to the model's maximum sequence length
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        # features is a batch of 4 samples (batch size=4)
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        # pop removes the label entry from each sample (a dict) and returns the removed value,
        # so each feature loses its label and labels becomes the list of the four labels, e.g. [0, 0, 1, 0]
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)                   # batch size
        num_choices = len(features[0]['input_ids'])  # number of options
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch

Input: features holds four samples, each in the same format as tokenized_train_ds[0]:

[{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...], 'label': 0}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...], 'label': 0}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...], 'label': 1}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...], 'label': 0}]

Removing the labels: labels = [feature.pop(label_name) for feature in features] uses the dict pop method to remove the label entry from each sample and collect its value, so that labels=[0, 0, 1, 0]. After this step, each feature looks like:
```
{'input_ids': [[...], [...], [...], [...], [...]], 'token_type_ids': [[...], [...], [...], [...], [...]], 'attention_mask': [[...], [...], [...], [...], [...]]}
```

After the nested list comprehension that splits each sample into its per-option dicts, flattened_features is:

[[{'input_ids': ..., 'token_type_ids': ..., 'attention_mask': ...}, {...}, {...}, {...}, {...}], 
[{...}, {...}, {...}, {...}, {...}], 
[{...}, {...}, {...}, {...}, {...}],
 [{...}, {...}, {...}, {...}, {...}]]

After sum(flattened_features, []), flattened_features is:

# After the sum, flattened_features holds 20 samples
[{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]},
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]},
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]},
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]},
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]},
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]},
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}, 
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]},
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}, ...]

This step turns the nested list into a flat list, so that the subsequent pad and batch operations are straightforward.

The final result is:

{'input_ids': tensor([[[ 101, 2627...,    0]]]),
'token_type_ids': tensor([[[0, 0, 0,  ..., 0, 0]]]),
'attention_mask': tensor([[[1, 1, 1,  ..., 0, 0]]]),
'labels': tensor([0, 0, 1, 0])}
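
The flatten, pad, and reshape steps can also be traced with plain NumPy arrays. The sketch below is an illustration only, not the collator used above: `collate_multiple_choice` is a hypothetical function, the token ids are made up, it handles only input_ids and attention_mask, and it reads the label without popping it so the inputs are left unmodified.

```python
import numpy as np


def collate_multiple_choice(features, pad_id=0):
    """Flatten -> pad -> reshape for features shaped like rows of tokenized_train_ds."""
    labels = [feature["label"] for feature in features]
    batch_size = len(features)
    num_choices = len(features[0]["input_ids"])

    # Flatten: batch_size questions x num_choices options -> one flat list of sequences
    flat = [choice for feature in features for choice in feature["input_ids"]]

    # Dynamic padding: pad only up to the longest sequence in this batch
    max_len = max(len(sequence) for sequence in flat)
    input_ids = np.full((len(flat), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(flat), max_len), dtype=np.int64)
    for row, sequence in enumerate(flat):
        input_ids[row, :len(sequence)] = sequence
        attention_mask[row, :len(sequence)] = 1

    # Reshape back to (batch_size, num_choices, seq_len), like .view() in the collator
    return {
        "input_ids": input_ids.reshape(batch_size, num_choices, max_len),
        "attention_mask": attention_mask.reshape(batch_size, num_choices, max_len),
        "labels": np.array(labels, dtype=np.int64),
    }


# Two hypothetical questions with three options each (token ids are invented)
toy_features = [
    {"input_ids": [[101, 7, 102], [101, 7, 8, 102], [101, 102]], "label": 0},
    {"input_ids": [[101, 9, 102], [101, 102], [101, 9, 9, 9, 102]], "label": 2},
]
toy_batch = collate_multiple_choice(toy_features)
print(toy_batch["input_ids"].shape)
print(toy_batch["input_ids"][0])
```

The longest sequence in this toy batch has 5 tokens, so every sequence is padded to length 5 rather than to the model's maximum length, which is the point of dynamic padding.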

2.3 Loading the Model, Configuring the Trainer, and Training

from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
model = AutoModelForMultipleChoice.from_pretrained(model_dir)

output_dir = 'finetuned_bert'
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to='none')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_train_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer))

trainer.train()

Epoch	Training Loss	Validation Loss
1			No log			1.564447
2			No log			1.527968
3			No log			1.417341
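
A little arithmetic helps in reading this log. The sketch below assumes a single device, no gradient accumulation, and that `logging_steps` is left at the Trainer default of 500; under those assumptions the run is too short for any training loss to be logged, which would explain the "No log" entries. It also computes the cross-entropy of a uniform guess over 5 options as a reference point for the validation loss.

```python
import math

num_samples = 200             # rows in train.csv
batch_size = 4                # per_device_train_batch_size
num_epochs = 3                # num_train_epochs
default_logging_steps = 500   # assumption: Trainer default, not overridden above

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
# save_strategy="epoch" writes one checkpoint at the end of each epoch
checkpoints = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]
training_loss_logged = total_steps >= default_logging_steps

# Loss of a model that assigns probability 1/5 to every option
uniform_loss = math.log(5)

print(steps_per_epoch, total_steps, checkpoints, training_loss_logged)
print(round(uniform_loss, 4))
```

With 150 total steps, the last checkpoint would be checkpoint-150. A uniform guess has a loss of about 1.609, so a validation loss of 1.417 after 3 epochs is only modestly better than chance, which fits the small training set.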

2.4 Predicting and Submitting

Predicting directly with the trainer

test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
# The test set has no answer column; add a dummy one so the format matches the training set and the same preprocessing can be reused
test_df['answer'] = 'A'  
test_ds = Dataset.from_pandas(test_df)
tokenized_test_ds = test_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])

test_predictions = trainer.predict(tokenized_test_ds) # returns a PredictionOutput with three fields: predictions, label_ids, metrics
test_df.head()

import numpy as np
def predictions_to_map_output(predictions):
    # Sort each row of predictions in descending order and take the indices of the top three answers.
    # np.argsort sorts in ascending order by default and returns the indices of the sorted elements, hence the negation.
    top_answer_indices = np.argsort(-predictions)[:,:3]
    top_answers = [' '.join([index_to_option[idx] for idx in row]) for row in top_answer_indices]
    return top_answers
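
The negation trick is easy to check on a small array. The logits below are invented for illustration; the mapping from index to letter follows the index_to_option dictionary defined earlier.

```python
import numpy as np

index_to_option = dict(enumerate("ABCDE"))

# Hypothetical logits for two questions, one column per option A..E
toy_logits = np.array([
    [0.1, 2.0, -1.0, 0.5, 1.5],
    [3.0, 0.2, 0.4, -0.3, 0.1],
])

ascending = np.argsort(toy_logits)       # indices from smallest to largest score
top3 = np.argsort(-toy_logits)[:, :3]    # negate so the largest score comes first
top3_labels = [" ".join(index_to_option[int(idx)] for idx in row) for row in top3]

print(ascending[0])   # [2 0 3 4 1]
print(top3)           # [[1 4 3], [0 2 1]]
print(top3_labels)    # ['B E D', 'A C B']
```

For the first question the highest logit belongs to B, then E, then D, so the submitted string for that row is "B E D".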

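# Take the id column of the test set as the id column of the submission
submission_df = test_df[['id']].copy()  # copy to avoid pandas' SettingWithCopyWarning
submission_df['prediction'] = predictions_to_map_output(test_predictions.predictions)
# The competition requires the file to be named submission.csv
submission_df.to_csv('submission.csv', index=False)
submission_df.head()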

	id	prediction
0	0	D B E
1	1	B A D
2	2	A C D
3	3	C D A
4	4	E D C

Reloading the model for prediction

If you reopen the notebook and predict later, first load the model, set up a Trainer with inference arguments, and then predict.

from transformers import AutoModelForMultipleChoice, AutoTokenizer, TrainingArguments, Trainer
model_checkpoint = "finetuned_bert/checkpoint-150"
# Load the fine-tuned weights from the checkpoint, not the base model in model_dir
model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Inference only
inference_args = TrainingArguments(
    output_dir="./inference_results",  # directory for inference outputs
    per_device_eval_batch_size=8,      # inference batch size per device
)

trainer = Trainer(
    model=model,                  # the model loaded above
    eval_dataset=tokenized_test_ds,
    args=inference_args,          # inference arguments
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer)
)

test_predictions = trainer.predict(tokenized_test_ds)

The remaining steps are the same as before.

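2.5 Related Improvements

"LLM-SE ~ deberta-v3-large -i | 1k Wiki": LEONID KULYK trains deberta-v3-large on his own 1,000 Wikipedia samples together with the competition training set. The notebook includes the final model weights, so inference can be run directly. LB=0.709.
"New dataset + DEBERTA v3 large training!": 0.723→0.759
- Radek builds on LEONID KULYK's work and trains DEBERTA v3 large with 500 additional examples he generated himself. Public Score=0.723.
- The author later generated another 6,000 examples, merged everything into a 6.5K dataset, and trained on it to obtain three sets of model weights, uploaded as Science Exam Trained Model Weights. In "Inference using 3 trained Deberta v3 models", averaging the probabilities predicted by the three models gives Public Score=0.737, while a Voting Ensemble gives Public Score=0.759.
- The author finally uploaded 15k high-quality train examples.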
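
The difference between averaging probabilities and voting can be illustrated with a toy example. The sketch below is not the code from Radek's notebook: `average_ensemble` and `voting_ensemble` are hypothetical helpers, the voting rule is an assumed Borda-style scheme that awards 3, 2, and 1 points to each model's top three options, and the logits are invented so that the two strategies disagree.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def average_ensemble(all_logits):
    """Average the per-model probabilities, then rank the options."""
    mean_probs = np.mean([softmax(logits) for logits in all_logits], axis=0)
    return np.argsort(-mean_probs, axis=-1)[:, :3]


def voting_ensemble(all_logits, points=(3, 2, 1)):
    """Assumed Borda-style vote: each model gives 3/2/1 points to its top-3 options."""
    num_questions, num_options = all_logits[0].shape
    scores = np.zeros((num_questions, num_options))
    for logits in all_logits:
        top3 = np.argsort(-logits, axis=-1)[:, :3]
        for rank, point in enumerate(points):
            scores[np.arange(num_questions), top3[:, rank]] += point
    return np.argsort(-scores, axis=-1, kind="stable")[:, :3]


# Three hypothetical models, one question, five options (A..E)
toy_model_logits = [
    np.array([[5.0, 1.0, 0.5, 0.0, -1.0]]),   # very confident in A
    np.array([[0.0, 1.0, 0.5, -1.0, -2.0]]),  # mildly prefers B
    np.array([[0.8, 1.0, 0.5, -1.0, -2.0]]),  # mildly prefers B
]
average_top3 = average_ensemble(toy_model_logits)
voting_top3 = voting_ensemble(toy_model_logits)
print(average_top3)  # [[0 1 2]] -> A B C
print(voting_top3)   # [[1 0 2]] -> B A C
```

Averaging lets one very confident model dominate the result, while rank-based voting weighs each model equally regardless of confidence. Which behaves better depends on how well calibrated the individual models are.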