[huggingface] Loading datasets offline

Muasci 2023-12-21 21:36

Preface

The server has no network access, so the dataset has to be downloaded by hand and loaded offline.

Steps

Take loading the following dataset as an example:


dataset = load_dataset('stereoset', 'intrasentence')

Go to Hugging Face, find this dataset's repository, and read the .py file under "Files and versions" to see which files need to be downloaded. For example:

https://huggingface.co/datasets/stereoset/blob/main/stereoset.py
_DOWNLOAD_URL = "https://github.com/moinnadeem/Stereoset/raw/master/data/dev.json"
Download this dev.json, along with the other files listed under "Files and versions" (here dataset_infos.json and stereoset.py), and put them all in a directory X.
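The download step can be scripted on a machine that does have network access, with the resulting directory copied to the server afterwards. The following is a minimal sketch, not the only way to do it. The `resolve/main/` URL pattern for the repository files is an assumption based on the `blob/main/` link above, and `plan_downloads` and `download_all` are hypothetical helper names.

```python
import os
import urllib.request

# Taken from stereoset.py: the raw data file the loading script would fetch.
DATA_URL = "https://github.com/moinnadeem/Stereoset/raw/master/data/dev.json"
# Assumption: repository files are served raw under .../resolve/main/<name>.
HF_RESOLVE_BASE = "https://huggingface.co/datasets/stereoset/resolve/main/"
REPO_FILES = ["dataset_infos.json", "stereoset.py"]


def plan_downloads(target_dir):
    """Return (url, local_path) pairs for every file that belongs in target_dir."""
    plan = [(DATA_URL, os.path.join(target_dir, "dev.json"))]
    for name in REPO_FILES:
        plan.append((HF_RESOLVE_BASE + name, os.path.join(target_dir, name)))
    return plan


def download_all(target_dir):
    """Fetch every planned file, skipping those that already exist locally."""
    os.makedirs(target_dir, exist_ok=True)
    for url, path in plan_downloads(target_dir):
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)


# Run this on the machine with internet access, then copy X to the server:
# download_all("X")
```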
Then change the line that loads the data to:

dataset = load_dataset("X/stereoset.py", 'intrasentence')

(If you write dataset = load_dataset("X", 'intrasentence') instead, the call goes through def _prepare_split_single in site-packages/datasets/builder.py and may fail with the following error.)


ValueError: Not able to read records in the JSON file at /data/syxu/representation-engineering/data/fairness/dev.json. You should probably indicate the field of the JSON file containing your records. This JSON file contain the following fields: ['version', 'data']. Select the correct one and provide it as `field='XXX'` to the dataset loading method.
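The message lists the top-level fields of dev.json as ['version', 'data'], which suggests the records sit inside a wrapper object rather than at the top level. Presumably the directory is then read by the generic JSON loader instead of stereoset.py, which knows about that layout. A small stdlib-only sketch for checking the layout of a JSON file follows; `top_level_fields` is a hypothetical helper name.

```python
import json


def top_level_fields(json_path):
    """Return the top-level keys of a JSON file, or [] if the root is not an object."""
    with open(json_path, encoding="utf-8") as f:
        payload = json.load(f)
    return list(payload.keys()) if isinstance(payload, dict) else []


# Example (path as used in this post):
# print(top_level_fields("X/dev.json"))
```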

Next, change how data_path is obtained in _split_generators.

The original may look like this:


data_path = dl_manager.download_and_extract(self._DOWNLOAD_URL)

Comment out this line and set data_path directly to 'X/dev.json'.
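Instead of hard-coding the path, a variant is to prefer the local file when it exists and fall back to the original download otherwise, so the same script still works on a machine with network access. This is only a sketch: `resolve_data_path` is a hypothetical helper, not part of the datasets library. It returns an absolute path because a relative 'X/dev.json' is resolved against the current working directory.

```python
import os


def resolve_data_path(dl_manager, url, local_path):
    """Prefer an already-downloaded local file; otherwise download as before."""
    if local_path and os.path.isfile(local_path):
        return os.path.abspath(local_path)
    # Same call the original script used.
    return dl_manager.download_and_extract(url)


# Inside _split_generators of stereoset.py, the original line would become:
# data_path = resolve_data_path(dl_manager, self._DOWNLOAD_URL, "X/dev.json")
```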

Finally, enable offline mode through an environment variable:

export HF_DATASETS_OFFLINE=1
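When exporting the variable in the shell is inconvenient (for example in a notebook), it can also be set from Python. To my understanding the datasets library reads this variable when it is imported, so this sketch sets it before the import.

```python
import os

# Set the flag before datasets is imported, so the library sees it at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"

# Import only after the variable is set:
# from datasets import load_dataset
```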

Other cases

Parquet files:


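from datasets import load_dataset
# Each split takes a list of local file paths; the paths below are placeholders.
dataset = load_dataset("parquet", data_files={'train': ['path/to/train.parquet'], 'test': ['path/to/test.parquet']})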
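When a split is stored as several parquet shards, the data_files dictionary can be built by globbing a directory. This is a minimal sketch that assumes shard file names start with the split name (e.g. train-00000.parquet). Both that naming convention and the helper `collect_parquet_files` are hypothetical.

```python
import glob
import os


def collect_parquet_files(root, splits=("train", "test")):
    """Map each split name to the sorted list of matching parquet files under root."""
    data_files = {}
    for split in splits:
        matches = sorted(glob.glob(os.path.join(root, f"{split}*.parquet")))
        if matches:  # leave out splits that have no files
            data_files[split] = matches
    return data_files


# Usage with the loader shown above:
# data_files = collect_parquet_files("X")
# dataset = load_dataset("parquet", data_files=data_files)
```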

Reference

https://huggingface.co/docs/datasets/v1.12.0/loading.html