LLaMA Factory: training fails on large datasets

06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
Traceback (most recent call last):
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 122, in _generate_tables
    io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
  File "pyarrow/_json.pyx", line 52, in pyarrow._json.ReadOptions.__init__
  File "pyarrow/_json.pyx", line 77, in pyarrow._json.ReadOptions.block_size.__set__
OverflowError: value too large to convert to int32_t

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ca2/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/ca2/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/ca2/LLaMA-Factory/src/llmtuner/train/tuner.py", line 29, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/home/ca2/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 28, in run_pt
    dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="pt")
  File "/home/ca2/LLaMA-Factory/src/llmtuner/data/loader.py", line 162, in get_dataset
    all_datasets.append(load_single_dataset(dataset_attr, model_args, data_args))
  File "/home/ca2/LLaMA-Factory/src/llmtuner/data/loader.py", line 92, in load_single_dataset
    dataset = load_dataset(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

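Judging by the output, the failure happens while the dataset is being loaded and prepared. Two messages stand out, but only the second one is fatal:

  1. Checksum failed (warning):

    06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.

    This warning only says that the dataset's entry in dataset_info.json has no SHA-1 hash, so LLaMA-Factory could not verify the file. It does not mean the file is corrupted or has been tampered with, and it does not stop training.

  2. OverflowError:

    OverflowError: value too large to convert to int32_t

    This is the real cause. pyarrow stores the JSON reader's block_size as a 32-bit integer, and the value that datasets passed to paj.ReadOptions exceeded the int32 maximum (2,147,483,647 bytes, about 2 GiB). A likely trigger is a single very large JSON record, or a file that is not split into lines, which makes the loader keep enlarging the block until the value no longer fits.

The OverflowError is then re-raised by datasets as DatasetGenerationError, which is the last line of the traceback.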
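
To see why a larger input can run into the int32 limit, here is a minimal sketch of the suspected mechanism. It assumes (this is my reading of the datasets JSON loader, not something the log itself confirms) that the loader starts from a block size derived from its default chunk size and doubles it each time a record does not fit into one block. The starting values and the function name `doublings_until_overflow` are illustrative.

```python
INT32_MAX = 2**31 - 1  # largest value an int32_t block_size can hold


def doublings_until_overflow(start_block_size: int) -> int:
    """Count how many doublings it takes for a block size to exceed INT32_MAX."""
    if start_block_size <= 0:
        raise ValueError("start_block_size must be positive")
    block_size = start_block_size
    doublings = 0
    while block_size <= INT32_MAX:
        block_size *= 2
        doublings += 1
    return doublings


# Assumed defaults: a 10 MiB chunk and a starting block of chunk // 32 (at least 16 KiB).
assumed_chunksize = 10 << 20
start = max(assumed_chunksize // 32, 16 << 10)
print(f"start block size: {start} bytes")
print(f"doublings before int32 overflow: {doublings_until_overflow(start)}")
```

If that assumption holds, a limited number of retries is enough to reach the limit once a single record (or a file without line breaks) spans gigabytes, which would explain why the failure only shows up with large data.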

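Solutions:

  1. Check the dataset files:

    • The checksum warning can be ignored. To silence it, add the file's SHA-1 hash to the dataset's entry in dataset_info.json.
    • If the dataset was downloaded, re-download it or fetch it from a reliable source to rule out a truncated file.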
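  2. Adjust the dataset loading options:

    • The block size is not set in your own code: datasets builds paj.ReadOptions internally (json.py, line 122 in the traceback). Setting block_size by hand only applies if you read the file with pyarrow directly.

    • In that case, choose a block_size well below the int32 limit. For example:

      import pyarrow.json as paj

      # 64 KiB blocks; adjust as needed, but keep the value below 2**31
      read_options = paj.ReadOptions(block_size=65536)
      # "your_data.jsonl" is a placeholder for your own JSON Lines file
      table = paj.read_json("your_data.jsonl", read_options=read_options)

      pyarrow reports a "straddling object" error when a record is larger than one block, so very large records need a larger block_size, not a smaller one.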

  3. Update the libraries:

    • Make sure datasets and pyarrow are up to date; some bugs may already be fixed in newer releases.
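
For the checksum warning in step 1, a minimal sketch of computing the hash with Python's standard hashlib module. The path `data/your_dataset.json` is a placeholder, and storing the result under a `file_sha1` key in the dataset's dataset_info.json entry is an assumption based on LLaMA-Factory's data README; check the version you have installed.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: str, chunk_bytes: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    dataset_file = Path("data/your_dataset.json")  # placeholder path
    if dataset_file.exists():
        print(sha1_of_file(str(dataset_file)))
    else:
        print(f"{dataset_file} not found; point this at your own dataset file")
```

Reading in chunks matters here because the datasets in question are large; hashing the whole file in one read could exhaust memory.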

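Code example:

If you call the datasets library yourself, note that load_dataset does not accept a read_options argument. For JSON files the loader's own setting is chunksize (the number of bytes read per batch), from which, as far as I know, it derives the pyarrow block size. Changing it will not help if a single record is itself larger than about 2 GiB; in that case the data has to be split.

from datasets import load_dataset

# "your_data.jsonl" is a placeholder; chunksize is in bytes (here 1 MiB)
dataset = load_dataset(
    "json",
    data_files="your_data.jsonl",
    split="train",
    chunksize=1 << 20,
)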

If the error persists, verify that the dataset files are intact, or contact the dataset provider for help.
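
One way to check the files yourself is to look for unusually large records, since a single multi-gigabyte line (for example, a whole JSON array written without line breaks) is a plausible trigger for the overflow. A minimal sketch, assuming the data is stored as one JSON record per line; `largest_line` is an illustrative helper, not part of LLaMA-Factory or datasets.

```python
import os
import tempfile


def largest_line(path: str, chunk_bytes: int = 1 << 20) -> tuple[int, int]:
    """Return (line_number, size_in_bytes) of the longest line in a file.

    The file is read in fixed-size chunks, so a single huge line cannot
    exhaust memory. Sizes exclude the newline; line numbers start at 1.
    An empty file yields (0, 0).
    """
    best_line, best_size = 0, 0
    current_line, current_size = 1, 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            start = 0
            while (newline := chunk.find(b"\n", start)) != -1:
                current_size += newline - start
                if current_size > best_size:
                    best_line, best_size = current_line, current_size
                current_line += 1
                current_size = 0
                start = newline + 1
            # Bytes after the last newline belong to a line that continues
            # in the next chunk (or ends the file).
            current_size += len(chunk) - start
    if current_size > best_size:  # last line without a trailing newline
        best_line, best_size = current_line, current_size
    return best_line, best_size


if __name__ == "__main__":
    # Tiny self-made sample; replace it with the path to your own JSON Lines file.
    with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
        tmp.write(b'{"text": "short"}\n{"text": "a much longer record"}\n')
    line_no, size = largest_line(tmp.name)
    print(f"longest line: #{line_no}, {size} bytes")
    os.remove(tmp.name)
```

If the longest line is close to or above 2 GiB, splitting that record or re-exporting the data with one record per line is probably worth trying before changing any loader options.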
