llama factory 大数据量下训练失败

bash 复制代码
06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
Traceback (most recent call last):
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 122, in _generate_tables
    io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
  File "pyarrow/_json.pyx", line 52, in pyarrow._json.ReadOptions.__init__
  File "pyarrow/_json.pyx", line 77, in pyarrow._json.ReadOptions.block_size.__set__
OverflowError: value too large to convert to int32_t

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ca2/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/ca2/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/ca2/LLaMA-Factory/src/llmtuner/train/tuner.py", line 29, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/home/ca2/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 28, in run_pt
    dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="pt")
  File "/home/ca2/LLaMA-Factory/src/llmtuner/data/loader.py", line 162, in get_dataset
    all_datasets.append(load_single_dataset(dataset_attr, model_args, data_args))
  File "/home/ca2/LLaMA-Factory/src/llmtuner/data/loader.py", line 92, in load_single_dataset
    dataset = load_dataset(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

从你的错误信息来看,问题出在加载和准备数据集时,具体包括以下两个主要错误:

  1. Checksum Failed:

    复制代码
    06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.

    这个警告提示 dataset_info.json 文件中缺少 SHA-1 哈希值。虽然这是一个警告,但它提示数据集的信息文件可能不完整或被篡改。

  2. OverflowError:

    复制代码
    OverflowError: value too large to convert to int32_t

    这个错误表明你试图加载的数据集文件块太大,超过了 int32_t 能表示的最大值。这通常是因为数据集文件块的大小设置不当。

这两个错误最终导致数据集生成失败,并抛出 DatasetGenerationError

解决方法:

  1. 检查数据集文件:

    • 确保数据集文件 dataset_info.json 完整且包含 SHA-1 哈希值。
    • 如果是从某个数据源下载的数据集,尝试重新下载或从可靠来源获取。
  2. 调整数据集加载选项:

    • 修改数据加载时的块大小设置。可能需要在你的代码中找到并调整 pyarrow 库的读取选项 block_size

    • 尝试修改 block_size 为较小的值,以避免溢出错误。例如:

      python 复制代码
      import pyarrow.json as paj
      
      read_options = paj.ReadOptions(block_size=65536)  # 64KB,或根据需要调整

      在加载数据集的代码中使用上述选项。

  3. 更新库版本:

    • 确保 datasetspyarrow 库是最新版本,某些 bug 可能已经在新版本中修复。

代码示例:

如果你使用的是 datasets 库,可以通过如下方式调整 block_size

python 复制代码
from datasets import load_dataset
import pyarrow.json as paj

# 设置较小的 block_size
read_options = paj.ReadOptions(block_size=65536)

dataset = load_dataset('your_dataset', split='train', read_options=read_options)

如果错误继续存在,请确保数据集文件的完整性,或者联系数据集提供者寻求帮助。

相关推荐
带娃的IT创业者28 分钟前
《Python实战进阶》No39:模型部署——TensorFlow Serving 与 ONNX
pytorch·python·tensorflow·持续部署
Bruce-li__34 分钟前
深入理解Python asyncio:从入门到实战,掌握异步编程精髓
网络·数据库·python
九月镇灵将1 小时前
6.git项目实现变更拉取与上传
git·python·scrapy·scrapyd·gitpython·gerapy
车载小杜1 小时前
基于指针的线程池
开发语言·c++
沐知全栈开发1 小时前
Servlet 点击计数器
开发语言
m0Java门徒1 小时前
Java 递归全解析:从原理到优化的实战指南
java·开发语言
小张学Python1 小时前
AI数字人Heygem:口播与唇形同步的福音,无需docker,无需配置环境,一键整合包来了
python·数字人·heygem
跳跳糖炒酸奶1 小时前
第四章、Isaacsim在GUI中构建机器人(2):组装一个简单的机器人
人工智能·python·算法·ubuntu·机器人
桃子酱紫君2 小时前
华为配置篇-BGP实验
开发语言·华为·php
步木木2 小时前
Anaconda和Pycharm的区别,以及如何选择两者
ide·python·pycharm