【bug】大模型微调bug:OSError: Failed to load tokenizer.| Lora

LLaMA-Factory 默认使用的是 Hugging Face Hub(huggingface.co)作为模型下载源 。如果你在国内、或者使用的是 ModelScope 镜像(魔搭社区) ,就必须手动改配置或环境变量,让它从 modelscope.cn 拉取模型。

报错原文

yaml 复制代码
root@dsw-1400022-546bdb8d54-27nsv:/mnt/workspace/LLaMA-Factory# llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
2025-10-16 13:00:45.897740: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-10-16 13:00:45.937885: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-10-16 13:00:46.853637: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2025-10-16 13:00:49,181] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: 没有那个文件或目录
[2025-10-16 13:00:51,610] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
[INFO|2025-10-16 13:00:52] llamafactory.hparams.parser:423 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: False, compute dtype: torch.bfloat16
[INFO|tokenization_auto.py:898] 2025-10-16 13:01:02,405 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 488, in _make_request
    raise new_e
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 464, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 704, in connect
    self.sock = sock = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 213, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f63fc1bfcd0>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /LLM-Research/Meta-Llama-3-8B-Instruct/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f63fc1bfcd0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1546, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1463, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 286, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 309, in _request_wrapper
    response = http_backoff(method=method, url=url, **params, retry_on_exceptions=(), retry_on_status_codes=(429,))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 310, in http_backoff
    response = session.request(method=method, url=url, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 96, in send
    return super().send(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /LLM-Research/Meta-Llama-3-8B-Instruct/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f63fc1bfcd0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 173af757-1a47-42b9-9662-480a9cec02c5)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/hub.py", line 479, in cached_files
    hf_hub_download(
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1010, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1117, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1661, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/workspace/LLaMA-Factory/src/llamafactory/model/loader.py", line 78, in load_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 1069, in from_pretrained
    config = AutoConfig.from_pretrained(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1250, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/configuration_utils.py", line 649, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/configuration_utils.py", line 708, in _get_config_dict
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/hub.py", line 321, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/hub.py", line 553, in cached_files
    raise OSError(
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Check your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/workspace/LLaMA-Factory/src/llamafactory/cli.py", line 24, in main
    launcher.launch()
  File "/mnt/workspace/LLaMA-Factory/src/llamafactory/launcher.py", line 152, in launch
    run_exp()
  File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 72, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/mnt/workspace/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 48, in run_sft
    tokenizer_module = load_tokenizer(model_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/workspace/LLaMA-Factory/src/llamafactory/model/loader.py", line 93, in load_tokenizer
    raise OSError("Failed to load tokenizer.") from e
OSError: Failed to load tokenizer.

bug分析

我的模型下载源默认是在Hubgging Face Hub。但是我是在国内,用的是ModelScope社区镜像,魔塔社区,拉取模型的时候吗,下载外网镜像会失败。

解决方案

将镜像源切换到国内的魔塔社区。

llama3_lora_sft.yaml 里直接指定 ModelScope 模型路径:

yaml 复制代码
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct

设置环境变量:

yaml 复制代码
export USE_MODELSCOPE_HUB=1
export MODELSCOPE_CACHE=/mnt/workspace/.cache/modelscope

开始训练:

yaml 复制代码
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

训练开始如图所示:

相关推荐
咕白m6252 分钟前
通过 C# 快速生成二维码 (QR code)
后端·.net
踏浪无痕9 分钟前
架构师如何学习 AI:三个月掌握核心能力的务实路径
人工智能·后端·程序员
小毅&Nora26 分钟前
【后端】【SpringBoot】① 源码解析:从启动到优雅关闭
spring boot·后端·优雅关闭
嘻哈baby28 分钟前
从TIME_WAIT爆炸到端口耗尽:Linux短连接服务排查与优化
后端
开心就好20251 小时前
iOS应用性能监控全面解析:CPU、内存、FPS、卡顿与内存泄漏检测
后端
问今域中2 小时前
Spring Boot 请求参数绑定注解
java·spring boot·后端
计算机程序设计小李同学2 小时前
婚纱摄影集成管理系统小程序
java·vue.js·spring boot·后端·微信小程序·小程序
一 乐2 小时前
绿色农产品销售|基于springboot + vue绿色农产品销售系统(源码+数据库+文档)
java·前端·数据库·vue.js·spring boot·后端·宠物
3***68843 小时前
Spring Boot中使用Server-Sent Events (SSE) 实现实时数据推送教程
java·spring boot·后端
C***u1763 小时前
Spring Boot问题总结
java·spring boot·后端