Troubleshooting vLLM deployment failures on a single-node, multi-GPU H100 server: error logs and fixes to try

bash
(RayWorkerWrapper pid=1399515) INFO 11-26 07:55:28 selector.py:135] Using Flash Attention backend.
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] Traceback (most recent call last):
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480]   File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 472, in execute_method
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480]   File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker.py", line 135, in init_device
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480]     torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480]   File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480]     torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] RuntimeError: CUDA error: invalid device ordinal
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] 
2024-11-26 07:55:33,849	ERROR worker.py:422 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method() (pid=1399511, ip=192.168.20.66, actor_id=936749e683585800d9229a9c01000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7efaaddec610>)
  File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 481, in execute_method
    raise e
  File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 472, in execute_method
    return executor(*args, **kwargs)
  File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker.py", line 135, in init_device
    torch.cuda.set_device(self.device)
  File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This is a classic CUDA error: invalid device ordinal, raised when code tries to select a GPU index that does not exist in the current process. The failing call is:

torch.cuda.set_device(self.device)

The requested CUDA device cannot be found, and the likely causes are:

  1. Wrong device index

    • torch.cuda.set_device() only accepts indices of GPUs that are visible to the calling process, numbered from 0. For example, if a process can see only 2 GPUs, the valid indices are 0 and 1; the error means self.device was set to an index outside that range (2 or higher).
  2. The CUDA device does not exist or is unavailable

    • The GPUs may be disabled, misconfigured, or simply not detected, e.g. because of a driver/CUDA version mismatch or a broken CUDA environment.
  3. Device assignment in distributed execution

    • The error comes from the init_device method executed inside a RayWorkerWrapper, so it most likely happens while the distributed backend assigns devices to workers. If the device mapping is wrong (for example, a worker tries to use a GPU index that is not visible to it), exactly this error appears; the sketch below illustrates how CUDA_VISIBLE_DEVICES changes which indices are valid inside each worker.
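
A minimal Python sketch of that remapping (the CUDA_VISIBLE_DEVICES value in the comment is only a hypothetical example; run the snippet inside the same environment as the failing worker):

python
import os
import torch

# Hypothetical example: if the worker was launched with CUDA_VISIBLE_DEVICES=2,3,
# the two visible GPUs are renumbered 0 and 1 inside this process,
# so torch.cuda.set_device(2) raises "invalid device ordinal".
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())

for ordinal in range(torch.cuda.device_count()):
    print(ordinal, torch.cuda.get_device_name(ordinal))
# Any ordinal >= torch.cuda.device_count() is invalid for this process.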

Solutions

  1. Check the device index

    Make sure the index you use actually exists on this machine. List the available CUDA devices with:

    bash
    nvidia-smi

    Then confirm that the code and the launcher only refer to valid indices.

  2. Make sure the CUDA environment is configured correctly

    • Check that CUDA and cuDNN are installed properly and are compatible with your PyTorch build.

    • Use the following snippet to confirm that PyTorch can see the GPUs:

      python
      import torch
      print(torch.cuda.is_available())  # is CUDA usable at all?
      print(torch.cuda.device_count())  # how many GPUs are visible?
  3. Collect better debugging information

    To narrow the problem down, make CUDA report errors synchronously:

    bash
    export CUDA_LAUNCH_BLOCKING=1

    Errors are then raised at the call that actually failed, which makes the stack trace point at the real culprit.

  4. Check the per-worker device configuration

    In a distributed setup with Ray workers (RayWorkerWrapper), make sure every worker node is configured with device indices that exist on that node, and that each worker can actually reach the GPU assigned to it.

  5. Check the Ray resource allocation

    If Ray is scheduling the workers, confirm that the cluster registers the GPUs and hands them out correctly; the sketch below shows how to inspect both the cluster view and what an individual worker sees.
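
A minimal sketch, assuming Ray and PyTorch are installed in the same environment the server runs in (ray.init() attaches to an existing cluster if RAY_ADDRESS is set, otherwise it starts a local one):

python
import os
import ray
import torch

ray.init()

# Cluster-level view: how many GPUs Ray believes it can schedule.
print("cluster resources  :", ray.cluster_resources())
print("available resources:", ray.available_resources())

@ray.remote(num_gpus=1)
def gpu_check():
    # Ray sets CUDA_VISIBLE_DEVICES for each task it schedules, so
    # device_count() here reflects what a vLLM worker would see.
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "device_count": torch.cuda.device_count(),
    }

# Launch one probe per tensor-parallel rank (2 in this deployment).
print(ray.get([gpu_check.remote() for _ in range(2)]))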

Summary

The core problem is an attempt to select a CUDA device that does not exist in the current process. The index may be out of range, the CUDA environment may be misconfigured, or the distributed launcher may be handing a worker the wrong device. Checking the device indices, the CUDA setup, and each worker's view of the GPUs with the tools above should localize the problem.

bash
vllm serve /home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct   --dtype bfloat16  --served-model-name /home/gpu/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct   --max-model-len=16384   --tensor-parallel-size 2 --port 8000
INFO 11-26 06:51:11 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-26 06:51:11 api_server.py:586] args: Namespace(subparser='serve', model_tag='/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=16384, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/home/gpu/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7fdbed684540>)
INFO 11-26 06:51:11 api_server.py:175] Multiprocessing frontend to use ipc:///tmp/3a17ee61-43bc-476f-b24b-2fcaa66170f2 for IPC Path.
INFO 11-26 06:51:11 api_server.py:194] Started engine process with PID 1283407
INFO 11-26 06:51:14 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 11-26 06:51:14 config.py:1020] Defaulting to use mp for distributed inference
WARNING 11-26 06:51:14 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-26 06:51:17 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
INFO 11-26 06:51:17 config.py:1020] Defaulting to use mp for distributed inference
WARNING 11-26 06:51:17 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-26 06:51:17 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct', speculative_config=None, tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/gpu/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-26 06:51:17 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 11-26 06:51:17 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=1283681) INFO 11-26 06:51:21 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=1283681) INFO 11-26 06:51:21 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-26 06:51:21 utils.py:961] Found nccl from library libnccl.so.2
INFO 11-26 06:51:21 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=1283681) INFO 11-26 06:51:21 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1283681) INFO 11-26 06:51:21 pynccl.py:69] vLLM is using nccl==2.21.5
Traceback (most recent call last):
  File "/home/ca2/anaconda3/envs/llama3/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/scripts.py", line 195, in main
    args.dispatch_function(args)
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/scripts.py", line 41, in serve
    uvloop.run(run_server(args))
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/ca2/anaconda3/envs/llama3/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

According to this log, vLLM raised a RuntimeError at startup ("Engine process failed to start") without surfacing the actual root cause, and multiprocessing/resource_tracker warned about leaked semaphore objects at shutdown. Failures like this usually come from one of the following:


1. Environment configuration problems

  • bfloat16 not supported: make sure the GPUs support bfloat16 (Ampere or newer NVIDIA architectures; H100 does). A quick check is sketched at the end of this subsection.
  • NCCL configuration problems: multi-GPU serving depends on NCCL, and a mismatched library version or bad environment variables can break its initialization.

Solutions

  • Confirm which NCCL version PyTorch was built against:

    bash
    python -c "import torch; print(torch.cuda.nccl.version())"
  • Turn on NCCL logging and, as a diagnostic step, disable the P2P and shared-memory transports:

    bash
    export NCCL_DEBUG=INFO
    export NCCL_P2P_DISABLE=1
    export NCCL_SHM_DISABLE=1
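
A short check for bfloat16 support and the NCCL version, assuming it runs in the same conda environment as the server:

python
import torch

# H100 is compute capability 9.0; anything >= 8.0 (Ampere) supports bfloat16.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")

print("bf16 supported:", torch.cuda.is_bf16_supported())
print("NCCL version  :", torch.cuda.nccl.version())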

2. Resource leaks or permission problems

  • Resource leaks: resource_tracker reports several leaked semaphores; this is usually a symptom of the worker processes being killed before they could clean up, but it can leave the machine in a state that trips up the next start as well.
  • File-lock conflicts: a shared filesystem can cause lock contention between the parallel worker processes.

Solutions

  • Check that the host (or VM) has enough resources, e.g. shared memory and file descriptors:

    bash
    ulimit -n

    Raise the open-file limit if necessary:

    bash
    ulimit -n 65536
  • If the model or cache directory lives on a shared filesystem (e.g. NFS), try moving it to local storage.
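
The same limits can be inspected from Python with the standard library only (the /dev/shm check matters because NCCL and the multiprocessing workers use shared memory for intra-node communication unless NCCL_SHM_DISABLE=1):

python
import resource
import shutil

# Soft/hard limits on open file descriptors for this process
# (the same numbers `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Free space in /dev/shm; a tiny shm segment can also break multi-process startup.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")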


3. Launch configuration errors

  • A wrong model path or bad launch arguments will keep the engine from loading.
  • --served-model-name is meant to be a short, unique alias, not a filesystem path (in the command above it even points at a different home directory, /home/gpu/..., than the model path /home/ca2/...).

Solutions

  • Change --served-model-name to a short identifier, for example:

    bash
    vllm serve /home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct \
      --dtype bfloat16 \
      --served-model-name Qwen-14B-Instruct \
      --max-model-len=16384 \
      --tensor-parallel-size 2 \
      --port 8000
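
Clients then have to pass exactly that name in the model field. A minimal smoke test with the OpenAI Python client, assuming the server is reachable on localhost:8000:

python
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen-14B-Instruct",  # must match --served-model-name exactly
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)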

4. vLLM multi-GPU (tensor parallel) issues

  • With --tensor-parallel-size 2, each GPU has to hold its shard of the model weights plus the KV cache that vLLM pre-allocates.
  • If a card does not have enough free memory, the engine dies during startup.

Solutions

  • Reduce --max-model-len to shrink the KV cache, or run with --tensor-parallel-size 1 to rule out multi-GPU communication problems (a single GPU then has to hold all of the weights):

    bash
    vllm serve ... --tensor-parallel-size 1 --max-model-len=8192
  • Check GPU usage and make sure every card has enough free memory:

    bash
    nvidia-smi
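
As a rough back-of-the-envelope check: ~14e9 parameters in bfloat16 take about 28 GB, so with --tensor-parallel-size 2 each GPU needs roughly 14 GB for weights, plus the KV cache that vLLM grows up to gpu_memory_utilization. A sketch that compares this against the memory actually free on each card:

python
import torch

PARAMS = 14e9        # approximate parameter count of Qwen2.5-14B
BYTES_PER_PARAM = 2  # bfloat16
TP_SIZE = 2

weights_per_gpu = PARAMS * BYTES_PER_PARAM / TP_SIZE / 2**30
print(f"weights per GPU (TP={TP_SIZE}): ~{weights_per_gpu:.1f} GiB")

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: free {free / 2**30:.1f} GiB / total {total / 2**30:.1f} GiB")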

5. vLLM or dependency version problems

  • An incompatible combination of vLLM, PyTorch and CUDA can also keep the engine from starting (note that the two logs above come from two different conda environments, qwen1.5 and llama3).

Solutions

  • Recreate a clean environment and let pip resolve the PyTorch build that this vLLM release was compiled against:

    bash
    conda create -n vllm_env python=3.10
    conda activate vllm_env
    pip install vllm
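
After installation, a quick sanity check that the resolved versions line up (the exact torch build depends on the vLLM release, so the printed values will vary):

python
import torch
import vllm

print("vllm          :", vllm.__version__)
print("torch         :", torch.__version__)
print("cuda (torch)  :", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())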

6. Problems with the model itself

  • An incomplete download of Qwen2.5-14B-Instruct, or files the process cannot read, will also abort engine startup.

Solutions

  • Verify that the model directory is complete (a more thorough check is sketched after this list):

    bash
    ls /home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct
  • Re-download the model if anything is missing or truncated:

    bash
    pip install modelscope
    modelscope download --model qwen/Qwen2.5-14B-Instruct
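
A small sketch of that directory check (the path is taken from the log, and the required-file list is only indicative; the exact set of files varies by model):

python
from pathlib import Path

model_dir = Path("/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct")

# Files a Transformers-format checkpoint normally ships with (indicative, not exhaustive).
required = ["config.json", "tokenizer_config.json"]
missing = [name for name in required if not (model_dir / name).exists()]

weights = sorted(model_dir.glob("*.safetensors")) or sorted(model_dir.glob("*.bin"))
total_gib = sum(p.stat().st_size for p in weights) / 2**30

print("missing config files:", missing or "none")
print("weight shards found :", [p.name for p in weights])
print(f"total weight size   : {total_gib:.1f} GiB")  # ~28 GiB expected for a bf16 14B model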

7. More debugging information

If none of the above resolves it, enable verbose logging and retry:

bash
export VLLM_LOGGING_LEVEL=DEBUG
export NCCL_DEBUG=INFO

Then re-run the serve command and read the detailed logs, paying particular attention to the output of the engine process (the frontend traceback above only says that the engine died, not why).


Working through these steps should let you narrow down and fix the vLLM startup failure step by step. If it still does not start, share the full debug log for further analysis.
