```bash
(RayWorkerWrapper pid=1399515) INFO 11-26 07:55:28 selector.py:135] Using Flash Attention backend.
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] Traceback (most recent call last):
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 472, in execute_method
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] return executor(*args, **kwargs)
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker.py", line 135, in init_device
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] RuntimeError: CUDA error: invalid device ordinal
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=1399511) ERROR 11-26 07:55:28 worker_base.py:480]
2024-11-26 07:55:33,849 ERROR worker.py:422 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method() (pid=1399511, ip=192.168.20.66, actor_id=936749e683585800d9229a9c01000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7efaaddec610>)
File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 481, in execute_method
raise e
File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 472, in execute_method
return executor(*args, **kwargs)
File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/vllm/worker/worker.py", line 135, in init_device
torch.cuda.set_device(self.device)
File "/home/ca2/anaconda3/envs/qwen1.5/lib/python3.10/site-packages/torch/cuda/__init__.py", line 478, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

This log shows a classic CUDA error, "invalid device ordinal", which occurs when the code tries to select a GPU device that does not exist. The failure happens at this call:

```python
torch.cuda.set_device(self.device)
```

The root cause is that the requested CUDA device cannot be found on the system. Likely reasons include:

- **Incorrect device index** (a quick validation sketch follows this list): `torch.cuda.set_device()` only accepts valid GPU indices, which start at 0. For example, on a machine with two GPUs the valid indices are 0 and 1. The traceback indicates that `self.device` was set to an index that does not exist on this node, e.g. 2 or higher.
- **CUDA device missing or unavailable**: no GPU is enabled, or some GPUs are not configured correctly and cannot be recognized, for instance because of a driver/CUDA version mismatch or a broken CUDA environment.
- **Device mismatch in distributed execution**: the error occurs inside the `init_device` method executed by a `RayWorkerWrapper`, so it may come from a distributed setup in which a worker node is handed a GPU index it does not actually have. A node that tries to access a nonexistent GPU fails with exactly this error.
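The sketch below illustrates the first cause: a minimal, hypothetical guard (the function name and `device_index` argument are ours, not vLLM's) that validates an index against what the node actually exposes before calling the same `torch.cuda.set_device` that fails in the traceback:

```python
import torch

def set_device_safely(device_index: int) -> torch.device:
    """Validate a GPU index against the devices this node actually exposes."""
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available on this node")
    visible = torch.cuda.device_count()  # respects CUDA_VISIBLE_DEVICES
    if device_index >= visible:
        raise ValueError(
            f"Requested cuda:{device_index}, but only {visible} device(s) are visible; "
            "check CUDA_VISIBLE_DEVICES and the per-node GPU assignment."
        )
    device = torch.device(f"cuda:{device_index}")
    torch.cuda.set_device(device)  # the call that raises 'invalid device ordinal'
    return device

if __name__ == "__main__":
    print(set_device_safely(0))
```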
**Solutions**

- **Check the device index**: make sure the device index you use exists on the system. List the available CUDA devices with:

  ```bash
  nvidia-smi
  ```

  Then verify that the code (or the vLLM/Ray configuration) refers to one of those indices.
- **Make sure the CUDA environment is configured correctly**:
  - Check that CUDA and cuDNN are installed properly and are compatible with the installed PyTorch version.
  - Use the following snippet to check whether PyTorch can see the CUDA devices:

    ```python
    import torch
    print(torch.cuda.is_available())   # is CUDA usable at all?
    print(torch.cuda.device_count())   # how many GPUs are visible?
    ```
- **Enable debugging output**: to narrow the problem down further, switch CUDA to synchronous error reporting:

  ```bash
  export CUDA_LAUNCH_BLOCKING=1
  ```

  Errors are then reported at the call that caused them, which helps pinpoint the failing operation.
- **Check the device configuration of each worker**: when running distributed Ray workers (`RayWorkerWrapper`), make sure every worker node is configured with GPU indices that exist on that node and that those GPUs are actually reachable there.
- **Check the Ray configuration**: if Ray is used for distributed execution, confirm that Ray has allocated the resources correctly and that the cluster configuration assigns GPUs to the worker nodes as expected (a small resource check is sketched below).
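As a minimal sketch (assuming Ray is installed; swap in `ray.init(address="auto")` to join an existing cluster instead of starting a local one), this shows how to inspect the GPUs Ray believes each node has before launching vLLM:

```python
import ray

# Start a local Ray instance; use ray.init(address="auto") to attach to a running cluster.
ray.init(ignore_reinit_error=True)

# Aggregate view, e.g. {'GPU': 2.0, 'CPU': 64.0, ...}
print("cluster resources:", ray.cluster_resources())

# Per-node view: every node should report the number of GPUs you expect it to have.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], "GPUs:", node["Resources"].get("GPU", 0))
```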
**Summary**

The core of the error is an attempt to select a CUDA device that does not exist. Possible causes are an out-of-range device index, a misconfigured CUDA environment, or a device mismatch in distributed execution. Checking the device indices, the CUDA setup, and the Ray resource assignment, together with the debugging flags above, should be enough to locate and fix the problem.
```bash
vllm serve /home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct --dtype bfloat16 --served-model-name /home/gpu/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct --max-model-len=16384 --tensor-parallel-size 2 --port 8000
INFO 11-26 06:51:11 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-26 06:51:11 api_server.py:586] args: Namespace(subparser='serve', model_tag='/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=16384, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/home/gpu/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7fdbed684540>)
INFO 11-26 06:51:11 api_server.py:175] Multiprocessing frontend to use ipc:///tmp/3a17ee61-43bc-476f-b24b-2fcaa66170f2 for IPC Path.
INFO 11-26 06:51:11 api_server.py:194] Started engine process with PID 1283407
INFO 11-26 06:51:14 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 11-26 06:51:14 config.py:1020] Defaulting to use mp for distributed inference
WARNING 11-26 06:51:14 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-26 06:51:17 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
INFO 11-26 06:51:17 config.py:1020] Defaulting to use mp for distributed inference
WARNING 11-26 06:51:17 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-26 06:51:17 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct', speculative_config=None, tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/gpu/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-26 06:51:17 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 11-26 06:51:17 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=1283681) INFO 11-26 06:51:21 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=1283681) INFO 11-26 06:51:21 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-26 06:51:21 utils.py:961] Found nccl from library libnccl.so.2
INFO 11-26 06:51:21 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=1283681) INFO 11-26 06:51:21 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1283681) INFO 11-26 06:51:21 pynccl.py:69] vLLM is using nccl==2.21.5
Traceback (most recent call last):
File "/home/ca2/anaconda3/envs/llama3/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/scripts.py", line 195, in main
args.dispatch_function(args)
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/scripts.py", line 41, in serve
uvloop.run(run_server(args))
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ca2/anaconda3/envs/llama3/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/ca2/anaconda3/envs/llama3/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
```

From the log, vLLM hits a `RuntimeError` at startup ("Engine process failed to start") without reporting a specific root cause, and `multiprocessing/resource_tracker` warns about leaked semaphore objects. These symptoms are usually related to one of the following causes:

**1. Environment configuration issues**

- **`bfloat16` not supported**: make sure your GPU supports `bfloat16` precision (Ampere or newer NVIDIA architectures).
- **NCCL configuration problems**: multi-GPU serving relies on NCCL, and the library version or environment variables may be misconfigured.

**Solution:**

- Confirm that the NCCL bundled with PyTorch is available:

  ```bash
  python -c "import torch; print(torch.cuda.nccl.version())"
  ```

- Set the debugging/workaround environment variables (a minimal two-GPU NCCL sanity test is sketched after this list):

  ```bash
  export NCCL_DEBUG=INFO
  export NCCL_P2P_DISABLE=1
  export NCCL_SHM_DISABLE=1
  ```
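If NCCL itself is the suspect, a minimal sanity test (a standalone sketch assuming two local GPUs; the file name and port are arbitrary) is to run an `all_reduce` across two processes with the `nccl` backend:

```python
# nccl_check.py -- run with: python nccl_check.py
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # default op is SUM; every rank should end up with world_size
    print(f"rank {rank}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # matches --tensor-parallel-size 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

If this hangs or crashes, the problem is in the NCCL/driver layer rather than in vLLM.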
**2. Resource leaks or permission issues**

- **Resource leaks**: the log shows `resource_tracker` detecting several leaked semaphores, which can accompany a failed process start.
- **File-lock conflicts**: a shared filesystem can cause file-lock conflicts between parallel processes.

**Solution:**

- Check that the host or VM has enough resources, for example shared memory and the file descriptor limit:

  ```bash
  ulimit -n
  ```

  Increase the file descriptor limit if necessary:

  ```bash
  ulimit -n 65536
  ```

- If the model sits on a shared filesystem (e.g. NFS), try switching to a local storage path.
**3. Incorrect launch configuration**

- A wrong model path or bad launch arguments can keep the engine from loading. In particular, `--served-model-name` should be a unique name, not a filesystem path.

**Solution:**

- Change `--served-model-name` to a short identifier (a client-side usage sketch follows), for example:

  ```bash
  vllm serve /home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct \
    --dtype bfloat16 \
    --served-model-name Qwen-14B-Instruct \
    --max-model-len=16384 \
    --tensor-parallel-size 2 \
    --port 8000
  ```
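For reference, whatever is passed to `--served-model-name` is the value clients must send in the `model` field of the OpenAI-compatible API. A minimal request sketch (assuming the server is reachable on localhost:8000 and the `requests` package is installed):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen-14B-Instruct",  # must match --served-model-name
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```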
**4. Multi-GPU (tensor parallel) issues**

- `--tensor-parallel-size` requires every GPU to hold its shard of the model parameters.
- If GPU memory is insufficient, startup can fail.

**Solution:**

- Lower `--tensor-parallel-size` or reduce `max-model-len`:

  ```bash
  vllm serve ... --tensor-parallel-size 1 --max-model-len=8192
  ```

- Check GPU utilization and make sure every card has enough free memory (or use the Python sketch below):

  ```bash
  nvidia-smi
  ```
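The same free-memory check can be done from Python with standard PyTorch calls (a sketch, not part of vLLM):

```python
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values in bytes
    print(f"cuda:{i} {torch.cuda.get_device_name(i)}: "
          f"{free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
```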
**5. vLLM or dependency version issues**

- An incompatible vLLM version or mismatched dependencies (PyTorch, CUDA) can keep the engine from starting.

**Solution:**

- Use a compatible environment (creating a clean one is recommended):

  ```bash
  conda create -n vllm_env python=3.9
  conda activate vllm_env
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install vllm
  ```
**6. Problems with the model itself**

- An incomplete download of the Qwen2.5-14B model, or files that cannot be loaded, can also cause the failure.

**Solution:**

- Verify that the model files are all present (a quick integrity-check sketch follows this list):

  ```bash
  ls /home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct
  ```

- Re-download the model if anything is missing:

  ```bash
  pip install modelscope
  modelscope download --model qwen/Qwen2.5-14B-Instruct
  ```
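As a rough sanity check (a sketch only; the exact file list depends on the export, so treat the file names as assumptions), the directory should at least contain a config file, tokenizer files, and the safetensors shards:

```python
from pathlib import Path

model_dir = Path("/home/ca2/.cache/modelscope/hub/qwen/Qwen2___5-14B-Instruct")
expected = ["config.json", "generation_config.json", "tokenizer.json"]

for name in expected:
    print(f"{name}: {'ok' if (model_dir / name).exists() else 'MISSING'}")

shards = sorted(model_dir.glob("*.safetensors"))
total_gib = sum(p.stat().st_size for p in shards) / 1024**3
print(f"{len(shards)} safetensors shard(s), {total_gib:.1f} GiB total")
```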
**7. Debug information**

If the steps above do not resolve the issue, enable verbose logging and retry:

```bash
export VLLM_LOGGING_LEVEL=DEBUG
export NCCL_DEBUG=INFO
```

Then rerun the command and inspect the more detailed error log.

With the solutions above you should be able to narrow down and fix the vLLM startup failure step by step. If it still fails, share the complete debug log for further analysis.