vLLM pitfalls: matching GPU compute capability

Contents

[NVIDIA GPU compute capability quick reference](#NVIDIA GPU compute capability quick reference)

[Installing vLLM:](#Installing vLLM:)

Library conflicts:


NVIDIA GPU compute capability quick reference

https://blog.csdn.net/jacke121/article/details/159576930

Installing vLLM:

pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# for vllm>=0.11.0
pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
python vllm_example.py
Prebuilt flash_attn wheel for torch 2.8 (CUDA 12, CPython 3.10, Linux x86_64):

flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
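The wheel filename above encodes the CUDA major version, the torch major.minor version, and the Python tag it was built for. The following is a minimal sketch of reading those tags before installing, so a torch mismatch shows up early. The regular expression is an assumption generalized from this single filename, not an official naming specification, and `parse_flash_attn_wheel` and `wheel_matches_torch` are hypothetical helper names.

```python
import re

# Pattern inferred from the one filename in this post (an assumption,
# not an official specification of the flash_attn wheel naming scheme).
WHEEL_RE = re.compile(
    r"^flash_attn-(?P<version>[^+]+)\+cu(?P<cuda>\d+)torch(?P<torch>\d+\.\d+)"
    r"cxx11abi(?P<cxx11abi>TRUE|FALSE)-(?P<python>cp\d+)-(?P<abi>cp\d+)"
    r"-(?P<platform>.+)\.whl$"
)


def parse_flash_attn_wheel(filename):
    """Split a flash_attn wheel filename into its build tags."""
    match = WHEEL_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return match.groupdict()


def wheel_matches_torch(filename, torch_version):
    """Return True if the wheel's torch tag equals torch's major.minor version."""
    wanted = parse_flash_attn_wheel(filename)["torch"]
    # "2.7.1+cu128" -> "2.7.1" -> "2.7"
    installed = ".".join(torch_version.split("+")[0].split(".")[:2])
    return wanted == installed


if __name__ == "__main__":
    name = "flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp310-cp310-linux_x86_64.whl"
    print(parse_flash_attn_wheel(name))
    # torch 2.7.1+cu128 is the version listed under "Library conflicts".
    print(wheel_matches_torch(name, "2.7.1+cu128"))
```

Comparing only major.minor is a deliberate simplification: the wheel name carries no patch version for torch, so that is all the filename can tell you.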

Library conflicts:

torch 2.7.1+cu128

torchaudio 2.7.1+cu128

torchmetrics 1.9.0

torchvision 0.22.1+cu128

widgetsnbextension 4.0.15

x-transformers 2.7.2

xformers 0.0.31

xfuser 0.4.5

vllm 0.9.0

flash_attn 2.8.0.post2 does not support torch 2.8.
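A version listing like the one above can be collected from inside the environment with the standard library. This is a small sketch; `installed_versions` is a hypothetical helper name, and the package list is simply the subset of names from this post that matter for the vLLM conflict.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    packages = ["torch", "torchaudio", "torchvision", "xformers", "vllm", "flash_attn"]
    for name, ver in installed_versions(packages).items():
        print(f"{name:12s} {ver or 'not installed'}")
```

Running this before and after a `pip install` makes it easy to see whether pip silently replaced torch while resolving vLLM's dependencies.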

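The GPU in this setup is an RTX 5090 (Compute Capability 12.0), whereas vLLM 0.9.0 only ships kernels for SM 7.0-9.0, so none of its prebuilt kernels can run on this card.

vllm==v0.11.0 supports compute capability 12.0.

The RTX 4080 has Compute Capability 8.9, the same as the RTX 4090, RTX 4070, and the rest of the 40 series.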
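The compute capability figures quoted above can be checked on the machine itself before picking a vLLM version. The sketch below assumes PyTorch is installed for the detection part and treats the "SM 7.0-9.0" range from this post as a simple closed interval. That is a simplification, since real builds target a discrete list of architectures. `covered` and `detect_capability` are hypothetical helper names.

```python
# Kernel range for vLLM 0.9.0 as stated in this post (SM 7.0-9.0).
# Treating it as a closed interval is an assumption made for illustration.
VLLM_090_SM_RANGE = ((7, 0), (9, 0))


def covered(capability, sm_range=VLLM_090_SM_RANGE):
    """Return True if (major, minor) lies inside the inclusive SM range."""
    low, high = sm_range
    return low <= tuple(capability) <= high


def detect_capability():
    """Return (major, minor) for GPU 0, or None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected (or torch is not installed).")
    elif covered(cap):
        print(f"SM {cap[0]}.{cap[1]} is inside the range quoted for vLLM 0.9.0.")
    else:
        print(f"SM {cap[0]}.{cap[1]} is outside that range; "
              "this post moved to vllm==v0.11.0 for 12.0.")
```

With the numbers from this post, `covered((8, 9))` holds for the RTX 40 series while `covered((12, 0))` does not for the RTX 5090.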

Log from running /data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py with vLLM 0.9.0:

INFO 03-28 11:23:44 __init__.py:243 Automatically detected platform cuda.

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.

2026-03-28 11:23:52,335 INFO input frame rate=25

/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.

WeightNorm.apply(module, name, dim)

/data/lbg/envs/flashtalk/lib/python3.10/site-packages/pyworld/__init__.py:13: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.

import pkg_resources

/data/lbg/envs/flashtalk/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'

warnings.warn(

2026-03-28 11:23:56,030 INFO no frontend is avaliable

INFO 03-28 11:23:59 __init__.py:31 Available plugins for group vllm.general_plugins:

INFO 03-28 11:23:59 __init__.py:33 - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver

INFO 03-28 11:23:59 __init__.py:36 All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.

INFO 03-28 11:24:06 config.py:793 This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.

WARNING 03-28 11:24:06 arg_utils.py:1583 --enable-prompt-embeds is not supported by the V1 Engine. Falling back to V0.

INFO 03-28 11:24:06 llm_engine.py:230 Initializing a V0 LLM engine (v0.9.0) with config: model='/data/lbg/models/CosyVoice3-0.5B/vllm', speculative_config=None, tokenizer='/data/lbg/models/CosyVoice3-0.5B/vllm', skip_tokenizer_init=True, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/lbg/models/CosyVoice3-0.5B/vllm, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False,

INFO 03-28 11:24:06 cuda.py:292 Using Flash Attention backend.

W328 11:24:17.372136289 socket.cpp:200 c10d The hostname of the client socket cannot be retrieved. err=-3

INFO 03-28 11:24:22 parallel_state.py:1064 rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0

INFO 03-28 11:24:22 model_runner.py:1170 Starting to load model /data/lbg/models/CosyVoice3-0.5B/vllm...

Loading safetensors checkpoint shards: 0% Completed | 0/1 00:00

Loading safetensors checkpoint shards: 100% Completed | 1/1 00:00<00:00, 2.01it/s

INFO 03-28 11:24:23 default_loader.py:280 Loading weights took 0.52 seconds

INFO 03-28 11:24:23 model_runner.py:1202 Model loading took 0.7001 GiB and 0.599737 seconds

rank0: Traceback (most recent call last):

rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 40, in <module>

rank0: main()

rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 36, in main

rank0: cosyvoice3_example()

rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 26, in cosyvoice3_example

rank0: cosyvoice = AutoModel(model_dir='/data/lbg/models/CosyVoice3-0.5B', load_trt=True, load_vllm=True, fp16=False)

rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/cosyvoice.py", line 236, in AutoModel

rank0: return CosyVoice3(**kwargs)

rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/cosyvoice.py", line 217, in __init__

rank0: self.model.load_vllm('{}/vllm'.format(model_dir))

rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/model.py", line 288, in load_vllm

rank0: self.llm.vllm = LLMEngine.from_engine_args(engine_args)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 501, in from_engine_args

rank0: return engine_cls.from_vllm_config(

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 477, in from_vllm_config

rank0: return cls(

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 268, in __init__

rank0: self._initialize_kv_caches()

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches

rank0: self.model_executor.determine_num_available_blocks())

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks

rank0: results = self.collective_rpc("determine_num_available_blocks")

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc

rank0: answer = run_method(self.driver_worker, method, args, kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/utils.py", line 2605, in run_method

rank0: return func(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context

rank0: return func(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in determine_num_available_blocks

rank0: self.model_runner.profile_run()

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context

rank0: return func(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1299, in profile_run

rank0: self._dummy_run(max_num_batched_tokens, max_num_seqs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1425, in _dummy_run

rank0: self.execute_model(model_input, kv_caches, intermediate_tensors)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context

rank0: return func(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1843, in execute_model

rank0: hidden_or_intermediate_states = model_executable(

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl

rank0: return self._call_impl(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl

rank0: return forward_call(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 481, in forward

rank0: hidden_states = self.model(input_ids, positions, intermediate_tensors,

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172, in __call__

rank0: return self.forward(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 358, in forward

rank0: hidden_states, residual = layer(

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl

rank0: return self._call_impl(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl

rank0: return forward_call(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 257, in forward

rank0: hidden_states = self.self_attn(

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl

rank0: return self._call_impl(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl

rank0: return forward_call(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 184, in forward

rank0: qkv, _ = self.qkv_proj(hidden_states)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl

rank0: return self._call_impl(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl

rank0: return forward_call(*args, **kwargs)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 486, in forward

rank0: output_parallel = self.quant_method.apply(self, input_, bias)

rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 202, in apply

rank0: return dispatch_unquantized_gemm()(x, layer.weight, bias)

rank0: RuntimeError: CUDA error: no kernel image is available for execution on the device

rank0: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

rank0: For debugging consider passing CUDA_LAUNCH_BLOCKING=1

rank0: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

rank0:W328 11:24:24.967138206 ProcessGroupNCCL.cpp:1479 Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
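The traceback above ends in `no kernel image is available for execution on the device`, and the log itself suggests `CUDA_LAUNCH_BLOCKING=1` for debugging. Here is a hedged sketch of both steps: setting that variable before CUDA is initialized, and recognizing the architecture-mismatch message so it can be reported separately from other CUDA errors. `is_arch_mismatch` and `torch_arch_list` are hypothetical helper names. Note that `torch.cuda.get_arch_list()` only reports the architectures PyTorch itself was built for, not those of vLLM's own kernels.

```python
import os

# Needs to be set before the first CUDA call so kernel launches run
# synchronously and the stack trace points at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Message text copied from the log in this post.
ARCH_MISMATCH_MARKER = "no kernel image is available for execution on the device"


def is_arch_mismatch(message):
    """Return True if an error message indicates a missing kernel for this GPU."""
    return ARCH_MISMATCH_MARKER in message


def torch_arch_list():
    """Return the architectures PyTorch was compiled for, or None without torch."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.get_arch_list()


if __name__ == "__main__":
    sample = "RuntimeError: CUDA error: no kernel image is available for execution on the device"
    if is_arch_mismatch(sample):
        print("Likely a compute capability mismatch rather than a model or data bug.")
        print("PyTorch build targets:", torch_arch_list())
```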
