Table of Contents
[NVIDIA GPU Compute Capability Quick Reference](#NVIDIA-GPU-Compute-Capability-Quick-Reference)
[vLLM Installation](#vLLM-Installation)
NVIDIA GPU Compute Capability Quick Reference
https://blog.csdn.net/jacke121/article/details/159576930
vLLM Installation
```bash
pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

# for vllm>=0.11.0
pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

python vllm_example.py
```
flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
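After installing, it is worth confirming that the environment actually resolved to the pinned versions (a minimal sketch; use whichever of the two pip commands above matches your setup):

```python
# Print the versions of the packages pinned by the install commands above.
import numpy
import transformers
import vllm

for name, module in (("vllm", vllm), ("transformers", transformers), ("numpy", numpy)):
    print(f"{name}: {module.__version__}")
```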
Library conflicts:
torch 2.7.1+cu128
torchaudio 2.7.1+cu128
torchmetrics 1.9.0
torchvision 0.22.1+cu128
widgetsnbextension 4.0.15
x-transformers 2.7.2
xformers 0.0.31
xfuser 0.4.5
vllm 0.9.0
flash_attn 2.8.0.post2 does not support torch 2.8.
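The torch / flash_attn mismatch usually shows up as soon as flash_attn is imported; a small probe like the one below can confirm it (a sketch, assuming both packages are installed in the same environment; version strings in the comments are illustrative):

```python
import torch

try:
    import flash_attn
    print("torch     :", torch.__version__)       # e.g. 2.7.1+cu128
    print("flash_attn:", flash_attn.__version__)  # e.g. 2.8.0.post2
except ImportError as e:
    # A prebuilt flash_attn wheel compiled against a different torch ABI
    # (e.g. the cu12torch2.8 wheel above on a torch 2.7.1 install) typically
    # fails here with an undefined-symbol ImportError.
    print("flash_attn failed to import:", e)
```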
The GPU here is an RTX 5090 (Compute Capability 12.0), while vLLM 0.9.0 only ships kernels built for SM 7.0-9.0.
vllm==v0.11.0 supports compute capability 12.0.
For reference, the RTX 4080's Compute Capability is 8.9, the same as the RTX 4090, RTX 4070, and the rest of the 40 series.
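A quick way to check whether this mismatch applies on a given machine is to compare the GPU's compute capability with the SM architectures compiled into the installed PyTorch build (a minimal sketch; it only covers PyTorch's own kernels, and vLLM's compiled extensions must separately support SM 12.0):

```python
import torch

# Compare the running GPU's compute capability with the architectures
# this PyTorch build was compiled for.
major, minor = torch.cuda.get_device_capability(0)   # (12, 0) on an RTX 5090
print(f"GPU compute capability: sm_{major}{minor}")
print("Architectures in this torch build:", torch.cuda.get_arch_list())
# If sm_120 (or a compute_120 PTX fallback) is missing from the list, CUDA
# kernels cannot be dispatched on this GPU and calls fail with
# "no kernel image is available for execution on the device".
```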
/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py
INFO 03-28 11:23:44 [__init__.py:243] Automatically detected platform cuda.
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
2026-03-28 11:23:52,335 INFO input frame rate=25
/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
/data/lbg/envs/flashtalk/lib/python3.10/site-packages/pyworld/__init__.py:13: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
/data/lbg/envs/flashtalk/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
warnings.warn(
2026-03-28 11:23:56,030 INFO no frontend is avaliable
INFO 03-28 11:23:59 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 03-28 11:23:59 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 03-28 11:23:59 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-28 11:24:06 [config.py:793] This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
WARNING 03-28 11:24:06 [arg_utils.py:1583] --enable-prompt-embeds is not supported by the V1 Engine. Falling back to V0.
INFO 03-28 11:24:06 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.0) with config: model='/data/lbg/models/CosyVoice3-0.5B/vllm', speculative_config=None, tokenizer='/data/lbg/models/CosyVoice3-0.5B/vllm', skip_tokenizer_init=True, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/lbg/models/CosyVoice3-0.5B/vllm, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False,
INFO 03-28 11:24:06 [cuda.py:292] Using Flash Attention backend.
W328 11:24:17.372136289 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
INFO 03-28 11:24:22 [parallel_state.py:1064] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 03-28 11:24:22 [model_runner.py:1170] Starting to load model /data/lbg/models/CosyVoice3-0.5B/vllm...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.01it/s]
INFO 03-28 11:24:23 [default_loader.py:280] Loading weights took 0.52 seconds
INFO 03-28 11:24:23 [model_runner.py:1202] Model loading took 0.7001 GiB and 0.599737 seconds
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 40, in <module>
[rank0]:     main()
[rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 36, in main
[rank0]:     cosyvoice3_example()
[rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 26, in cosyvoice3_example
[rank0]:     cosyvoice = AutoModel(model_dir='/data/lbg/models/CosyVoice3-0.5B', load_trt=True, load_vllm=True, fp16=False)
[rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/cosyvoice.py", line 236, in AutoModel
[rank0]:     return CosyVoice3(**kwargs)
[rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/cosyvoice.py", line 217, in __init__
[rank0]:     self.model.load_vllm('{}/vllm'.format(model_dir))
[rank0]:   File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/model.py", line 288, in load_vllm
[rank0]:     self.llm.vllm = LLMEngine.from_engine_args(engine_args)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 501, in from_engine_args
[rank0]:     return engine_cls.from_vllm_config(
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 477, in from_vllm_config
[rank0]:     return cls(
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
[rank0]:     results = self.collective_rpc("determine_num_available_blocks")
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/utils.py", line 2605, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1299, in profile_run
[rank0]:     self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1425, in _dummy_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1843, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 481, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, intermediate_tensors,
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172, in __call__
[rank0]:     return self.forward(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 358, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 257, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 184, in forward
[rank0]:     qkv, _ = self.qkv_proj(hidden_states)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 486, in forward
[rank0]:     output_parallel = self.quant_method.apply(self, input_, bias)
[rank0]:   File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 202, in apply
[rank0]:     return dispatch_unquantized_gemm()(x, layer.weight, bias)
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:[W328 11:24:24.967138206 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
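To localize where the "no kernel image is available" error comes from, a bare PyTorch op on the GPU can be tried first (an assumption about how to narrow things down, not part of the original script): if even this fails, the torch build itself lacks SM 12.0 kernels; if it passes, the missing kernels are in vLLM's own compiled extensions, which is what the upgrade to vllm==v0.11.0 above addresses.

```python
import torch

# Minimal check: does a plain bf16 matmul run on this GPU at all?
x = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
y = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
print((x @ y).sum().item())
# Fails with "no kernel image is available" -> the torch build lacks SM 12.0 support.
# Succeeds -> the failing kernels are vLLM's own CUDA extensions (built for SM 7.0-9.0),
#             so upgrading vLLM (or rebuilding it for SM 12.0) is required.
```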