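Contents
[NVIDIA GPU compute capability quick reference](#NVIDIA GPU compute capability quick reference)
[Installing vLLM](#Installing vLLM)
NVIDIA GPU compute capability quick reference
https://blog.csdn.net/jacke121/article/details/159576930
Installing vLLM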
```bash
pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# for vllm>=0.11.0
pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
python vllm_example.py
```
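After installing, it can help to confirm that the environment really ended up on the pinned versions, since another package can pull in a different one. The following is a minimal sketch, assuming the pins from the first command above; `PINS` and `find_mismatches` are illustrative names, not part of vLLM or pip.

```python
from importlib import metadata

# Pins copied from the vllm 0.9.0 install command; adjust them for 0.11.0.
PINS = {"vllm": "0.9.0", "transformers": "4.51.3", "numpy": "1.26.4"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs from its pin.
    for name, installed in find_mismatches(PINS).items():
        print(f"{name}: wanted {PINS[name]}, found {installed}")
```

An empty result means every pinned package is installed at exactly the requested version.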
Prebuilt flash-attn wheel for torch 2.8 (CUDA 12, Python 3.10, Linux x86_64): flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
Library conflicts:
torch 2.7.1+cu128
torchaudio 2.7.1+cu128
torchmetrics 1.9.0
torchvision 0.22.1+cu128
widgetsnbextension 4.0.15
x-transformers 2.7.2
xformers 0.0.31
xfuser 0.4.5
vllm 0.9.0
flash_attn 2.8.0.post2 (does not support torch 2.8)
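The flash-attn wheel filename above encodes the CUDA major version, the torch minor version, and the Python tag it was built for, so a mismatch with the installed torch can be spotted before installing. The sketch below assumes the naming pattern seen in that one filename; `parse_flash_attn_wheel` and `matches_torch` are hypothetical helpers, not part of flash-attn.

```python
import re

# Pattern inferred from the single wheel filename shown above (an assumption).
WHEEL_RE = re.compile(
    r"flash_attn-(?P<version>[^+]+)\+cu(?P<cuda>\d+)torch(?P<torch>\d+\.\d+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)-(?P<py>cp\d+)-"
)


def parse_flash_attn_wheel(filename):
    """Split the build tags out of a flash-attn wheel filename."""
    m = WHEEL_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return m.groupdict()


def matches_torch(filename, torch_version):
    """True if the wheel targets the same torch minor and CUDA major version."""
    info = parse_flash_attn_wheel(filename)
    base, _, local = torch_version.partition("+")  # '2.7.1+cu128' -> '2.7.1', 'cu128'
    torch_minor = ".".join(base.split(".")[:2])
    # A torch version without a '+cuXYZ' suffix gives no CUDA information to compare.
    cuda_ok = not local or local.startswith("cu" + info["cuda"])
    return torch_minor == info["torch"] and cuda_ok


WHEEL = "flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp310-cp310-linux_x86_64.whl"
print(matches_torch(WHEEL, "2.7.1+cu128"))  # torch version from the package list
```

With the torch 2.7.1+cu128 build from the package list, the check reports a mismatch, because the wheel was built against torch 2.8.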
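Your GPU is an RTX 5090 (compute capability 12.0), but vLLM 0.9.0 only includes kernels for SM 7.0-9.0, which is why the run below fails with "no kernel image is available for execution on the device".
vllm==0.11.0 supports compute capability 12.0.
The RTX 4080 has compute capability 8.9, the same as the RTX 4090, RTX 4070, and the rest of the 40 series.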
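The compatibility rule behind these statements can be sketched in a few lines. A compiled kernel (`sm_XY`) runs on a device with the same major version and an equal or higher minor version, while embedded PTX (`compute_XY`) can be JIT-compiled for any equal or newer device. The arch list below is an illustrative assumption matching the "SM 7.0-9.0" range mentioned above, not a list read from the vLLM wheel, and `can_run` is a hypothetical helper.

```python
import re

_ARCH_RE = re.compile(r"(sm|compute)_(\d{2,})[a-z]?")

# Assumed example list for a build that covers SM 7.0 through 9.0.
ASSUMED_VLLM_090_ARCHS = ["sm_70", "sm_75", "sm_80", "sm_86", "sm_89", "sm_90"]


def parse_arch(tag):
    """Turn 'sm_89' into ('sm', (8, 9)) and 'compute_120' into ('compute', (12, 0))."""
    m = _ARCH_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"unrecognized arch tag: {tag}")
    digits = m.group(2)
    # Simplification: a trailing letter such as the 'a' in 'sm_90a' is ignored.
    return m.group(1), (int(digits[:-1]), int(digits[-1]))


def can_run(device_cc, arch_list):
    """True if any entry in arch_list can execute on a device with capability device_cc."""
    for tag in arch_list:
        kind, cc = parse_arch(tag)
        if kind == "sm" and cc[0] == device_cc[0] and cc[1] <= device_cc[1]:
            return True  # binary-compatible compiled kernel
        if kind == "compute" and cc <= device_cc:
            return True  # PTX that the driver can JIT-compile
    return False


def local_torch_check():
    """Check the local PyTorch build; needs torch and a visible CUDA GPU."""
    import torch

    cc = torch.cuda.get_device_capability()
    return cc, can_run(cc, torch.cuda.get_arch_list())


print(can_run((12, 0), ASSUMED_VLLM_090_ARCHS))  # RTX 5090
print(can_run((8, 9), ASSUMED_VLLM_090_ARCHS))   # RTX 4080
```

Note that `local_torch_check` only inspects the PyTorch build. vLLM ships its own compiled extensions with their own arch list, so a PyTorch build that supports the GPU does not guarantee that vLLM's kernels do.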
Output of running /data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py with vllm 0.9.0:
INFO 03-28 11:23:44 __init__.py:243 Automatically detected platform cuda.
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
2026-03-28 11:23:52,335 INFO input frame rate=25
/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
/data/lbg/envs/flashtalk/lib/python3.10/site-packages/pyworld/__init__.py:13: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
/data/lbg/envs/flashtalk/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
warnings.warn(
2026-03-28 11:23:56,030 INFO no frontend is avaliable
INFO 03-28 11:23:59 __init__.py:31 Available plugins for group vllm.general_plugins:
INFO 03-28 11:23:59 __init__.py:33 - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 03-28 11:23:59 __init__.py:36 All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-28 11:24:06 config.py:793 This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
WARNING 03-28 11:24:06 arg_utils.py:1583 --enable-prompt-embeds is not supported by the V1 Engine. Falling back to V0.
INFO 03-28 11:24:06 llm_engine.py:230 Initializing a V0 LLM engine (v0.9.0) with config: model='/data/lbg/models/CosyVoice3-0.5B/vllm', speculative_config=None, tokenizer='/data/lbg/models/CosyVoice3-0.5B/vllm', skip_tokenizer_init=True, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/lbg/models/CosyVoice3-0.5B/vllm, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False,
INFO 03-28 11:24:06 cuda.py:292 Using Flash Attention backend.
W328 11:24:17.372136289 socket.cpp:200 c10d The hostname of the client socket cannot be retrieved. err=-3
INFO 03-28 11:24:22 parallel_state.py:1064 rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 03-28 11:24:22 model_runner.py:1170 Starting to load model /data/lbg/models/CosyVoice3-0.5B/vllm...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.01it/s]
INFO 03-28 11:24:23 default_loader.py:280 Loading weights took 0.52 seconds
INFO 03-28 11:24:23 model_runner.py:1202 Model loading took 0.7001 GiB and 0.599737 seconds
rank0: Traceback (most recent call last):
rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 40, in <module>
rank0: main()
rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 36, in main
rank0: cosyvoice3_example()
rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/vllm_example.py", line 26, in cosyvoice3_example
rank0: cosyvoice = AutoModel(model_dir='/data/lbg/models/CosyVoice3-0.5B', load_trt=True, load_vllm=True, fp16=False)
rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/cosyvoice.py", line 236, in AutoModel
rank0: return CosyVoice3(**kwargs)
rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/cosyvoice.py", line 217, in __init__
rank0: self.model.load_vllm('{}/vllm'.format(model_dir))
rank0: File "/data/lbg/project/cosyvoice/CosyVoice-main/cosyvoice/cli/model.py", line 288, in load_vllm
rank0: self.llm.vllm = LLMEngine.from_engine_args(engine_args)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 501, in from_engine_args
rank0: return engine_cls.from_vllm_config(
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 477, in from_vllm_config
rank0: return cls(
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
rank0: self._initialize_kv_caches()
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
rank0: self.model_executor.determine_num_available_blocks())
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
rank0: results = self.collective_rpc("determine_num_available_blocks")
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
rank0: answer = run_method(self.driver_worker, method, args, kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/utils.py", line 2605, in run_method
rank0: return func(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in determine_num_available_blocks
rank0: self.model_runner.profile_run()
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1299, in profile_run
rank0: self._dummy_run(max_num_batched_tokens, max_num_seqs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1425, in _dummy_run
rank0: self.execute_model(model_input, kv_caches, intermediate_tensors)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1843, in execute_model
rank0: hidden_or_intermediate_states = model_executable(
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 481, in forward
rank0: hidden_states = self.model(input_ids, positions, intermediate_tensors,
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172, in __call__
rank0: return self.forward(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 358, in forward
rank0: hidden_states, residual = layer(
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 257, in forward
rank0: hidden_states = self.self_attn(
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 184, in forward
rank0: qkv, _ = self.qkv_proj(hidden_states)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 486, in forward
rank0: output_parallel = self.quant_method.apply(self, input_, bias)
rank0: File "/data/lbg/envs/flashtalk/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 202, in apply
rank0: return dispatch_unquantized_gemm()(x, layer.weight, bias)
rank0: RuntimeError: CUDA error: no kernel image is available for execution on the device
rank0: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank0: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
rank0: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
rank0:W328 11:24:24.967138206 ProcessGroupNCCL.cpp:1479 Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
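The error message notes that CUDA errors are reported asynchronously, so the stack trace may point at the wrong call, and it suggests setting CUDA_LAUNCH_BLOCKING=1. Below is a minimal sketch of re-running the script that way from Python; `debug_env` and `rerun_blocking` are illustrative helper names, and only the environment variable itself comes from the log.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Copy the environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_blocking(script):
    """Run `script` with the current interpreter; return (exit code, last stderr lines)."""
    proc = subprocess.run(
        [sys.executable, script],
        env=debug_env(),
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr.splitlines()[-5:]
```

Calling `rerun_blocking("vllm_example.py")` should make the traceback point at the kernel launch that actually fails. For this particular failure the root cause is still the missing SM 12.0 kernels, so blocking mode helps locate the failing call but does not fix it.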