Deploying the ERNIE 21B model on heterogeneous Hygon DCUs on SCNet fails with HIP out of memory (not working yet)

The command I used:

vllm serve baidu/ERNIE-4.5-21B-A3B-Base-PT --tensor-parallel-size 4 --trust-remote-code --block-size 8 --max-model-len 4096 --gpu-memory-utilization 0.85 --dtype float --kv_cache_dtype fp8

Error:

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 63.61 GiB is allocated by PyTorch, and 896.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
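
A rough back-of-the-envelope check on why `--dtype float` might run out of memory. As far as I know, `float` selects float32 in vLLM, so each parameter takes 4 bytes. The sketch assumes about 21e9 parameters (read only from the "21B" in the model name), counts weights only (no activations, no KV cache), and `weights_gib` is a hypothetical helper, not part of vLLM.

```python
GIB = 1024 ** 3

# Assumption: parameter count read off the "21B" in the model name.
N_PARAMS = 21e9
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weights_gib(dtype: str, shards: int = 1) -> float:
    """Approximate weight memory per card in GiB (weights only)."""
    return N_PARAMS * BYTES_PER_PARAM[dtype] / shards / GIB


if __name__ == "__main__":
    capacity_gib = 63.98  # total capacity reported in the error message
    for dtype in ("float32", "float16"):
        for shards in (1, 2, 4):
            need = weights_gib(dtype, shards)
            verdict = "fits" if need < capacity_gib else "exceeds capacity"
            print(f"{dtype:8s} shards={shards}: {need:6.2f} GiB ({verdict})")
```

Split evenly over 4 cards, float32 weights would need roughly 20 GiB per card, while an unsharded float32 copy would need about 78 GiB, more than one card holds. The float16 run later in this post logs 40.33 GiB of weights per worker at tensor-parallel size 2, which is close to the unsharded float16 estimate, so the weights may not be getting split across cards on this code path. That is a guess from the numbers, not something I have confirmed.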

Next I tried this command:

vllm serve baidu/ERNIE-4.5-21B-A3B-Base-PT --gpu-memory-utilization 0.92  --tensor-parallel-size 2 --max-num-seqs 32 --max-model-len 2000  --tensor-parallel-size 2  --dtype float16

Error:

(VllmWorkerProcess pid=31673) INFO 10-15 15:16:56 [model_runner.py:1156] Model loading took 40.3319 GiB and 63.837660 seconds
INFO 10-15 15:16:56 [model_runner.py:1156] Model loading took 40.3348 GiB and 63.986810 seconds
(VllmWorkerProcess pid=31673) /usr/local/lib/python3.10/dist-packages/vllm/attention/backends/rocm_flash_attn.py:1025: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
(VllmWorkerProcess pid=31673)   sub_out = torch.nn.functional.scaled_dot_product_attention(
/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/rocm_flash_attn.py:1025: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
  sub_out = torch.nn.functional.scaled_dot_product_attention(
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
ERROR 10-15 15:17:04 [engine.py:453] expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
ERROR 10-15 15:17:04 [engine.py:453] Traceback (most recent call last):
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 441, in run_mp_engine
ERROR 10-15 15:17:04 [engine.py:453]     engine = MQLLMEngine.from_vllm_config(
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 10-15 15:17:04 [engine.py:453]     return cls(
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 10-15 15:17:04 [engine.py:453]     self.engine = LLMEngine(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 280, in __init__
ERROR 10-15 15:17:04 [engine.py:453]     self._initialize_kv_caches()
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 427, in _initialize_kv_caches
ERROR 10-15 15:17:04 [engine.py:453]     self.model_executor.determine_num_available_blocks())
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
ERROR 10-15 15:17:04 [engine.py:453]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 334, in collective_rpc
ERROR 10-15 15:17:04 [engine.py:453]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 10-15 15:17:04 [engine.py:453]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2595, in run_method
ERROR 10-15 15:17:04 [engine.py:453]     return func(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-15 15:17:04 [engine.py:453]     return func(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in determine_num_available_blocks
ERROR 10-15 15:17:04 [engine.py:453]     self.model_runner.profile_run()
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-15 15:17:04 [engine.py:453]     return func(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1253, in profile_run
ERROR 10-15 15:17:04 [engine.py:453]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1379, in _dummy_run
ERROR 10-15 15:17:04 [engine.py:453]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-15 15:17:04 [engine.py:453]     return func(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1796, in execute_model
ERROR 10-15 15:17:04 [engine.py:453]     hidden_or_intermediate_states = model_executable(
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 10-15 15:17:04 [engine.py:453]     return self.forward(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/transformers.py", line 422, in forward
ERROR 10-15 15:17:04 [engine.py:453]     model_output = self.model(input_ids, positions, intermediate_tensors,
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return self._call_impl(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return forward_call(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/transformers.py", line 329, in forward
ERROR 10-15 15:17:04 [engine.py:453]     hidden_states = self.model(
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return self._call_impl(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return forward_call(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 1064, in wrapper
ERROR 10-15 15:17:04 [engine.py:453]     outputs = func(self, *args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 558, in forward
ERROR 10-15 15:17:04 [engine.py:453]     hidden_states = decoder_layer(
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_layers.py", line 94, in __call__
ERROR 10-15 15:17:04 [engine.py:453]     return super().__call__(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return self._call_impl(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return forward_call(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
ERROR 10-15 15:17:04 [engine.py:453]     return func(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 459, in forward
ERROR 10-15 15:17:04 [engine.py:453]     hidden_states = self.mlp(hidden_states)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return self._call_impl(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return forward_call(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 344, in forward
ERROR 10-15 15:17:04 [engine.py:453]     router_logits = self.gate(hidden_states.float())
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return self._call_impl(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-15 15:17:04 [engine.py:453]     return forward_call(*args, **kwargs)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 352, in forward
ERROR 10-15 15:17:04 [engine.py:453]     output = self.quant_method.apply(self, x, bias)
ERROR 10-15 15:17:04 [engine.py:453]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 222, in apply
ERROR 10-15 15:17:04 [engine.py:453]     return dispatch_unquantized_gemm()(x, layer.weight, bias)
ERROR 10-15 15:17:04 [engine.py:453] RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2595, in run_method
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in determine_num_available_blocks
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     self.model_runner.profile_run()
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1253, in profile_run
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1379, in _dummy_run
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1796, in execute_model
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/transformers.py", line 422, in forward
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     model_output = self.model(input_ids, positions, intermediate_tensors,
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/transformers.py", line 329, in forward
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     hidden_states = self.model(
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 1064, in wrapper
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     outputs = func(self, *args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 558, in forward
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     hidden_states = decoder_layer(
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_layers.py", line 94, in __call__
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return super().__call__(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 459, in forward
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 344, in forward
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     router_logits = self.gate(hidden_states.float())
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 352, in forward
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     output = self.quant_method.apply(self, x, bias)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 222, in apply
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238]     return dispatch_unquantized_gemm()(x, layer.weight, bias)
(VllmWorkerProcess pid=31673) ERROR 10-15 15:17:04 [multiproc_worker_utils.py:238] RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
INFO 10-15 15:17:04 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 455, in run_mp_engine
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 441, in run_mp_engine
    engine = MQLLMEngine.from_vllm_config(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
    return cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 280, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 427, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
    results = self.collective_rpc("determine_num_available_blocks")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 334, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2595, in run_method
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1253, in profile_run
    self._dummy_run(max_num_batched_tokens, max_num_seqs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1379, in _dummy_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1796, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/transformers.py", line 422, in forward
    model_output = self.model(input_ids, positions, intermediate_tensors,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/transformers.py", line 329, in forward
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 1064, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 558, in forward
    hidden_states = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 459, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py", line 344, in forward
    router_logits = self.gate(hidden_states.float())
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 352, in forward
    output = self.quant_method.apply(self, x, bias)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 222, in apply
    return dispatch_unquantized_gemm()(x, layer.weight, bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
root@notebook-1978259446016311297-ac7sc1ejvp-24335:/# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
 
root@notebook-1978259446016311297-ac7sc1ejvp-24335:/# echo $HIP_VISIBLE_DEVICES
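
The three tracebacks above look like the same failure reported by the engine, the worker, and the spawned process. Reading the innermost frames, `self.gate(hidden_states.float())` in `modeling_ernie4_5_moe.py` casts the activations to float32, and the matmul in vLLM's `linear.py` then fails because the other operand is `c10::Half` (float16), presumably the gate weight loaded under `--dtype float16`. To pull that out of a long interleaved log, a small stdlib-only helper along these lines may help. `strip_prefix` and `root_cause` are hypothetical names, and the prefix pattern is an assumption based only on the log lines shown in this post.

```python
import re
from collections import deque

# Assumed shape of the vLLM log prefix, taken from the lines in this post.
PREFIX = re.compile(
    r"^(?:\(\w+ pid=\d+\) )?"            # optional "(VllmWorkerProcess pid=N) "
    r"(?:DEBUG|INFO|WARNING|ERROR) "     # log level
    r"\d\d-\d\d \d\d:\d\d:\d\d "         # "10-15 15:17:04 "
    r"\[[^\]]+\] ?"                      # "[engine.py:453] "
)
# A line such as "RuntimeError: ..." or "torch.OutOfMemoryError: ...".
EXCEPTION = re.compile(r"^\w+(?:\.\w+)*(?:Error|Exception): ")


def strip_prefix(line: str) -> str:
    """Remove the log prefix, if present, and keep the traceback text."""
    return PREFIX.sub("", line)


def root_cause(log_text: str, frames: int = 5):
    """Return (first exception line, source lines of the last `frames` frames)."""
    recent = deque(maxlen=frames)
    for raw in log_text.splitlines():
        line = strip_prefix(raw)
        if EXCEPTION.match(line):
            return line, list(recent)
        if line.startswith("    "):  # traceback source lines are indented 4 spaces
            recent.append(line.strip())
    return None, []
```

Saving the console output to a file and calling `root_cause(open(path).read())` should then give the exception line together with the last few source lines that led to it.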

Let's try switching to bfloat16:

vllm serve baidu/ERNIE-4.5-21B-A3B-Base-PT --gpu-memory-utilization 0.92  --tensor-parallel-size 2 --max-num-seqs 32 --max-model-len 2000  --tensor-parallel-size 2  --dtype bfloat16

Now a new error shows up (the part I captured below contains only warnings):

(VllmWorkerProcess pid=2441) INFO 10-15 15:23:03 [model_runner.py:1156] Model loading took 40.3319 GiB and 24.727362 seconds
/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/rocm_flash_attn.py:1025: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
  sub_out = torch.nn.functional.scaled_dot_product_attention(
(VllmWorkerProcess pid=2441) /usr/local/lib/python3.10/dist-packages/vllm/attention/backends/rocm_flash_attn.py:1025: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)

Let's try the DeepSeek 7B model instead:

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B  --dtype float16  --trust-remote-code --gpu-memory-utilization 0.98 --tensor-parallel-size 2  --max-num-seqs 32 --max-model-len 40000

Look, this one starts up just fine:

INFO 10-15 15:38:13 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 10-15 15:38:13 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 10-15 15:38:13 [launcher.py:28] Available routes are:
INFO 10-15 15:38:13 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
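
With the server listening on port 8000, a quick smoke test is to send one completion request. The sketch below uses only the standard library and assumes vLLM's OpenAI-compatible `/v1/completions` route is reachable at `http://localhost:8000` from the same machine. `build_completion_request` and `complete` are hypothetical helpers, and the `max_tokens` default is arbitrary.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: querying from the same machine
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # name passed to `vllm serve`


def build_completion_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Hello")` while the server is up should return some generated text. A connection error here would point at the server or the port rather than at the model.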

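GPU memory usage sits at 90%. Going by that, maybe 2 cards are not enough and 4 are needed? The problem is that I could not get it running on 4 cards either!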
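
A note on the memory usage figure: as far as I understand vLLM, `--gpu-memory-utilization` caps the fraction of each card the engine may use, and whatever is left under that cap after the weights are loaded is pre-allocated for activations and the KV cache. High usage on its own therefore does not show that a model barely fits. The sketch below plugs in the numbers from the ERNIE float16 run (63.98 GiB capacity, cap 0.92, 40.33 GiB of weights per worker); `remaining_gib` is a hypothetical helper.

```python
def remaining_gib(capacity_gib: float, utilization: float, weights_gib: float) -> float:
    """GiB left under the utilization cap once the weights are loaded."""
    return capacity_gib * utilization - weights_gib


if __name__ == "__main__":
    # Numbers taken from the logs in this post.
    left = remaining_gib(capacity_gib=63.98, utilization=0.92, weights_gib=40.33)
    print(f"left for activations and KV cache: {left:.2f} GiB")
```

By this estimate the float16 weights fit on 2 cards with room to spare, which matches the log: that run got past model loading and stopped on the dtype mismatch, not on memory.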

Not solved: it failed on 1 card, 2 cards, and 4 cards alike.
