Try this model: Qwen3-Coder-30B-A3B-Instruct
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
And also this one: Qwen3-30B-A3B-Instruct-2507-FP8
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
For environment setup, see: https://blog.csdn.net/skywalk8163/article/details/155802417
Upgrading vLLM
Go to the SCNet help page "基础环境搭建-以常用框架vLLM为例 | 超算互联网" to get the vLLM download link from the 光合 developer community.
Download it locally:
wget https://download.sourcefind.cn:65024/directlink/4/vllm/DAS1.7/vllm-0.9.2+das.opt1.dtk25042-cp310-cp310-manylinux_2_28_x86_64.whl
Then install it:
pip install vllm-0.9.2+das.opt1.dtk25042-cp310-cp310-manylinux_2_28_x86_64.whl
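A quick sanity check after installing the wheel; this is a minimal sketch that only confirms the DAS build imports and the cards are visible (the expected version string is taken from the wheel's file name):
# confirm the freshly installed DAS build of vLLM is the one being imported
import vllm
import torch

print("vllm:", vllm.__version__)                  # should look like 0.9.2+das.opt1.dtk25042
print("HIP device available:", torch.cuda.is_available())   # ROCm/DTK is exposed via the cuda API
print("device count:", torch.cuda.device_count())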
Inference
vllm serve "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct"
Also try this: # Load and run the model:
vllm serve "lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit"
Try this one:
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-30B-A3B-Instruct-2507-FP8/main/Qwen3-30B-A3B-Instruct-2507-FP8
vllm serve "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-30B-A3B-Instruct-2507-FP8/main/Qwen3-30B-A3B-Instruct-2507-FP8"
No luck. Add these two flags and see what happens: --gpu_memory_utilization 0.95 --max_model_len 68152. It feels like there is light at the end of the tunnel.
vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.95 --max_model_len 68152
Still too large:
ValueError: The model's max seq len (68152) is larger than the maximum number of tokens that can be stored in KV cache (56448). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
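For context, the KV cache gets whatever VRAM is left after the weights are loaded, so its token cap scales with --gpu_memory_utilization. A back-of-the-envelope sketch of what the extra 3% buys on a 64 GiB card; the per-token size assumes a Qwen2.5-32B-style layout (64 layers, 8 KV heads, head_dim 128, fp16 cache), which is an assumption, not something read from this checkpoint:
# rough estimate: how many extra KV-cache tokens 0.95 -> 0.98 frees on a 64 GiB card
GIB = 1024 ** 3
kv_bytes_per_token = 2 * 64 * 8 * 128 * 2     # K+V * layers * kv_heads * head_dim * fp16 bytes, ~256 KiB (assumed)
extra_vram = (0.98 - 0.95) * 64 * GIB         # ~2 GiB more handed to vLLM
print(int(extra_vram / kv_bytes_per_token))   # roughly 7.8k extra tokens of KV cache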
vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.98 --max_model_len 68152
Great, it works!
Even though the max length is only 64k, it is just about usable. Then I suddenly realized this is the old 32B model, which does not meet the requirements.
(I got dizzy from all the debugging and loaded the wrong model, sigh.)
Browse the model list
https://c-1998971694380531714.ksai.scnet.cn:58043/v1/models
Output:
{"object":"list","data":[{"id":"/root/public_data/model/admin/qwq-32b-gptq-int8","object":"model","created":1765437578,"owned_by":"vllm","root":"/root/public_data/model/admin/qwq-32b-gptq-int8","parent":null,"max_model_len":65536,"permission":[{"id":"modelperm-cbda0c41725e4f73896a52ba97da588b","object":"model_permission","created":1765437578,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
So the URL is: https://c-1998971694380531714.ksai.scnet.cn:58043
https://c-1998971694380531714.ksai.scnet.cn:58043/v1/
Model name:
"/root/public_data/model/admin/qwq-32b-gptq-int8"
API key (token): hello
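A quick way to check that the endpoint, key, and model id actually work together; a minimal sketch assuming the openai Python package is installed:
# smoke test against the vLLM OpenAI-compatible endpoint above
from openai import OpenAI

client = OpenAI(
    base_url="https://c-1998971694380531714.ksai.scnet.cn:58043/v1",
    api_key="hello",
)
print([m.id for m in client.models.list().data])   # should list the qwq-32b-gptq-int8 path
resp = client.chat.completions.create(
    model="/root/public_data/model/admin/qwq-32b-gptq-int8",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)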
Test this model: Qwen3-30B-A3B-Instruct-2507-FP8
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-30B-A3B-Instruct-2507-FP8/main/Qwen3-30B-A3B-Instruct-2507-FP8
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-30B-A3B-Instruct-2507-FP8/main/Qwen3-30B-A3B-Instruct-2507-FP8
But this one is not suitable for this platform:
The model class Qwen3MoeForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py:79: UserWarning: expandable_segments not supported on this platform
Use this model: Mistral-Small-3.2-24B-Instruct-2506
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Mistral-Small-3.2-24B-Instruct-2506/main/Mistral-Small-3.2-24B-Instruct-2506
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Mistral-Small-3.2-24B-Instruct-2506/main/Mistral-Small-3.2-24B-Instruct-2506
Switch to this model: deepseek-coder-6.7b-instruct
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/deepseek-coder-6.7b-instruct/main/deepseek-coder-6.7b-instruct
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/deepseek-coder-6.7b-instruct/main/deepseek-coder-6.7b-instruct
Finally switched to Qwen3-30B-A3B
vLLM launch command:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve "/root/public_data/model/admin/Qwen/Qwen3-30B-A3B" --served-model-name "Qwen3-30B-A3B" --tensor-parallel-size 2 --max-model-len 65536 --dtype bfloat16 --enforce-eager --host 0.0.0.0 --port 8000 --disable-log-stats --enable-reasoning --reasoning-parser deepseek_r1
Launched the Qwen3-30B-A3B model on two DCU cards. Success!
Address:
https://c-1998971694380531714.ksai.scnet.cn:58043/v1/
Model name: Qwen3-30B-A3B
Both DCU compute cards running at full load:
============================ System Management Interface =============================
======================================================================================
DCU Temp AvgPwr Perf PwrCap VRAM% DCU% Mode
0 54.0C 226.0W manual 300.0W 93% 100.0% Normal
1 55.0C 224.0W manual 300.0W 92% 100.0% Normal
======================================================================================
=================================== End of SMI Log ===================================
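Since the launch command passed --enable-reasoning --reasoning-parser deepseek_r1, the server should split the model's thinking out of the final answer. A minimal sketch of a call (assuming the openai package, and that this vLLM build exposes the reasoning via a reasoning_content field):
# quick call against the dual-DCU Qwen3-30B-A3B server; reasoning_content is optional
from openai import OpenAI

client = OpenAI(
    base_url="https://c-1998971694380531714.ksai.scnet.cn:58043/v1",
    api_key="hello",
)
resp = client.chat.completions.create(
    model="Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
    max_tokens=1024,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))   # filled by the deepseek_r1 parser, if active
print("answer:", msg.content)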
Auto-coder verification
Configure the model
/models /add_provider name=qwen3_30b model_name=Qwen3-30B-A3B base_url=https://c-1998971694380531714.ksai.scnet.cn:58043/v1/ api_key=hello
Give it a task: build a translation plugin. "Build me a browser translation extension for Chrome and Edge. It must support translating a selected word or phrase and translating the whole page. The translation should be implemented by calling an AI model through the OpenAI API, and it should let the user configure several common models as well as add custom OpenAI-compatible models."
My goodness, after 90-plus rounds it finally got down to business!
conversation tokens: 37280 (conversation round: 103)
<inner_thinking>
Okay, let's see. The user wants me to help create a browser extension for Chrome and Edge that can translate selected words and entire web pages using OpenAI's AI models. They also want the ability to configure different models and customize compatible ones.
First, I need to figure out the steps required to build such an extension. Let me start by recalling what I know about browser extensions. They typically consist of HTML, CSS, and JavaScript files, along with a manifest file that defines their properties and permissions.
The main features needed are:
1. **Context Menu for Word Selection**: The extension should add a context menu item that allows users to select a word or phrase and trigger the translation.
2. **Page Translation**: A way to translate the entire webpage, possibly by injecting a script into the page or using a background script to process the content.
3. **OpenAI Integration**: Use OpenAI's API to perform translations. This would require an API key, which the user would need to provide. The extension would send the text to OpenAI's servers and receive the translated text.
4. **Model Configuration**: Allow users to select different models (like GPT-3.5, GPT-4) and possibly add custom models that are compatible with OpenAI's API.
5. **User Interface**: A popup or options page where users can configure the extension, including selecting the model, entering the API key, and adjusting other settings.
Next, I should consider the structure of the extension. The manifest file (manifest.json) will be crucial. It needs to specify the permissions, background scripts, content scripts, and context menu items.
For the context menu, I'll need to use the `chrome.contextMenus` API to create a menu item. When the user selects text and right-clicks, the extension can capture the selected text and send it to the translation service.
For translating the entire page, the extension might need to inject a content script that traverses the DOM, extracts text, and replaces it with the translated version. However, this could be complex due to dynamic content and different elements. Alternatively, the extension could use a background script to fetch the page content, translate it, and then inject the translated content back into the page.
Integrating OpenAI's API will require handling API requests. The extension will need to send the text to the OpenAI endpoint, which involves setting up the correct headers, including the API key, and parsing the response. It's important to handle errors, like invalid API keys or rate limits.
Model configuration can be stored in the extension's local storage, allowing users to select their preferred model through the options page. The options page can be a simple HTML file with form elements to input the API key and select the model.
Security is a concern here. Storing API keys in the extension's code isn't secure, so it's better to have users input their keys through the options page, which are then stored securely in the browser's storage.
Testing the extension will involve creating a sample webpage with different text elements to ensure that both word selection and page translation work as expected. Also, testing with different models and checking for any API rate limits or errors.
Potential challenges include handling different languages, ensuring the translation is accurate, and managing the performance of the extension, especially when translating large pages. Also, ensuring that the extension doesn't interfere with the normal functioning of the webpage.
I should also consider the user experience. The extension should be easy to install, configure, and use. Providing clear instructions and a user-friendly interface will be important.
Finally, packaging the extension for distribution on the Chrome Web Store and Edge Add-ons. This involves creating a ZIP file of the extension's files and following the respective store's guidelines.
In other words, this 30B model can hold its own!
Summary
By the end I had already forgotten which model I was using.
In any case, the 30B model is serviceable, but I still want to find a better one.
Debugging
vLLM startup error
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
uvloop.run(run_server(args))
File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 182, in build_async_engine_client_from_engine_args
engine_config = engine_args.create_engine_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 897, in create_engine_config
model_config = self.create_model_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 834, in create_model_config
return ModelConfig(
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 183, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 134, in get_config
file_exists(model,
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct'. Use `repo_type` argument if needed.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 134, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 897, in create_engine_config
model_config = self.create_model_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 834, in create_model_config
return ModelConfig(
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 183, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 134, in get_config
file_exists(model,
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct'. Use `repo_type` argument if needed.
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
Wrap the model path in quotes:
vllm serve "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct"
A new error appears
vLLM startup error: The checkpoint you are trying to load has model type `qwen3_moe` but Transformers does not recognize this architecture
ValueError: The checkpoint you are trying to load has model type `qwen3_moe` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1092, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 794, in __getitem__
raise KeyError(key)
KeyError: 'qwen3_moe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 134, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 897, in create_engine_config
model_config = self.create_model_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 834, in create_model_config
return ModelConfig(
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 183, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 173, in get_config
raise e
File "/opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 154, in get_config
config = AutoConfig.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1094, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `qwen3_moe` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
Just do what the message suggests:
pip install --upgrade transformers
Upgrade done:
Uninstalling transformers-4.49.0:
Successfully uninstalled transformers-4.49.0
Successfully installed hf-xet-1.2.0 huggingface-hub-0.36.0 tokenizers-0.22.1 transformers-4.57.3
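Before relaunching vLLM, a quick check that the upgraded transformers really recognizes the architecture (a sketch, using the same local model path as above):
# the old transformers raised KeyError: 'qwen3_moe'; the new one should resolve the config
import transformers
from transformers import AutoConfig

print(transformers.__version__)    # 4.57.3 after the upgrade
cfg = AutoConfig.from_pretrained(
    "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct"
)
print(cfg.model_type)              # expected: qwen3_moe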
Now there is a new error
vLLM error: Model architectures ['Qwen3MoeForCausalLM'] are not supported for now.
INFO 12-11 11:26:33 model_runner.py:1041] Starting to load model /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct...
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
self.model_executor = executor_class(
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
self.driver_worker.load_model()
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1043, in load_model
self.model = get_model(model_config=self.model_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 402, in load_model
model = _initialize_model(model_config, self.load_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 177, in _initialize_model
model_class, _ = get_model_architecture(model_config)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 67, in get_model_architecture
return ModelRegistry.resolve_model_cls(architectures)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/__init__.py", line 188, in resolve_model_cls
raise ValueError(
ValueError: Model architectures ['Qwen3MoeForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'TeleChat12BForCausalLM', 'SolarForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'UltravoxModel', 'MllamaForConditionalGeneration', 'BartModel', 'BartForConditionalGeneration']
I1211 11:26:34.671990 6097 ProcessGroupNCCL.cpp:1126] [PG 0 Rank 0] ProcessGroupNCCL destructor entered.
I1211 11:26:34.672050 6097 ProcessGroupNCCL.cpp:1111] [PG 0 Rank 0] Launching ProcessGroupNCCL abort asynchrounously.
I1211 11:26:34.672595 6097 ProcessGroupNCCL.cpp:1016] [PG 0 Rank 0] future is successfully executed for: ProcessGroup abort
I1211 11:26:34.672617 6097 ProcessGroupNCCL.cpp:1117] [PG 0 Rank 0] ProcessGroupNCCL aborts successfully.
I1211 11:26:34.672636 6097 ProcessGroupNCCL.cpp:1149] [PG 0 Rank 0] ProcessGroupNCCL watchdog thread joined.
I1211 11:26:34.672698 6097 ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL heart beat monitor thread joined.
Traceback (most recent call last):
File "/opt/conda/bin/vllm", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 165, in main
args.dispatch_function(args)
File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 37, in serve
uvloop.run(run_server(args))
File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
Back to this problem again; looks like vLLM itself needs the upgrade.
After upgrading, a new error: torch.OutOfMemoryError: HIP out of memory
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 816.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 58.57 GiB is allocated by PyTorch, and 4.71 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Add the --gpu_memory_utilization flag:
vllm serve "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct" --gpu_memory_utilization 0.90
Still failing. How do I get around this 816 MiB allocation?
return torch._C._nn.pad(input, pad, mode, value)
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 816.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 58.57 GiB is allocated by PyTorch, and 4.71 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Let's see if there is anything in here that can be dialed down:
INFO 12-11 13:02:09 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.4) with config: model='/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct', speculative_config=None, tokenizer='/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
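The engine config above was initialized with max_seq_len=262144, which is just the checkpoint's native context length being picked up by default; it can be confirmed straight from the model directory (a sketch; max_position_embeddings is the usual field name, though it can differ per model):
# see where the 262144 default comes from
import json

cfg_path = ("/public/home/ac7sc1ejvp/SothisAI/model/Aihub/"
            "Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct/config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg.get("max_position_embeddings"))   # expected 262144; cap it at serve time instead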
Try shrinking that, e.g. --max-model-len 128144 (the actual vLLM flag is --max-model-len; max_seq_len is just the engine's internal name for it).
Switched to dual cards, got an error:
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct'. Use `repo_type` argument if needed.
Set that aside for now.
It turns out two cards can start this model, and the image ships with a launch script out of the box.
Started Qwen3 30B on two cards, and auto-coder reported an error
╭─ 🔥 Error ───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Error: Connection failed. Error: Error code: 400 - {'object': 'error', 'message': "This model's maximum context │
│ length is 4096 tokens. However, you requested 30424 tokens (22328 in the messages, 8096 in the completion). Please │
│ reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}
The 400 error means the server was started with only a 4096-token context, so relaunch like this:
vllm serve "/root/public_data/model/admin/Qwen/Qwen3-30B-A3B" --served-model-name "Qwen3-30B-A3B" --tensor-parallel-size 2 --max-model-len 65536 --dtype bfloat16 --enforce-eager --host 0.0.0.0 --port 8000 --disable-log-stats --enable-reasoning --reasoning-parser deepseek_r1
After switching to this, it finally runs.