Try this model: Qwen3-Coder-30B-A3B-Instruct
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
And also this one: Qwen3-30B-A3B-Instruct-2507-FP8
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
For environment setup, see: https://blog.csdn.net/skywalk8163/article/details/155802417
Upgrading vLLM
Go to the SCNet help page "基础环境搭建-以常用框架vLLM为例 | 超算互联网" to get the vLLM download link from the 光合 developer community.
Download it locally:
wget https://download.sourcefind.cn:65024/directlink/4/vllm/DAS1.7/vllm-0.9.2+das.opt1.dtk25042-cp310-cp310-manylinux_2_28_x86_64.whl
Then install it:
pip install vllm-0.9.2+das.opt1.dtk25042-cp310-cp310-manylinux_2_28_x86_64.whl
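A quick sanity check after installing the wheel; this is a minimal sketch that only confirms the DAS build imports and the cards are visible (the expected version string is taken from the wheel's file name):
# confirm the freshly installed DAS build of vLLM is the one being imported
import vllm
import torch

print("vllm:", vllm.__version__)                  # should look like 0.9.2+das.opt1.dtk25042
print("HIP device available:", torch.cuda.is_available())   # ROCm/DTK is exposed via the cuda API
print("device count:", torch.cuda.device_count())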
Inference
vllm serve "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct"
Also try this: # Load and run the model:
vllm serve "lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit"
Try this one:
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-30B-A3B-Instruct-2507-FP8/main/Qwen3-30B-A3B-Instruct-2507-FP8
vllm serve "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-30B-A3B-Instruct-2507-FP8/main/Qwen3-30B-A3B-Instruct-2507-FP8"
No luck. Add these two flags and see what happens: --gpu_memory_utilization 0.95 --max_model_len 68152. It feels like there is light at the end of the tunnel.
vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.95 --max_model_len 68152
Still too large:
ValueError: The model's max seq len (68152) is larger than the maximum number of tokens that can be stored in KV cache (56448). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
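For context, the KV cache gets whatever VRAM is left after the weights are loaded, so its token cap scales with --gpu_memory_utilization. A back-of-the-envelope sketch of what the extra 3% buys on a 64 GiB card; the per-token size assumes a Qwen2.5-32B-style layout (64 layers, 8 KV heads, head_dim 128, fp16 cache), which is an assumption, not something read from this checkpoint:
# rough estimate: how many extra KV-cache tokens 0.95 -> 0.98 frees on a 64 GiB card
GIB = 1024 ** 3
kv_bytes_per_token = 2 * 64 * 8 * 128 * 2     # K+V * layers * kv_heads * head_dim * fp16 bytes, ~256 KiB (assumed)
extra_vram = (0.98 - 0.95) * 64 * GIB         # ~2 GiB more handed to vLLM
print(int(extra_vram / kv_bytes_per_token))   # roughly 7.8k extra tokens of KV cache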
vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.98 --max_model_len 68152
Great, it works!
Even though the max length is only 64k, it is just about usable. Then I suddenly realized this is the old 32B model, which does not meet the requirements.
(I got dizzy from all the debugging and loaded the wrong model, sigh.)
Browse the model list
https://c-1998971694380531714.ksai.scnet.cn:58043/v1/models
Output:
{"object":"list","data":[{"id":"/root/public_data/model/admin/qwq-32b-gptq-int8","object":"model","created":1765437578,"owned_by":"vllm","root":"/root/public_data/model/admin/qwq-32b-gptq-int8","parent":null,"max_model_len":65536,"permission":[{"id":"modelperm-cbda0c41725e4f73896a52ba97da588b","object":"model_permission","created":1765437578,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
So the URL is: https://c-1998971694380531714.ksai.scnet.cn:58043
https://c-1998971694380531714.ksai.scnet.cn:58043/v1/
Model name:
"/root/public_data/model/admin/qwq-32b-gptq-int8"
API key (token): hello
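A quick way to check that the endpoint, key, and model id actually work together; a minimal sketch assuming the openai Python package is installed:
# smoke test against the vLLM OpenAI-compatible endpoint above
from openai import OpenAI

client = OpenAI(
    base_url="https://c-1998971694380531714.ksai.scnet.cn:58043/v1",
    api_key="hello",
)
print([m.id for m in client.models.list().data])   # should list the qwq-32b-gptq-int8 path
resp = client.chat.completions.create(
    model="/root/public_data/model/admin/qwq-32b-gptq-int8",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)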
Test this model: Qwen3-30B-A3B-Instruct-2507-FP8
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-30B-A3B-Instruct-2507-FP8/main/Qwen3-30B-A3B-Instruct-2507-FP8
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-30B-A3B-Instruct-2507-FP8/main/Qwen3-30B-A3B-Instruct-2507-FP8
But this one is not suitable for this platform:
The model class Qwen3MoeForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py:79: UserWarning: expandable_segments not supported on this platform
Use this model: Mistral-Small-3.2-24B-Instruct-2506
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Mistral-Small-3.2-24B-Instruct-2506/main/Mistral-Small-3.2-24B-Instruct-2506
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Mistral-Small-3.2-24B-Instruct-2506/main/Mistral-Small-3.2-24B-Instruct-2506
Switch to this model: deepseek-coder-6.7b-instruct
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/deepseek-coder-6.7b-instruct/main/deepseek-coder-6.7b-instruct
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/deepseek-coder-6.7b-instruct/main/deepseek-coder-6.7b-instruct
Finally switched to Qwen3-30B-A3B
vLLM launch command:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve "/root/public_data/model/admin/Qwen/Qwen3-30B-A3B" --served-model-name "Qwen3-30B-A3B" --tensor-parallel-size 2 --max-model-len 65536 --dtype bfloat16 --enforce-eager --host 0.0.0.0 --port 8000 --disable-log-stats --enable-reasoning --reasoning-parser deepseek_r1
Launched the Qwen3-30B-A3B model on two DCU cards. Success!
Address:
https://c-1998971694380531714.ksai.scnet.cn:58043/v1/
Model name: Qwen3-30B-A3B
Both DCU compute cards running at full load:
============================ System Management Interface =============================
======================================================================================
DCU Temp AvgPwr Perf PwrCap VRAM% DCU% Mode
0 54.0C 226.0W manual 300.0W 93% 100.0% Normal
1 55.0C 224.0W manual 300.0W 92% 100.0% Normal
======================================================================================
=================================== End of SMI Log ===================================
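Since the launch command passed --enable-reasoning --reasoning-parser deepseek_r1, the server should split the model's thinking out of the final answer. A minimal sketch of a call (assuming the openai package, and that this vLLM build exposes the reasoning via a reasoning_content field):
# quick call against the dual-DCU Qwen3-30B-A3B server; reasoning_content is optional
from openai import OpenAI

client = OpenAI(
    base_url="https://c-1998971694380531714.ksai.scnet.cn:58043/v1",
    api_key="hello",
)
resp = client.chat.completions.create(
    model="Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
    max_tokens=1024,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))   # filled by the deepseek_r1 parser, if active
print("answer:", msg.content)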
Auto-coder verification
Configure the model
/models /add_provider name=qwen3_30b model_name=Qwen3-30B-A3B base_url=https://c-1998971694380531714.ksai.scnet.cn:58043/v1/ api_key=hello
Give it a task: build a translation plugin. "Build me a browser translation extension for Chrome and Edge. It must support translating a selected word or phrase and translating the whole page. The translation should be implemented by calling an AI model through the OpenAI API, and it should let the user configure several common models as well as add custom OpenAI-compatible models."
My goodness, after 90-plus rounds it finally got down to business!
conversation tokens: 37280 (conversation round: 103)
<inner_thinking>
Okay, let's see. The user wants me to help create a browser extension for Chrome and Edge that can translate selected words and entire web pages using OpenAI's AI models. They also want the ability to configure different models and customize compatible ones.
First, I need to figure out the steps required to build such an extension. Let me start by recalling what I know about browser extensions. They typically consist of HTML, CSS, and JavaScript files, along with a manifest file that defines their properties and permissions.
The main features needed are:
1. **Context Menu for Word Selection**: The extension should add a context menu item that allows users to select a word or phrase and trigger the translation.
2. **Page Translation**: A way to translate the entire webpage, possibly by injecting a script into the page or using a background script to process the content.
3. **OpenAI Integration**: Use OpenAI's API to perform translations. This would require an API key, which the user would need to provide. The extension would send the text to OpenAI's servers and receive the translated text.
4. **Model Configuration**: Allow users to select different models (like GPT-3.5, GPT-4) and possibly add custom models that are compatible with OpenAI's API.
5. **User Interface**: A popup or options page where users can configure the extension, including selecting the model, entering the API key, and adjusting other settings.
Next, I should consider the structure of the extension. The manifest file (manifest.json) will be crucial. It needs to specify the permissions, background scripts, content scripts, and context menu items.
For the context menu, I'll need to use the `chrome.contextMenus` API to create a menu item. When the user selects text and right-clicks, the extension can capture the selected text and send it to the translation service.
For translating the entire page, the extension might need to inject a content script that traverses the DOM, extracts text, and replaces it with the translated version. However, this could be complex due to dynamic content and different elements. Alternatively, the extension could use a background script to fetch the page content, translate it, and then inject the translated content back into the page.
Integrating OpenAI's API will require handling API requests. The extension will need to send the text to the OpenAI endpoint, which involves setting up the correct headers, including the API key, and parsing the response. It's important to handle errors, like invalid API keys or rate limits.
Model configuration can be stored in the extension's local storage, allowing users to select their preferred model through the options page. The options page can be a simple HTML file with form elements to input the API key and select the model.
Security is a concern here. Storing API keys in the extension's code isn't secure, so it's better to have users input their keys through the options page, which are then stored securely in the browser's storage.
Testing the extension will involve creating a sample webpage with different text elements to ensure that both word selection and page translation work as expected. Also, testing with different models and checking for any API rate limits or errors.
Potential challenges include handling different languages, ensuring the translation is accurate, and managing the performance of the extension, especially when translating large pages. Also, ensuring that the extension doesn't interfere with the normal functioning of the webpage.
I should also consider the user experience. The extension should be easy to install, configure, and use. Providing clear instructions and a user-friendly interface will be important.
Finally, packaging the extension for distribution on the Chrome Web Store and Edge Add-ons. This involves creating a ZIP file of the extension's files and following the respective store's guidelines.
In other words, this 30B model can hold its own!
Summary
By the end I had already forgotten which model I was using.
In any case, the 30B model is serviceable, but I still want to find a better one.
Debugging
vLLM startup error
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
uvloop.run(run_server(args))
File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 182, in build_async_engine_client_from_engine_args
engine_config = engine_args.create_engine_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 897, in create_engine_config
model_config = self.create_model_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 834, in create_model_config
return ModelConfig(
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 183, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 134, in get_config
file_exists(model,
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct'. Use `repo_type` argument if needed.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 134, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 897, in create_engine_config
model_config = self.create_model_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 834, in create_model_config
return ModelConfig(
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 183, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 134, in get_config
file_exists(model,
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct'. Use `repo_type` argument if needed.
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
Wrap the model path in quotes:
vllm serve "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct"
A new error appears
vLLM startup error: The checkpoint you are trying to load has model type `qwen3_moe` but Transformers does not recognize this architecture
ValueError: The checkpoint you are trying to load has model type `qwen3_moe` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1092, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 794, in __getitem__
raise KeyError(key)
KeyError: 'qwen3_moe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 134, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 897, in create_engine_config
model_config = self.create_model_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 834, in create_model_config
return ModelConfig(
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 183, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 173, in get_config
raise e
File "/opt/conda/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 154, in get_config
config = AutoConfig.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1094, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `qwen3_moe` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
Just do what the message suggests:
pip install --upgrade transformers
Upgrade done:
Uninstalling transformers-4.49.0:
Successfully uninstalled transformers-4.49.0
Successfully installed hf-xet-1.2.0 huggingface-hub-0.36.0 tokenizers-0.22.1 transformers-4.57.3
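Before relaunching vLLM, a quick check that the upgraded transformers really recognizes the architecture (a sketch, using the same local model path as above):
# the old transformers raised KeyError: 'qwen3_moe'; the new one should resolve the config
import transformers
from transformers import AutoConfig

print(transformers.__version__)    # 4.57.3 after the upgrade
cfg = AutoConfig.from_pretrained(
    "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct"
)
print(cfg.model_type)              # expected: qwen3_moe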
Now there is a new error
vLLM error: Model architectures ['Qwen3MoeForCausalLM'] are not supported for now.
INFO 12-11 11:26:33 model_runner.py:1041] Starting to load model /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct...
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
self.model_executor = executor_class(
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
self.driver_worker.load_model()
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1043, in load_model
self.model = get_model(model_config=self.model_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 402, in load_model
model = _initialize_model(model_config, self.load_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 177, in _initialize_model
model_class, _ = get_model_architecture(model_config)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 67, in get_model_architecture
return ModelRegistry.resolve_model_cls(architectures)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/__init__.py", line 188, in resolve_model_cls
raise ValueError(
ValueError: Model architectures ['Qwen3MoeForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'TeleChat12BForCausalLM', 'SolarForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'UltravoxModel', 'MllamaForConditionalGeneration', 'BartModel', 'BartForConditionalGeneration']
I1211 11:26:34.671990 6097 ProcessGroupNCCL.cpp:1126] [PG 0 Rank 0] ProcessGroupNCCL destructor entered.
I1211 11:26:34.672050 6097 ProcessGroupNCCL.cpp:1111] [PG 0 Rank 0] Launching ProcessGroupNCCL abort asynchrounously.
I1211 11:26:34.672595 6097 ProcessGroupNCCL.cpp:1016] [PG 0 Rank 0] future is successfully executed for: ProcessGroup abort
I1211 11:26:34.672617 6097 ProcessGroupNCCL.cpp:1117] [PG 0 Rank 0] ProcessGroupNCCL aborts successfully.
I1211 11:26:34.672636 6097 ProcessGroupNCCL.cpp:1149] [PG 0 Rank 0] ProcessGroupNCCL watchdog thread joined.
I1211 11:26:34.672698 6097 ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL heart beat monitor thread joined.
Traceback (most recent call last):
File "/opt/conda/bin/vllm", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 165, in main
args.dispatch_function(args)
File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 37, in serve
uvloop.run(run_server(args))
File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
Back to this problem again; looks like vLLM itself needs the upgrade.
After upgrading, a new error: torch.OutOfMemoryError: HIP out of memory
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 816.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 58.57 GiB is allocated by PyTorch, and 4.71 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Add the --gpu_memory_utilization flag:
vllm serve "/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct" --gpu_memory_utilization 0.90
Still failing. How do I get around this 816 MiB allocation?
return torch._C._nn.pad(input, pad, mode, value)
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 816.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 58.57 GiB is allocated by PyTorch, and 4.71 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Let's see if there is anything in here that can be dialed down:
INFO 12-11 13:02:09 [llm_engine.py:244] Initializing a V0 LLM engine (v0.8.4) with config: model='/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct', speculative_config=None, tokenizer='/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
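The engine config above was initialized with max_seq_len=262144, which is just the checkpoint's native context length being picked up by default; it can be confirmed straight from the model directory (a sketch; max_position_embeddings is the usual field name, though it can differ per model):
# see where the 262144 default comes from
import json

cfg_path = ("/public/home/ac7sc1ejvp/SothisAI/model/Aihub/"
            "Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct/config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg.get("max_position_embeddings"))   # expected 262144; cap it at serve time instead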
Try shrinking that, e.g. --max-model-len 128144 (the actual vLLM flag is --max-model-len; max_seq_len is just the engine's internal name for it).
Switched to dual cards, got an error:
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct'. Use `repo_type` argument if needed.
Set that aside for now.
It turns out two cards can start this model, and the image ships with a launch script out of the box.
Started Qwen3 30B on two cards, and auto-coder reported an error
╭─ 🔥 Error ───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Error: Connection failed. Error: Error code: 400 - {'object': 'error', 'message': "This model's maximum context │
│ length is 4096 tokens. However, you requested 30424 tokens (22328 in the messages, 8096 in the completion). Please │
│ reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}
The 400 error means the server was started with only a 4096-token context, so relaunch like this:
vllm serve "/root/public_data/model/admin/Qwen/Qwen3-30B-A3B" --served-model-name "Qwen3-30B-A3B" --tensor-parallel-size 2 --max-model-len 65536 --dtype bfloat16 --enforce-eager --host 0.0.0.0 --port 8000 --disable-log-stats --enable-reasoning --reasoning-parser deepseek_r1
After switching to this, it finally runs.