Salvaging the Huawei Ascend 310P: LLM Inference Serving

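Note: the 310P does not support bf16 or W4A4.

Bandwidth is 200 GB/s. The dual-chip version, the 300I Duo, comes in 48 GB and 96 GB variants.

None of the Ascend cards currently on the market support FP8.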

Final result after performance tuning:

Qwen3-8B-W8A8

TPS: 15 tokens/s

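According to the Ascend PyTorch graph-mode usage guide and the vllm-ascend source code, graph mode has two modes: reduce-overhead and max-autotune. reduce-overhead is supported only on the 910B and 910C, and vllm-ascend hardcodes reduce-overhead mode.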

MindIE + Qwen3-8B-W8A8

1. Launch the container on the host

  docker run -it -d --net=host --shm-size=16g \
    --name mindie-qwen3-8b-310p \
    -w /workspace/MindIE-LLM/examples/atb_models \
    --device=/dev/davinci0:rwm \
    --device=/dev/davinci1:rwm \
    --device=/dev/davinci2:rwm \
    --device=/dev/davinci3:rwm \
    --device=/dev/davinci_manager:rwm \
    --device=/dev/hisi_hdc:rwm \
    --device=/dev/devmm_svm:rwm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/dcmi:/usr/local/dcmi:ro \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /Users/zhaojiacheng/repos/MindIE-LLM:/workspace/MindIE-LLM \
    -v /home/s_zhaojiacheng:/home/s_zhaojiacheng \
    swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:3.0.0b2-300I-Duo-py311-openeuler24.03-lts \
    bash
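
  The command above hardcodes davinci0 through davinci3. On a host that exposes a different number of NPU device nodes, the flags could be generated rather than typed out. The following is a minimal sketch, assuming the device nodes are named davinciN under /dev; davinci_device_flags is a hypothetical helper name, not part of MindIE or Docker.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one --device flag per davinciN node in a directory.
# The directory defaults to /dev; another path is only useful for dry runs.
davinci_device_flags() {
  local dev_dir="${1:-/dev}"
  local node
  for node in "$dev_dir"/davinci[0-9]*; do
    # When nothing matches, the glob stays literal, so skip it.
    [ -e "$node" ] || continue
    printf '%s\n' "--device=$node:rwm"
  done
}
```

  The output can be spliced into the docker run line with $(davinci_device_flags) in place of the four hardcoded davinciN flags. The davinci_manager, hisi_hdc, and devmm_svm devices are deliberately not matched and still need to be passed explicitly. Running npu-smi info inside the container is one way to confirm which chips are visible.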

  Enter the container:

  docker exec -it mindie-qwen3-8b-310p bash

  2. Prepare the environment inside the container

  cd /workspace/MindIE-LLM
  scripts/qwen3_8b_310p_w8a8sc.sh prepare-env

  3. Download the model from ModelScope

  Recommended: download directly into a normal directory, not only into the default cache.

  mkdir -p /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s

  modelscope download \
    --model Eco-Tech/Qwen3-8B-w8a8s-310 \
    --local_dir /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s

  If you already downloaded it earlier into the default cache with:

  modelscope download --model Eco-Tech/Qwen3-8B-w8a8s-310

  then flatten it into a real directory first:

  mkdir -p /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s
  cp -aL \
    /home/s_zhaojiacheng/.cache/modelscope/hub/models/Eco-Tech/Qwen3-8B-w8a8s-310/. \
    /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s/

  Check the files exist:

  ls /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s
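
  A plain ls does not show whether the directory was really flattened. The sketch below checks a little more. It assumes the model directory contains a config.json and at least one *.safetensors file, which is an assumption about this particular ModelScope repository and not something confirmed here; check_model_dir is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a downloaded model directory.
# Assumed layout: config.json plus at least one *.safetensors file.
check_model_dir() {
  local dir="$1"
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
  [ -f "$dir/config.json" ] || { echo "missing config.json in $dir" >&2; return 1; }
  # Any remaining symlink suggests the cache layout was not flattened.
  if [ -n "$(find "$dir" -type l | head -n 1)" ]; then
    echo "symlinks remain in $dir" >&2
    return 1
  fi
  # Require at least one weight file (extension is an assumption).
  if ! ls "$dir"/*.safetensors >/dev/null 2>&1; then
    echo "no *.safetensors files in $dir" >&2
    return 1
  fi
  return 0
}
```

  For example, check_model_dir /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s returns 0 only when all three checks pass, so it can gate the compress step in a script.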

  4. Compress W8A8S into W8A8SC

  cd /workspace/MindIE-LLM

  scripts/qwen3_8b_310p_w8a8sc.sh compress \
    --w8a8s-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s \
    --w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc

  After it finishes, check the output directory exists:

  ls /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc

  5. Start the OpenAI-compatible server

  cd /workspace/MindIE-LLM

  scripts/qwen3_8b_310p_w8a8sc.sh serve \
    --w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc \
    --model-name qwen3-8b-w8a8sc \
    --port 1025

  This should start mindie_llm_server and expose the OpenAI-compatible endpoint on 127.0.0.1:1025.
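
  Loading the weights can take a while, so the endpoint may not answer immediately. A minimal sketch of a readiness poll follows, assuming the /v1/models route used in the next step returns HTTP 200 once the server is up; wait_for_endpoint is a hypothetical helper name, and the default attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a URL until it answers successfully or attempts run out.
# Usage: wait_for_endpoint URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_endpoint() {
  local url="$1"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each probe.
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

  In a second shell inside the container, wait_for_endpoint http://127.0.0.1:1025/v1/models && echo ready blocks until the server responds or the attempts are exhausted.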

  6. Verify the service

  List models:

  curl http://127.0.0.1:1025/v1/models

  Expected model id:

  qwen3-8b-w8a8sc

  Test one inference request:

  curl http://127.0.0.1:1025/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "qwen3-8b-w8a8sc",
      "messages": [
        {"role": "user", "content": "What is deep learning?"}
      ],
      "max_tokens": 128,
      "stream": false
    }'
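
  The same request can give a rough tokens-per-second figure. The sketch below assumes the response carries an OpenAI-style usage.completion_tokens field, which is not confirmed here for mindie_llm_server. Both function names are hypothetical. The timing covers the whole request, prefill included, so the figure will sit below the pure decode rate.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a response body on stdin and print completion_tokens.
# Assumes an OpenAI-style "usage" object is present in the JSON.
extract_completion_tokens() {
  sed -n 's/.*"completion_tokens"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: divide a token count by elapsed seconds, two decimals.
tokens_per_second() {
  awk -v t="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.2f\n", t / s }'
}

# Example use against a running server (not executed here):
#   start=$(date +%s)
#   body=$(curl -s http://127.0.0.1:1025/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "qwen3-8b-w8a8sc", "messages": [{"role": "user", "content": "What is deep learning?"}], "max_tokens": 128, "stream": false}')
#   end=$(date +%s)
#   tokens_per_second "$(printf '%s' "$body" | extract_completion_tokens)" "$((end - start))"
```

  Whole-second timestamps are coarse for short generations, so a larger max_tokens gives a steadier number.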

  Short version

  If you want the shortest working sequence inside the container:

  cd /workspace/MindIE-LLM

  scripts/qwen3_8b_310p_w8a8sc.sh prepare-env

  modelscope download \
    --model Eco-Tech/Qwen3-8B-w8a8s-310 \
    --local_dir /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s

  scripts/qwen3_8b_310p_w8a8sc.sh compress \
    --w8a8s-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s \
    --w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc

  scripts/qwen3_8b_310p_w8a8sc.sh serve \
    --w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc \
    --model-name qwen3-8b-w8a8sc \
    --port 1025

  Then test:

  curl http://127.0.0.1:1025/v1/models

  One important detail: for this single-310P flow, do not try to serve Qwen3-8B-w8a8s-310 directly. The supported path is download W8A8S -> compress to W8A8SC -> serve W8A8SC.
