Huawei Kunpeng (ARM) + Ascend 310P + openEuler + vllm-ascend:v0.20.2rc1-310p + Qwen3-32B

Conclusion first:

Working combination:

Weights: https://www.modelscope.cn/models/Eco-Tech/Qwen3-32B-w8a8sc-310-vllm

Inference image: http://quay.io/ascend/vllm-ascend:v0.20.2rc1-310p

Reference documentation:

https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#

Inference images that failed:

https://www.hiascend.com/developer/ascendhub/detail/44d97ca10b0845b582336f2161d1c3a8

http://quay.io/ascend/vllm-ascend:v0.18.0-310p

root@localhost Qwen3-32B-w8a8sc-310-vllm# lscpu

Architecture: aarch64

CPU op-mode(s): 64-bit

Byte Order: Little Endian

CPU(s): 32

On-line CPU(s) list: 0-31

Vendor ID: HiSilicon

BIOS Vendor ID: HiSilicon

Model name: Kunpeng-920

BIOS Model name: HUAWEI Kunpeng 920S

Model: 0

Thread(s) per core: 1

Core(s) per socket: 32

Socket(s): 1

Stepping: 0x1

Frequency boost: disabled

CPU max MHz: 2600.0000

CPU min MHz: 200.0000

BogoMIPS: 200.00

Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs

Caches (sum of all):

L1d: 2 MiB (32 instances)

L1i: 2 MiB (32 instances)

L2: 16 MiB (32 instances)

L3: 32 MiB (1 instance)

NUMA:

NUMA node(s): 1

NUMA node0 CPU(s): 0-31

Vulnerabilities:

Gather data sampling: Not affected

Itlb multihit: Not affected

L1tf: Not affected

Mds: Not affected

Meltdown: Not affected

Mmio stale data: Not affected

Retbleed: Not affected

Spec rstack overflow: Not affected

Spec store bypass: Not affected

Spectre v1: Mitigation; __user pointer sanitization

Spectre v2: Not affected

Srbds: Not affected

Tsx async abort: Not affected

root@localhost Qwen3-32B-w8a8sc-310-vllm# lsmem

RANGE SIZE STATE REMOVABLE BLOCK

0x0000000000000000-0x000000007fffffff 2G online yes 0-1

0x0000002080000000-0x0000003fffffffff 126G online yes 130-255

Memory block size: 1G

Total online memory: 128G

Total offline memory: 0B

root@localhost Qwen3-32B-w8a8sc-310-vllm# free -h

               total        used        free      shared  buff/cache   available
Mem:           124Gi        36Gi       4.0Gi       258Mi        85Gi        88Gi
Swap:          127Gi       465Mi       127Gi

root@localhost Qwen3-32B-w8a8sc-310-vllm#

root@localhost Qwen3-32B-w8a8sc-310-vllm# cat /etc/os-release

NAME="openEuler"

VERSION="22.03 (LTS-SP4)"

ID="openEuler"

VERSION_ID="22.03"

PRETTY_NAME="openEuler 22.03 (LTS-SP4)"

ANSI_COLOR="0;31"

root@localhost Qwen3-32B-w8a8sc-310-vllm#

root@localhost Qwen3-32B-w8a8sc-310-vllm# cat /etc/docker/daemon.json

{
    "runtimes": {
        "ascend": {
            "path": "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime",
            "runtimeArgs": []
        }
    },
    "data-root": "/data/docker/",
    "default-runtime": "ascend",
    "registry-mirrors": [
        "https://registry.docker-cn.com",
        "https://docker.m.daocloud.io",
        "https://dockerproxy.com",
        "https://docker.mirrors.ustc.edu.cn",
        "https://docker.nju.edu.cn",
        "https://iju9kaj2.mirror.aliyuncs.com",
        "http://hub-mirror.c.163.com",
        "https://cr.console.aliyun.com",
        "https://hub.docker.com",
        "http://mirrors.ustc.edu.cn",
        "https://docker.xuanyuan.me"
    ]
}
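The two settings in this daemon.json that matter for the NPU containers are the `ascend` entry under `runtimes` and `default-runtime`. A minimal sketch of a sanity check over those keys follows; the function name `check_ascend_runtime` is made up for illustration, and the sample omits the registry mirrors for brevity.

```python
import json


def check_ascend_runtime(daemon_config):
    """Return a list of problems found in a parsed daemon.json dict."""
    problems = []
    runtimes = daemon_config.get("runtimes", {})
    if "ascend" not in runtimes:
        problems.append("no 'ascend' entry under 'runtimes'")
    elif not runtimes["ascend"].get("path"):
        problems.append("'ascend' runtime has no 'path'")
    if daemon_config.get("default-runtime") != "ascend":
        problems.append("'default-runtime' is not 'ascend'")
    return problems


# Trimmed copy of the configuration shown above.
sample = json.loads("""
{
  "runtimes": {
    "ascend": {
      "path": "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime",
      "runtimeArgs": []
    }
  },
  "data-root": "/data/docker/",
  "default-runtime": "ascend"
}
""")

print(check_ascend_runtime(sample))  # prints []
```

On the host itself the same function could be fed `json.load(open("/etc/docker/daemon.json"))` instead of the inline sample.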

root@localhost Qwen3-32B-w8a8sc-310-vllm#

root@localhost Qwen3-32B-w8a8sc-310-vllm# lsblk -do name,size,rota

NAME SIZE ROTA

sda 3.6T 1

nvme0n1 465.8G 0

root@localhost Qwen3-32B-w8a8sc-310-vllm# lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS

sda 8:0 0 3.6T 0 disk /data

nvme0n1 259:0 0 465.8G 0 disk

├─nvme0n1p1 259:1 0 600M 0 part /boot/efi

├─nvme0n1p2 259:2 0 1G 0 part /boot

├─nvme0n1p3 259:3 0 128G 0 part SWAP

└─nvme0n1p4 259:4 0 336.2G 0 part /

root@localhost Qwen3-32B-w8a8sc-310-vllm# df -hT

Filesystem Type Size Used Avail Use% Mounted on

devtmpfs devtmpfs 4.0M 0 4.0M 0% /dev

tmpfs tmpfs 63G 8.0K 63G 1% /dev/shm

tmpfs tmpfs 25G 28M 25G 1% /run

tmpfs tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup

/dev/nvme0n1p4 ext4 330G 37G 277G 12% /

/dev/sda ext4 3.6T 630G 2.8T 19% /data

tmpfs tmpfs 63G 508K 63G 1% /tmp

/dev/nvme0n1p2 ext4 974M 253M 655M 28% /boot

/dev/nvme0n1p1 vfat 599M 6.5M 593M 2% /boot/efi

tmpfs tmpfs 13G 36K 13G 1% /run/user/1000

tmpfs tmpfs 13G 0 13G 0% /run/user/0

overlay overlay 3.6T 630G 2.8T 19% /data/docker/overlay2/3cec6fec0ecc250c52a46f08b8cc0fddd6910ef5fa0be1c975ead9c3820b927d/merged

shm tmpfs 10G 700K 10G 1% /data/docker/containers/ca97df9cd1f57f858e1799b265d9fe48ec1271332b7fc83909ff2a82d9db7058/mounts/shm

root@localhost Qwen3-32B-w8a8sc-310-vllm# pwd

/data/models/Qwen3-32B-w8a8sc-310-vllm

root@localhost Qwen3-32B-w8a8sc-310-vllm# ll

total 12

-rw-r--r--. 1 root root 1577 Jun 16 09:47 docker-compose.yml

-rw-r--r--. 1 root root 2178 Jun 15 17:38 README.md

drwxr-xr-x. 3 root root 4096 Jun 15 17:38 TP4

root@localhost Qwen3-32B-w8a8sc-310-vllm# cat docker-compose.yml

services:
  qwen3-32b:
    container_name: qwen3-32b
    image: quay.io/ascend/vllm-ascend:v0.20.2rc1-310p
    restart: unless-stopped
    devices:
      - /dev/davinci0:/dev/davinci0
      - /dev/davinci1:/dev/davinci1
      - /dev/davinci2:/dev/davinci2
      - /dev/davinci3:/dev/davinci3
      - /dev/davinci_manager:/dev/davinci_manager
      - /dev/devmm_svm:/dev/devmm_svm
      - /dev/hisi_hdc:/dev/hisi_hdc
    shm_size: 10g
    ports:
      - "8080:8080"
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
      - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
      - /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info
      - /etc/ascend_install.info:/etc/ascend_install.info
      - /data/models:/root/.cache
    environment:
      - ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
      - TZ=Asia/Shanghai
    entrypoint: ["/bin/bash", "-c"]
    command:
      - >
        vllm serve /root/.cache/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4
        --host 0.0.0.0
        --port 8080
        --tensor-parallel-size 4
        --gpu_memory_utilization 0.90
        --max_num_seqs 32
        --served_model_name qwen
        --dtype float16
        --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}'
        --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}'
        --quantization ascend
        --max_model_len 32768
        --no-enable-prefix-caching
        --load_format sharded_state
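The `command` entry passes one long string to `/bin/bash -c`, and two of its flags carry JSON inside single quotes, which is easy to break when editing. As a rough sketch (not part of the original setup), the snippet below splits the command the way a POSIX shell would and checks that both JSON arguments parse. The helper name `json_flag_values` is hypothetical, and the command string is retyped from the compose file on the assumption that the YAML folded scalar joins its lines with single spaces.

```python
import json
import shlex

# The command as bash would receive it once YAML has folded the lines together.
COMMAND = (
    "vllm serve /root/.cache/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4 "
    "--host 0.0.0.0 --port 8080 --tensor-parallel-size 4 "
    "--gpu_memory_utilization 0.90 --max_num_seqs 32 "
    "--served_model_name qwen --dtype float16 "
    """--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' """
    """--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' """
    "--quantization ascend --max_model_len 32768 "
    "--no-enable-prefix-caching --load_format sharded_state"
)


def json_flag_values(command, flags):
    """Return {flag: parsed JSON} for each listed flag found in the command."""
    tokens = shlex.split(command)  # POSIX-style word splitting, honours single quotes
    parsed = {}
    for i, token in enumerate(tokens[:-1]):
        if token in flags:
            # json.loads raises ValueError if the quoted argument is not valid JSON
            parsed[token] = json.loads(tokens[i + 1])
    return parsed


values = json_flag_values(COMMAND, {"--additional-config", "--compilation-config"})
for flag, value in values.items():
    print(flag, "->", value)
```

This only validates quoting and JSON syntax; whether vLLM accepts the keys themselves is something only the container log can confirm.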

root@localhost Qwen3-32B-w8a8sc-310-vllm# npu-smi info

+--------------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                                 Version: 24.1.rc2                                     |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
+===============================+=================+======================================================+
| 0       310P3                 | OK              | NA           44                13423 / 13423         |
| 0       0                     | 0000:01:00.0    | 0            29140/ 44280                            |
+-------------------------------+-----------------+------------------------------------------------------+
| 0       310P3                 | OK              | NA           43                13393 / 13393         |
| 1       1                     | 0000:01:00.0    | 0            28303/ 43693                            |
+===============================+=================+======================================================+
| 64      310P3                 | OK              | NA           45                13393 / 13393         |
| 0       2                     | 0000:02:00.0    | 0            28765/ 44280                            |
+-------------------------------+-----------------+------------------------------------------------------+
| 64      310P3                 | OK              | NA           43                13393 / 13393         |
| 1       3                     | 0000:02:00.0    | 0            28547/ 43693                            |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Chip                  | Process id      | Process name             | Process memory(MB)        |
+===============================+=================+======================================================+
| 0       0                     | 553624          | VLLMWorker_TP            | 97                        |
| 0       0                     | 553605          | VLLMWorker_TP            | 98                        |
| 0       0                     | 553598          | VLLMWorker_TP            | 27274                     |
| 0       0                     | 553855          | VLLMWorker_TP            | 97                        |
| 0       1                     | 553605          | VLLMWorker_TP            | 27274                     |
+===============================+=================+======================================================+
| 64      0                     | 553624          | VLLMWorker_TP            | 27274                     |
| 64      1                     | 553855          | VLLMWorker_TP            | 27274                     |
+===============================+=================+======================================================+
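The Memory-Usage(MB) column in the npu-smi output gives used / total memory per chip while the service is running. A small sketch of the arithmetic, with the four value pairs copied by hand from that output (the function name `headroom` is made up for illustration):

```python
# (NPU id, chip id) -> (used MB, total MB), copied from the Memory-Usage(MB) column
MEMORY_USAGE_MB = {
    (0, 0): (29140, 44280),
    (0, 1): (28303, 43693),
    (64, 0): (28765, 44280),
    (64, 1): (28547, 43693),
}


def headroom(usage):
    """Return {(npu, chip): (free MB, used fraction)} for each chip."""
    return {
        chip: (total - used, used / total)
        for chip, (used, total) in usage.items()
    }


for (npu, chip), (free_mb, fraction) in headroom(MEMORY_USAGE_MB).items():
    print(f"NPU {npu} chip {chip}: {fraction:.1%} used, {free_mb} MB free")
```

On these numbers each chip sits at roughly 65% of its memory, with about 15 GB free per chip.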

root@localhost Qwen3-32B-w8a8sc-310-vllm# du -sh *

4.0K docker-compose.yml

4.0K README.md

32G TP4

root@localhost Qwen3-32B-w8a8sc-310-vllm# cd TP4

root@localhost TP4# ll

total 4

drwxr-xr-x. 2 root root 4096 Jun 16 09:02 Qwen3-32B-w8a8sc-310-vllm-tp4

root@localhost TP4# cd Qwen3-32B-w8a8sc-310-vllm-tp4/

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4# ll

total 33414612

-rw-r--r--. 1 root root 727 Jun 15 17:38 config.json

-rw-r--r--. 1 root root 73 Jun 15 17:38 configuration.json

-rw-r--r--. 1 root root 35014 Jun 16 09:02 fusion_result.json

-rw-r--r--. 1 root root 239 Jun 15 17:38 generation_config.json

-rw-r--r--. 1 root root 8551452512 Jun 15 20:42 model-rank-0-part-0.safetensors

-rw-r--r--. 1 root root 8550975968 Jun 15 20:47 model-rank-1-part-0.safetensors

-rw-r--r--. 1 root root 8551050464 Jun 15 21:17 model-rank-2-part-0.safetensors

-rw-r--r--. 1 root root 8548654816 Jun 15 20:16 model-rank-3-part-0.safetensors

-rw-r--r--. 1 root root 125759 Jun 15 17:38 quant_model_description.json

-rw-r--r--. 1 root root 1246 Jun 15 17:38 README.md

-rw-r--r--. 1 root root 9732 Jun 15 17:38 tokenizer_config.json

-rw-r--r--. 1 root root 11422654 Jun 15 17:38 tokenizer.json

-rw-r--r--. 1 root root 2776833 Jun 15 17:38 vocab.json

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4# du -sh .

32G .
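As a quick cross-check of the `du -sh` figure, the four shard sizes from the `ll` listing can be summed directly. This is only arithmetic on the byte counts shown above:

```python
# Byte sizes of the four tensor-parallel shards, copied from the listing above.
SHARD_BYTES = {
    "model-rank-0-part-0.safetensors": 8551452512,
    "model-rank-1-part-0.safetensors": 8550975968,
    "model-rank-2-part-0.safetensors": 8551050464,
    "model-rank-3-part-0.safetensors": 8548654816,
}

GIB = 1024 ** 3  # bytes per GiB

total_bytes = sum(SHARD_BYTES.values())
for name, size in SHARD_BYTES.items():
    print(f"{name}: {size / GIB:.2f} GiB")
print(f"total: {total_bytes / GIB:.2f} GiB")
```

The total comes to about 31.85 GiB, which matches the 32G reported by `du`, with each rank loading a shard of just under 8 GiB.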

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4# docker images|grep 310

quay.io/ascend/vllm-ascend v0.20.2rc1-310p 1ba018a56294 12 days ago 15.8GB

quay.io/ascend/vllm-ascend v0.18.0-310p e17f09537569 6 weeks ago 14.3GB

swr.cn-south-1.myhuaweicloud.com/ascendhub/bge-large-zh-v1.5 7.1.T9-310p-aarch64 bf32177dea37 10 months ago 18GB

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4#

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4# docker ps

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

ca97df9cd1f5 quay.io/ascend/vllm-ascend:v0.20.2rc1-310p "/bin/bash -c 'vllm ..." 10 minutes ago Up 10 minutes 0.0.0.0:8080->8080/tcp qwen3-32b

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4#

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4# curl http://192.168.20.123:8080/v1/completions -H "Content-Type: application/json" -d '{
"prompt": "你好,你是谁,擅长做什么",
"max_completion_tokens": 64,
"temperature": 0.0
}'

{"id":"cmpl-be1ec386f55233a4","object":"text_completion","created":1781575140,"model":"qwen","choices":[{"index":0,"text":"? 您好!我是通义千问,是阿里巴巴集团旗下的通","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":7,"total_tokens":23,"completion_tokens":16,"prompt_tokens_details":null},"kv_transfer_params":null}

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4#
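The reply stops after 16 completion tokens with `finish_reason` "length" even though the request asked for 64 via `max_completion_tokens`. That suggests the `/v1/completions` endpoint did not read that field and fell back to its own default. The sketch below sends the same request from Python and assumes the field the endpoint reads is `max_tokens`, as in the OpenAI completions API. The helper names are made up for illustration, and the `model` value is taken from `--served_model_name qwen`.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": "qwen",           # matches --served_model_name in the compose file
        "prompt": prompt,
        "max_tokens": max_tokens,  # assumption: this is the limit the endpoint honours
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the service running, `complete("http://192.168.20.123:8080", "你好,你是谁,擅长做什么")` would issue the same request as the curl command, with the token limit in `max_tokens`.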

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4#

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4# docker-compose -v

Docker Compose version v2.32.4

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4#

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4# docker -v

Docker version 18.09.0, build d51e3ad

root@localhost Qwen3-32B-w8a8sc-310-vllm-tp4# docker info

Containers: 2

Running: 1

Paused: 0

Stopped: 1

Images: 8

Server Version: 18.09.0

Storage Driver: overlay2

Backing Filesystem: extfs

Supports d_type: true

Native Overlay Diff: true

Logging Driver: json-file

Cgroup Driver: cgroupfs

Hugetlb Pagesize: 2MB, 64KB, 32MB, 1GB, 64KB, 32MB, 2MB, 1GB (default is 2MB)

Plugins:

Volume: local

Network: bridge host macvlan null overlay

Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog

Swarm: inactive

Runtimes: ascend runc

Default Runtime: ascend

Init Binary: docker-init

containerd version: 871075eb7cc979944ba2d987719cb534bbb87e5c

runc version: N/A

init version: N/A (expected: )

Security Options:

seccomp

Profile: default

Kernel Version: 5.10.0-254.0.0.158.oe2203sp4.aarch64

Operating System: openEuler 22.03 (LTS-SP4)

OSType: linux

Architecture: aarch64

CPUs: 32

Total Memory: 124.3GiB

Name: localhost.localdomain

ID: WPOB:6KZK:4WJK:C6PV:5SO4:4JOY:5VTN:5XCI:CG2Z:BCS6:EQGT:GYNC

Docker Root Dir: /data/docker

Debug Mode (client): false

Debug Mode (server): false

Registry: https://index.docker.io/v1/

Labels:

Experimental: false

Insecure Registries:

127.0.0.0/8

Registry Mirrors:

https://registry.docker-cn.com/

https://docker.m.daocloud.io/

https://dockerproxy.com/

https://docker.mirrors.ustc.edu.cn/

https://docker.nju.edu.cn/

https://iju9kaj2.mirror.aliyuncs.com/

http://hub-mirror.c.163.com/

https://cr.console.aliyun.com/

https://hub.docker.com/

http://mirrors.ustc.edu.cn/

https://docker.xuanyuan.me/

Live Restore Enabled: true

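Tip: while the vLLM inference service is running, the bge model cannot be started alongside it, because the NPU cards are held exclusively and cannot be shared.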