Deploying the Qwen Next architecture model Qwen3-Next-80B-A3B-Instruct

Resource provisioning:

ecs.gn7e-c16g1.16xlarge: 4 × NVIDIA A100 80 GB

OS image selection (for the test phase, this avoids the hassle of pre-installing components yourself):

Installing the GPU driver:

If the "Install GPU Driver" option is selected, the driver is installed by the following script:

"#!/bin/sh

#Please input version to install

DRIVER_VERSION="570.195.03"

CUDA_VERSION="12.8.1"

CUDNN_VERSION="9.8.0.87"

IS_INSTALL_eRDMA="FALSE"

IS_INSTALL_RDMA="FALSE"

INSTALL_DIR="/root/auto_install"

#using .run to install driver and cuda

auto_install_script="auto_install_v4.0.sh"

script_download_url=(curlhttp://100.100.100.200/latest/meta−data/source−address∣head−1)"/opsx/ecs/linux/binary/script/(curl http://100.100.100.200/latest/meta-data/source-address | head -1)"/opsx/ecs/linux/binary/script/(curlhttp://100.100.100.200/latest/meta−data/source−address∣head−1)"/opsx/ecs/linux/binary/script/{auto_install_script}"

echo $script_download_url

rm -rf $INSTALL_DIR

mkdir -p $INSTALL_DIR

cd INSTALL_DIR \&\& wget -t 10 --timeout=10 script_download_url && bash INSTALLDIR/{INSTALL_DIR}/INSTALLDIR/{auto_install_script} DRIVER_VERSION CUDA_VERSION CUDNN_VERSION IS_INSTALL_RDMA $IS_INSTALL_eRDMA "

Allocate a public IP and 200 Mbps of bandwidth to speed up image pulls.

Check the installation:
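As a quick sanity check (assuming the script above installed the driver and CUDA toolkit to their default locations):

```bash
# The driver should report all four A100s
nvidia-smi
# CUDA toolkit version; the installer normally symlinks /usr/local/cuda
/usr/local/cuda/bin/nvcc --version
```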

Model download:

The fastest and most recommended approach is hf download (the Hugging Face CLI).

The detailed reasons and recommended steps are as follows:

  1. Why hf download or huggingface-cli?

Multithreaded downloads: a traditional git clone (even with LFS) is usually single-threaded or has limited concurrency. The official Hugging Face CLI downloads files in parallel and can saturate your bandwidth.

Resumable downloads: model files of this size (Qwen3-80B can exceed 100 GB) make git clone fragile; one network hiccup and the clone breaks and is hard to recover, whereas the CLI resumes downloads robustly (see the example after this list).

Smarter caching: the CLI manages cache symlinks automatically, which makes it easy to reuse the model across multiple projects.
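For instance, resuming is as simple as re-running the identical command; completed files are skipped and partial files pick up where they stopped. A minimal sketch, assuming the CLI has already been installed as described further below:

```bash
# Interrupted by a network drop partway through
hf download Qwen/Qwen3-Next-80B-A3B-Instruct --local-dir Qwen3-80B
# Run the same command again: finished shards are skipped,
# unfinished ones resume rather than restart
hf download Qwen/Qwen3-Next-80B-A3B-Instruct --local-dir Qwen3-80B
```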

Miniconda installation

Use Miniconda for environment isolation: the Python bundled with the OS may be quite old, and changing it directly can break other projects.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh

# Keep answering yes / pressing Enter at the prompts; restart the terminal after the installation finishes

conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main

conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

~/miniconda3/bin/conda init bash

source ~/.bashrc

nohup hf download Qwen/Qwen3-Next-80B-A3B-Instruct --local-dir Qwen3-80B > download.log 2>&1 &

Keep the download in the background with nohup so a dropped connection does not interrupt it.
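To keep an eye on the background job (a hypothetical session, assuming the nohup command above was launched from the current directory):

```bash
# Follow the progress that nohup redirects to download.log
tail -f download.log
# Confirm the downloader is still alive
ps aux | grep "hf download"
# How much has landed on disk so far
du -sh Qwen3-80B
```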

pip install vllm

# pip install 'vllm>=0.10.2'
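The commented constraint suggests Qwen3-Next needs a sufficiently new vLLM (>= 0.10.2), so it is worth confirming what actually got installed:

```bash
# Print the installed vLLM version
python -c "import vllm; print(vllm.__version__)"
# Or check via pip
pip show vllm | grep -i "^version"
```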

1. Create the environment

conda create -n hf_download python=3.10 -y

2. Activate the environment

conda activate hf_download

3. Install the tool

pip install -U "huggingface_hub[cli]"

4. Start the high-speed download

huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct --local-dir Qwen3-80B
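Optionally, the transfer itself can be sped up with the hf_transfer backend (an optional extra of huggingface_hub; treat this as a suggestion to verify against your installed version rather than a required step):

```bash
# Rust-based multi-threaded transfer backend (optional extra)
pip install -U "huggingface_hub[hf_transfer]"
# Enable it for this shell, then download as usual
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct --local-dir Qwen3-80B
```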

Upgrade Python to 3.9+ (required by the hf CLI installer below; the stock Python on this image is 3.6.8):

wget https://www.python.org/ftp/python/3.9.0/Python-3.9.0.tgz && tar -xzf Python-3.9.0.tgz

cd Python-3.9.0 && ./configure --enable-optimizations
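./configure by itself does not install anything; a minimal sketch of the remaining build steps, assuming the usual build dependencies (gcc, make, openssl-devel, zlib-devel, libffi-devel) are already present:

```bash
# Build with all cores, then install as python3.9 without
# touching the system /usr/bin/python3
make -j"$(nproc)"
make altinstall
python3.9 --version
```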

Install the hf CLI:

curl -LsSf https://hf.co/cli/install.sh | bash

[root@iZ6wefjm84ecrn6084vwv5Z ~]# curl -LsSf https://hf.co/cli/install.sh | bash
[INFO] Installing Hugging Face CLI...
[INFO] OS: linux
[INFO] Force reinstall: false
[INFO] Install dir: /root/.hf-cli
[INFO] Bin dir: /root/.local/bin
[INFO] Skip PATH update: false
[WARNING] python3 detected (Python 3.6.8) but Python 3.9+ is required.
[WARNING] python detected (Python 3.6.8) but Python 3.9+ is required.
[ERROR] Python 3.9+ is required but was not found.
[INFO] On Fedora/RHEL: sudo dnf install python3 python3-pip

#### 1. Install

pip install -U "huggingface_hub[cli]"

#### 2. Download (use the --parallel option to speed things up)

huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct --local-dir Qwen3-80B

On the machine in Japan:

Preferred: hf download (the cURL approach shown at the bottom of the screenshot).

Fallback: huggingface-cli download.

### Loading the model with vLLM

(base) [root@iZ6wefjm84ecrn6084vwv5Z 9c7f2fbe84465e40164a94cc16cd30b6999b0cc7]# vllm serve . \
    --tensor-parallel-size 4 \
    --served-model-name qwen3-next \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code

(APIServer pid=46226) INFO 02-10 19:54:24 [utils.py:325]

(APIServer pid=46226) INFO 02-10 19:54:24 [utils.py:325] █ █ █▄ ▄█

(APIServer pid=46226) INFO 02-10 19:54:24 [utils.py:325] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.15.1

(APIServer pid=46226) INFO 02-10 19:54:24 [utils.py:325] █▄█▀ █ █ █ █ model .

(APIServer pid=46226) INFO 02-10 19:54:24 [utils.py:325] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀

(APIServer pid=46226) INFO 02-10 19:54:24 [utils.py:325]

(APIServer pid=46226) INFO 02-10 19:54:24 [utils.py:261] non-default args: {'model_tag': '.', 'api_server_count': 1, 'model': '.', 'trust_remote_code': True, 'served_model_name': ['qwen3-next'], 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.85}

(APIServer pid=46226) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.

(APIServer pid=46226) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.

(APIServer pid=46226) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.

(APIServer pid=46226) INFO 02-10 19:54:30 [model.py:541] Resolved architecture: Qwen3NextForCausalLM

(APIServer pid=46226) INFO 02-10 19:54:30 [model.py:1561] Using max model len 262144

(APIServer pid=46226) INFO 02-10 19:54:30 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.

(APIServer pid=46226) INFO 02-10 19:54:31 [config.py:504] Setting attention block size to 272 tokens to ensure that attention page size is >= mamba page size.

(APIServer pid=46226) INFO 02-10 19:54:31 [config.py:535] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.

(APIServer pid=46226) INFO 02-10 19:54:31 [vllm.py:624] Asynchronous scheduling is enabled.

(EngineCore_DP0 pid=46543) INFO 02-10 19:54:35 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='.', speculative_config=None, tokenizer='.', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3-next, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}

(EngineCore_DP0 pid=46543) WARNING 02-10 19:54:35 [multiproc_executor.py:910] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.

INFO 02-10 19:54:41 [parallel_state.py:1212] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:49101 backend=nccl

INFO 02-10 19:54:41 [parallel_state.py:1212] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:49101 backend=nccl

INFO 02-10 19:54:41 [parallel_state.py:1212] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:49101 backend=nccl

INFO 02-10 19:54:41 [parallel_state.py:1212] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:49101 backend=nccl

INFO 02-10 19:54:42 [pynccl.py:111] vLLM is using nccl==2.27.5

WARNING 02-10 19:54:42 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.

WARNING 02-10 19:54:42 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.

WARNING 02-10 19:54:42 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.

WARNING 02-10 19:54:42 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.

INFO 02-10 19:54:42 [parallel_state.py:1423] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1

INFO 02-10 19:54:42 [parallel_state.py:1423] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2

INFO 02-10 19:54:42 [parallel_state.py:1423] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3

INFO 02-10 19:54:42 [parallel_state.py:1423] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0

(Worker_TP0 pid=46677) INFO 02-10 19:54:43 [gpu_model_runner.py:4033] Starting to load model ...

(Worker_TP0 pid=46677) INFO 02-10 19:55:00 [unquantized.py:103] Using TRITON backend for Unquantized MoE

(Worker_TP0 pid=46677) INFO 02-10 19:55:00 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')

Loading safetensors checkpoint shards: 0% Completed | 0/41 [00:00<?, ?it/s]

Loading safetensors checkpoint shards: 2% Completed | 1/41 [00:00<00:20, 1.98it/s]

Loading safetensors checkpoint shards: 5% Completed | 2/41 [00:01<00:21, 1.79it/s]

Loading safetensors checkpoint shards: 7% Completed | 3/41 [00:01<00:22, 1.73it/s]

Loading safetensors checkpoint shards: 10% Completed | 4/41 [00:02<00:22, 1.66it/s]

Loading safetensors checkpoint shards: 12% Completed | 5/41 [00:03<00:22, 1.60it/s]

Loading safetensors checkpoint shards: 15% Completed | 6/41 [00:03<00:20, 1.73it/s]

Loading safetensors checkpoint shards: 17% Completed | 7/41 [00:04<00:19, 1.77it/s]

Loading safetensors checkpoint shards: 20% Completed | 8/41 [00:04<00:18, 1.74it/s]

Loading safetensors checkpoint shards: 22% Completed | 9/41 [00:05<00:18, 1.74it/s]

Loading safetensors checkpoint shards: 24% Completed | 10/41 [00:05<00:17, 1.76it/s]

Loading safetensors checkpoint shards: 27% Completed | 11/41 [00:06<00:16, 1.81it/s]

Loading safetensors checkpoint shards: 29% Completed | 12/41 [00:06<00:16, 1.74it/s]

Loading safetensors checkpoint shards: 32% Completed | 13/41 [00:07<00:16, 1.73it/s]

Loading safetensors checkpoint shards: 34% Completed | 14/41 [00:08<00:15, 1.74it/s]

Loading safetensors checkpoint shards: 37% Completed | 15/41 [00:08<00:15, 1.70it/s]

Loading safetensors checkpoint shards: 39% Completed | 16/41 [00:09<00:14, 1.70it/s]

Loading safetensors checkpoint shards: 41% Completed | 17/41 [00:09<00:13, 1.73it/s]

Loading safetensors checkpoint shards: 46% Completed | 19/41 [00:10<00:09, 2.22it/s]

Loading safetensors checkpoint shards: 49% Completed | 20/41 [00:11<00:10, 2.00it/s]

Loading safetensors checkpoint shards: 51% Completed | 21/41 [00:11<00:10, 1.88it/s]

Loading safetensors checkpoint shards: 54% Completed | 22/41 [00:12<00:10, 1.83it/s]

Loading safetensors checkpoint shards: 56% Completed | 23/41 [00:12<00:09, 1.82it/s]

Loading safetensors checkpoint shards: 59% Completed | 24/41 [00:13<00:09, 1.81it/s]

Loading safetensors checkpoint shards: 61% Completed | 25/41 [00:13<00:09, 1.76it/s]

Loading safetensors checkpoint shards: 63% Completed | 26/41 [00:14<00:08, 1.72it/s]

Loading safetensors checkpoint shards: 66% Completed | 27/41 [00:15<00:08, 1.71it/s]

Loading safetensors checkpoint shards: 68% Completed | 28/41 [00:15<00:07, 1.70it/s]

Loading safetensors checkpoint shards: 71% Completed | 29/41 [00:16<00:07, 1.67it/s]

Loading safetensors checkpoint shards: 73% Completed | 30/41 [00:17<00:06, 1.65it/s]

Loading safetensors checkpoint shards: 76% Completed | 31/41 [00:17<00:06, 1.63it/s]

Loading safetensors checkpoint shards: 78% Completed | 32/41 [00:18<00:05, 1.67it/s]

Loading safetensors checkpoint shards: 80% Completed | 33/41 [00:18<00:04, 1.66it/s]

Loading safetensors checkpoint shards: 83% Completed | 34/41 [00:19<00:04, 1.71it/s]

Loading safetensors checkpoint shards: 85% Completed | 35/41 [00:19<00:03, 1.70it/s]

Loading safetensors checkpoint shards: 88% Completed | 36/41 [00:20<00:03, 1.66it/s]

Loading safetensors checkpoint shards: 90% Completed | 37/41 [00:21<00:02, 1.65it/s]

Loading safetensors checkpoint shards: 93% Completed | 38/41 [00:21<00:01, 1.68it/s]

Loading safetensors checkpoint shards: 95% Completed | 39/41 [00:22<00:01, 1.71it/s]

Loading safetensors checkpoint shards: 98% Completed | 40/41 [00:22<00:00, 1.74it/s]

Loading safetensors checkpoint shards: 100% Completed | 41/41 [00:23<00:00, 1.76it/s]

Loading safetensors checkpoint shards: 100% Completed | 41/41 [00:23<00:00, 1.75it/s]

(Worker_TP0 pid=46677)

(Worker_TP0 pid=46677) INFO 02-10 19:55:24 [default_loader.py:291] Loading weights took 23.52 seconds

(Worker_TP0 pid=46677) INFO 02-10 19:55:24 [gpu_model_runner.py:4130] Model loading took 37.22 GiB memory and 40.701756 seconds

(Worker_TP0 pid=46677) INFO 02-10 19:55:33 [backends.py:812] Using cache directory: /root/.cache/vllm/torch_compile_cache/3bbbe12d51/rank_0_0/backbone for vLLM's torch.compile

(Worker_TP0 pid=46677) INFO 02-10 19:55:33 [backends.py:872] Dynamo bytecode transform time: 8.17 s

(Worker_TP0 pid=46677) INFO 02-10 19:55:41 [backends.py:302] Cache the graph of compile range (1, 2048) for later use

(Worker_TP2 pid=46679) INFO 02-10 19:55:41 [backends.py:302] Cache the graph of compile range (1, 2048) for later use

(Worker_TP3 pid=46680) INFO 02-10 19:55:41 [backends.py:302] Cache the graph of compile range (1, 2048) for later use

(Worker_TP1 pid=46678) INFO 02-10 19:55:41 [backends.py:302] Cache the graph of compile range (1, 2048) for later use

(Worker_TP0 pid=46677) INFO 02-10 19:55:45 [fused_moe.py:1077] Using configuration from /root/miniconda3/lib/python3.13/site-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_A100-SXM4-80GB.json for MoE layer.

(Worker_TP0 pid=46677) INFO 02-10 19:55:58 [backends.py:319] Compiling a graph for compile range (1, 2048) takes 20.67 s

(Worker_TP0 pid=46677) INFO 02-10 19:55:58 [monitor.py:34] torch.compile takes 28.84 s in total

(Worker_TP0 pid=46677) INFO 02-10 19:56:00 [gpu_worker.py:356] Available KV cache memory: 28.1 GiB

(EngineCore_DP0 pid=46543) INFO 02-10 19:56:00 [kv_cache_utils.py:1307] GPU KV cache size: 613,632 tokens

(EngineCore_DP0 pid=46543) INFO 02-10 19:56:00 [kv_cache_utils.py:1312] Maximum concurrency for 262,144 tokens per request: 9.34x

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████| 51/51 [00:10<00:00, 4.72it/s]

Capturing CUDA graphs (decode, FULL): 94%|███████████████████████████████████████████████████████▋ | 33/35 [00:07<00:00, 5.68it/s](Worker_TP3 pid=46680) INFO 02-10 19:56:19 [custom_all_reduce.py:216] Registering 8342 cuda graph addresses

(Worker_TP2 pid=46679) INFO 02-10 19:56:19 [custom_all_reduce.py:216] Registering 8342 cuda graph addresses

Capturing CUDA graphs (decode, FULL): 97%|█████████████████████████████████████████████████████████▎ | 34/35 [00:07<00:00, 5.67it/s](Worker_TP1 pid=46678) INFO 02-10 19:56:19 [custom_all_reduce.py:216] Registering 8342 cuda graph addresses

Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████| 35/35 [00:07<00:00, 4.46it/s]

(Worker_TP0 pid=46677) INFO 02-10 19:56:19 [custom_all_reduce.py:216] Registering 8342 cuda graph addresses

(Worker_TP0 pid=46677) INFO 02-10 19:56:19 [gpu_model_runner.py:5063] Graph capturing finished in 20 secs, took 1.79 GiB

(EngineCore_DP0 pid=46543) INFO 02-10 19:56:20 [core.py:272] init engine (profile, create kv cache, warmup model) took 55.31 seconds

(EngineCore_DP0 pid=46543) INFO 02-10 19:56:20 [vllm.py:624] Asynchronous scheduling is enabled.

(APIServer pid=46226) INFO 02-10 19:56:20 [api_server.py:665] Supported tasks: ['generate']

(APIServer pid=46226) WARNING 02-10 19:56:20 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's generation_config.json: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.

(APIServer pid=46226) INFO 02-10 19:56:20 [serving.py:177] Warming up chat template processing...

(APIServer pid=46226) INFO 02-10 19:56:21 [hf.py:310] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.

(APIServer pid=46226) INFO 02-10 19:56:21 [serving.py:212] Chat template warmup completed in 320.9ms

(APIServer pid=46226) INFO 02-10 19:56:21 [api_server.py:946] Starting vLLM API server 0 on http://0.0.0.0:8000

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:38] Available routes are:

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /docs, Methods: HEAD, GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /redoc, Methods: HEAD, GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /tokenize, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /detokenize, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /inference/v1/generate, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /pause, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /resume, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /is_paused, Methods: GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /metrics, Methods: GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /health, Methods: GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/chat/completions, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/responses, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/audio/translations, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/completions, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/completions/render, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/messages, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/models, Methods: GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /load, Methods: GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /version, Methods: GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /ping, Methods: GET

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /ping, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /invocations, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /classify, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/embeddings, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /score, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/score, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /rerank, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v1/rerank, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /v2/rerank, Methods: POST

(APIServer pid=46226) INFO 02-10 19:56:21 [launcher.py:46] Route: /pooling, Methods: POST

(APIServer pid=46226) INFO: Started server process [46226]

(APIServer pid=46226) INFO: Waiting for application startup.

(APIServer pid=46226) INFO: Application startup complete.
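With startup complete, the server can be probed on the routes listed above before sending a real request (assuming it is reachable on localhost:8000 as launched):

```bash
# Liveness probe
curl -s http://localhost:8000/health
# The served model should be listed as "qwen3-next"
curl -s http://localhost:8000/v1/models
```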

Test the API with curl:

[root@iZ6wefjm84ecrn6084vwv5Z ~]# curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-next", "messages": [{"role": "user", "content": "为什么大模型TOKEN对于云计算企业这样重要。"}]}'

The prompt asks why LLM tokens matter so much to cloud-computing companies; the server answers with a full chat.completion response (assistant content kept verbatim, in Chinese):

{"id":"chatcmpl-ae023d3165ec42f5","object":"chat.completion","created":1770725767,"model":"qwen3-next","choices":[{"index":0,"message":{"role":"assistant","content":"这是一个非常关键且前沿的问题。在大模型(Large Language Models, LLMs)时代,"Token"虽然只是一个技术术语(指文本被分词后的最小单位),但它对云计算企业的重要性远超其字面含义------它已成为**算力经济、商业模式、竞争壁垒和收入引擎** 的核心计量单位。以下是详细解析:\n\n---\n\n### 一、Token 是大模型使用量的"基本计量单位"\n\n大模型的运行依赖于"输入 + 输出"的文本处理。无论是用户提问、生成文章、写代码,还是企业API调用,系统都必须将文本切分为Token进行处理。\n\n- 1个Token ≈ 0.75个英文单词 或 1~2个中文字符\n- 一次对话可能消耗几百到上万个Token\n- 一次API调用的计费 = 输入Token数 + 输出Token数\n\n👉 **因此,Token = 用户使用量 = 云计算资源消耗量 = 企业收入来源** \n\n---\n\n### 二、Token驱动云计算企业的商业模式变革\n\n#### 1. **从"卖服务器"到"卖Token"** \n过去:AWS、阿里云卖的是EC2实例、存储、带宽。\n现在:OpenAI、Anthropic、百度文心、通义千问等通过API提供大模型服务,**按Token收费** 。\n\n- OpenAI:$0.50 / 1M 输入Token,$1.50 / 1M 输出Token(GPT-4-turbo)\n- 阿里云通义千问:按Token计费,价格远低于OpenAI\n- 腾讯混元:按调用次数或Token阶梯计价\n\n**云计算企业不再卖硬件,而是卖"智能计算服务"------而Token就是这个服务的"货币"。** \n\n#### 2. **Token成为定价与利润的核心** \n- 模型推理成本 = Token数 × 每Token的算力开销(GPU时间、显存、能耗)\n- 云计算企业通过优化模型压缩、推理加速、批处理等技术,**降低每Token成本** ,从而提升利润率\n- 高并发、高效率的云平台,能以更低的每Token成本提供服务,形成价格优势\n\n> 💡 举例:如果A公司每Token成本是$0.0000002,B公司是$0.0000005,那么A在同样收入下利润高出150%!\n\n---\n\n### 三、Token是云计算企业竞争的"新战场"\n\n| 竞争维度 | 传统云服务 | 大模型时代 |\n|----------|-------------|--------------|\n| 核心指标 | CPU利用率、网络延迟 | 每秒处理Token数(TPS)、每Token成本 |\n| 竞争焦点 | 价格、稳定性、地域覆盖 | 推理速度、Token性价比、长上下文支持 |\n| 技术壁垒 | 虚拟化、网络优化 | 模型并行、KV缓存优化、FlashAttention、量化压缩 |\n\n👉 谁能用更少的算力处理更多Token,谁就能赢得客户、抢占市场。\n\n---\n\n### 四、Token数据是云计算企业的"黄金资产"\n\n- **用户行为数据** :通过Token分析,企业知道用户在问什么、用在什么场景(客服、写作、编程、医疗)\n- **模型迭代依据** :高频Token组合 → 指导模型微调、数据增强、领域优化\n- **商业化拓展** :识别高价值行业(如法律、金融)→ 推出垂直行业模型 + 高价套餐\n\n> 比如:阿里云发现"合同生成"类Token激增 → 推出"法务通义"专属模型 → 收费翻倍\n\n---\n\n### 五、Token推动云计算架构重构\n\n为了高效处理Token,云厂商必须:\n\n1. **部署专用AI芯片** :如阿里含光800、华为昇腾、AWS Trainium/Inferentia\n2. **构建推理集群** :优化KV Cache、PagedAttention、连续批处理(Continuous Batching)\n3. **边缘+云端协同** :轻量模型在边缘处理短Token,重模型在云端处理长上下文\n4. **动态扩缩容** :根据Token请求峰值自动调度GPU资源\n\n👉 这些技术投入,都是围绕"高效处理Token"展开的。\n\n---\n\n### 六、未来趋势:Token将成为"数字世界的石油"\n\n- **Token经济** :未来可能出现Token交易市场、Token期货、Token质押\n- **企业成本结构** :SaaS公司成本中"LLM Token支出"将超过服务器、人力\n- **监管焦点** :各国可能对Token能耗、碳足迹进行统计与征税\n\n---\n\n### ✅ 总结:为什么Token对云计算企业如此重要?\n\n| 维度 | 重要性 |\n|------|--------|\n| **收入来源** | Token是计费单位,直接决定营收 |\n| **成本控制** | 每Token成本决定利润率 |\n| **技术竞争** | 处理Token的效率 = 核心竞争力 |\n| **商业洞察** | Token数据 = 用户需求地图 |\n| **战略转型** | 从IaaS/PaaS向AI服务转型的核心载体 |\n| **生态构建** | 基于Token的API经济催生开发者生态 |\n\n> 🌟 **一句话总结** : \n> **在大模型时代,云计算企业不再卖"算力",而是卖"智能文本处理能力"------而Token,就是这种能力的唯一通用货币。** \n\n---\n\n如果你是云计算企业的决策者,**你必须把"Token"当作核心KPI来管理**------就像当年管理CPU利用率一样。谁掌握了Token的效率与规模,谁就掌握了AI时代的云市场。","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":18,"total_tokens":1261,"completion_tokens":1243,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

![Screenshot](https://i-blog.csdnimg.cn/direct/ae1828fd11fa455d993ae6b441e66bbc.png)

References:

- https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list
- https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
- https://modelscope.cn/models/Qwen/Qwen3-Next-80B-A3B-Instruct
