在 vLLM 中部署带有 realtime 接口的 ASR 模型,核心在于区分 原生流式 (Streaming) ASR 与 批量转录 (Batch Transcription) 两种模式。vLLM 在 2026 年初正式引入了 WebSocket 实时流式接口 /v1/realtime,但仅适用于特定架构的 ASR 模型。
一、原生 Realtime 流式 ASR(推荐)
适用于支持因果注意力 + 滑动窗口 的流式模型,如 Voxtral Mini 4B Realtime 。这类模型通过 vLLM 的 /v1/realtime WebSocket 端点提供真正的低延迟实时转录。
**事先从modelscope上下载好模型:**mistralai/Voxtral-Mini-4B-Realtime-2602
1. 环境准备
# 确保安装支持音频的 vLLM(建议最新版)
pip install vllm[audio] -i https://pypi.tuna.tsinghua.edu.cn/simple
# 如需使用 Mistral 系列模型,可能需要额外依赖
pip install transformers accelerate -i https://pypi.tuna.tsinghua.edu.cn/simple
2. 启动服务(以 Voxtral 为例)
vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--trust-remote-code \
--compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
--tensor-parallel-size 1 \
--max-model-len 45000 \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.30 \
--host 0.0.0.0 --port 8111
关键参数说明:
-
--tokenizer-mode mistral/--config-format mistral/--load-format mistral:Mistral 模型特有的加载格式 -
--max-model-len 45000:支持约 3 小时连续音频(默认配置) -
--max-num-batched-tokens 8192:控制批处理大小以平衡延迟与吞吐量
3. 客户端连接(WebSocket)
通过 WebSocket 连接 /v1/realtime,发送 PCM16 @ 16kHz 音频流:
import asyncio
import base64
import json
import websockets
import librosa
import numpy as np
def audio_to_pcm16_base64(audio_path: str) -> str:
audio, _ = librosa.load(audio_path, sr=16000, mono=True)
pcm16 = (audio * 32767).astype(np.int16)
return base64.b64encode(pcm16.tobytes()).decode("utf-8")
async def realtime_transcribe(audio_path: str, host: str = "127.0.0.1", port: int = 8000):
uri = f"ws://{host}:{port}/v1/realtime"
async with websockets.connect(uri) as ws:
# 1. 等待 session.created
response = json.loads(await ws.recv())
if response["type"] != "session.created":
raise RuntimeError(f"Unexpected: {response}")
# 2. 更新会话配置(指定模型)
await ws.send(json.dumps({
"type": "session.update",
"model": "mistralai/Voxtral-Mini-4B-Realtime-2602"
}))
# 3. 加载并分块发送音频
audio_bytes = base64.b64decode(audio_to_pcm16_base64(audio_path))
chunk_size = 4096
for i in range(0, len(audio_bytes), chunk_size):
chunk = audio_bytes[i:i + chunk_size]
await ws.send(json.dumps({
"type": "input_audio_buffer.append",
"audio": base64.b64encode(chunk).decode("utf-8")
}))
# 4. 提交最终音频
await ws.send(json.dumps({
"type": "input_audio_buffer.commit",
"final": True
}))
# 5. 接收流式转录结果
while True:
response = json.loads(await ws.recv())
if response["type"] == "transcription.delta":
print(response["delta"], end="", flush=True)
elif response["type"] == "transcription.done":
print(f"\nFinal: {response['text']}")
break
elif response["type"] == "error":
raise RuntimeError(response["error"])
# 运行
asyncio.run(realtime_transcribe("audio.wav"))
消息类型说明:
-
input_audio_buffer.append:持续推送音频块 -
input_audio_buffer.commit:标记音频结束 -
transcription.delta:增量转录结果(实时返回) -
transcription.done:最终完整转录
二、或直接使用docker部署
基于官方镜像构建一个包含音频依赖的自定义镜像即可。
1. 创建 Dockerfile
在你的工作目录(如 ~/models/modelscope/)下创建一个 Dockerfile:
dockerfile
FROM docker.1ms.run/vllm/vllm-openai:v0.20.0
# 安装音频处理所需的依赖(使用清华源加速下载)
RUN uv pip install --system -i https://pypi.tuna.tsinghua.edu.cn/simple 'vllm[audio]==0.20.0' && \
uv pip install --system -i https://pypi.tuna.tsinghua.edu.cn/simple 'mistral-common[soundfile]'
说明 :安装
mistral-common[soundfile]时会自动拉取soundfile及其依赖(如 libsndfile)。同时安装vllm[audio]可以确保其他音频相关依赖也齐全。
2. 构建镜像
cd ~/models/modelscope
docker build -t vllm-voxtral-audio .
3. 运行新镜像
显存20%(24GB)不够用
docker run -d --runtime=nvidia --gpus all \
--name Voxtral-Mini-4B-Realtime-vllm \
-v ~/models/modelscope:/models/modelscope \
-p 8911:8000 \
--ipc=host \
vllm-voxtral-audio \
--model /models/modelscope/Voxtral-Mini-4B-Realtime-2602 \
--served-model-name Voxtral-Mini-4B-Realtime \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--trust-remote-code \
--compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
--tensor-parallel-size 1 \
--max-model-len 45000 \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.30 \
--host 0.0.0.0 --port 8000
验证方法
构建完镜像后,可以先检查 soundfile 是否正确安装:
docker run --rm vllm-voxtral-audio python -c "import soundfile; print('soundfile OK')"
如果看到 soundfile OK,说明依赖安装成功,再启动服务即可。
三、批量转录模式(Whisper 等传统 ASR)
对于 Whisper 等不支持原生流式的模型,vLLM 提供的是 OpenAI 兼容的 HTTP 批量接口 /v1/audio/transcriptions,并非真正的 realtime 流式。
启动命令
vllm serve openai/whisper-small \
--task transcription \
--host 0.0.0.0 --port 8002
调用方式
import requests
with open("audio.wav", "rb") as f:
response = requests.post(
"http://localhost:8002/v1/audio/transcriptions",
files={"file": f},
data={"model": "openai/whisper-small"}
)
print(response.json()["text"])
注意:此模式需等待完整音频上传后才开始推理,延迟较高,适用于离线转录场景。
四、模型支持现状总结
| 模型 | 接口类型 | 是否真实时 | vLLM 支持状态 |
|---|---|---|---|
| Voxtral Mini 4B Realtime | WebSocket /v1/realtime |
✅ 原生流式 | Day 0 支持(2026.02) |
| Qwen3-ASR | HTTP /v1/chat/completions |
❌ 不支持ws流式 | 2026.02 已合并 |
| Whisper 系列 | HTTP /v1/audio/transcriptions |
❌ 批量处理 | 稳定支持 |
| VibeVoice-ASR | HTTP (vllm-asr 扩展) | ❌ 长音频单遍 | 需额外工具 |
五、关键建议
-
模型选择:若需真正的 realtime 接口,必须选用原生支持流式架构的模型(如 Voxtral、Qwen3-ASR)。Whisper 无论通过何种包装都无法实现真正的低延迟流式,只能做"伪实时"(如客户端 VAD 分段后批量发送)。
-
硬件要求 :Voxtral Mini 4B 约需 9GB VRAM(BF16),量化后约 2.5GB;建议配置 16GB+ VRAM 的 GPU 以获得稳定性能。(实际上需要大概36GB)
-
延迟调优 :Voxtral 的延迟可在 80ms ~ 2.4s 之间配置,推荐 480ms 作为准确率与延迟的平衡点。通过调整
--max-num-batched-tokens和音频分块大小可进一步优化。 -
生产部署 :如需 Kubernetes 集群部署,可参考 vLLM Production Stack,使用
vllm-router进行后端路由与健康检查。
如需针对特定模型(如 Qwen3-ASR)的详细部署参数,或需要将 Whisper 改造为流式方案(配合 VAD + faster-whisper),可以进一步说明。
六、Voxtral部署接口:
/openapi.json, Methods: GET, HEAD /docs, Methods: GET, HEAD /docs/oauth2-redirect, Methods: GET, HEAD /redoc, Methods: GET, HEAD /tokenize, Methods: POST /detokenize, Methods: POST /load, Methods: GET /version, Methods: GET /health, Methods: GET /metrics, Methods: GET /v1/models, Methods: GET /ping, Methods: GET /ping, Methods: POST /invocations, Methods: POST /v1/chat/completions, Methods: POST /v1/chat/completions/batch, Methods: POST /v1/responses, Methods: POST /v1/responses/{response_id}, Methods: GET /v1/responses/{response_id}/cancel, Methods: POST /v1/completions, Methods: POST /v1/messages, Methods: POST /v1/messages/count_tokens, Methods: POST /inference/v1/generate, Methods: POST /scale_elastic_ep, Methods: POST /is_scaling_elastic_ep, Methods: POST /generative_scoring, Methods: POST /v1/chat/completions/render, Methods: POST /v1/completions/render, Methods: POST /v1/audio/transcriptions, Methods: POST /v1/audio/translations, Methods: POST /v1/realtime, Endpoint: realtime_endpoint
响应
(base) admin@spark-09f7:~$ curl http://localhost:8111/v1/models
{"object":"list","data":[{"id":"./Voxtral-Mini-4B-Realtime-2602","object":"model","created":1779196505,"owned_by":"vllm","root":"./Voxtral-Mini-4B-Realtime-2602","parent":null,"max_model_len":45000,"permission":[{"id":"modelperm-898d28f549e2ff8a","object":"model_permission","created":1779196505,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
部署Qwen-asr
Qwen3-ASR-1.7B 的 vLLM 部署与 Voxtral 有显著不同:它不需要 Mistral 特有的格式参数 ,且目前不支持 /v1/realtime WebSocket 原生实时接口,而是通过 HTTP /v1/chat/completions 提供流式(SSE)转录。
vllm serve ./Qwen3-ASR-1.7B \
--dtype bfloat16 \
--gpu-memory-utilization 0.85 \
--max-model-len 16384 \
--max-num-seqs 32 \
--max-num-batched-tokens 8192 \
--enforce-eager \
--enable-prefix-caching \
--limit-mm-per-prompt '{"audio":{"count":1,"length":32768}}' \
--served-model-name qwen3-asr-1.7b \
--host 0.0.0.0 \
--port 8111
接口:
/openapi.json, Methods: GET, HEAD /docs, Methods: GET, HEAD /docs/oauth2-redirect, Methods: GET, HEAD /redoc, Methods: GET, HEAD /tokenize, Methods: POST /detokenize, Methods: POST /load, Methods: GET /version, Methods: GET /health, Methods: GET /metrics, Methods: GET /v1/models, Methods: GET /ping, Methods: GET /ping, Methods: POST /invocations, Methods: POST /v1/chat/completions, Methods: POST /v1/chat/completions/batch, Methods: POST /v1/responses, Methods: POST /v1/responses/{response_id}, Methods: GET /v1/responses/{response_id}/cancel, Methods: POST /v1/completions, Methods: POST /v1/messages, Methods: POST /v1/messages/count_tokens, Methods: POST /inference/v1/generate, Methods: POST /scale_elastic_ep, Methods: POST /is_scaling_elastic_ep, Methods: POST /generative_scoring, Methods: POST /v1/chat/completions/render, Methods: POST /v1/completions/render, Methods: POST /v1/audio/transcriptions, Methods: POST /v1/audio/translations, Methods: POST
响应
curl http://localhost:8111/v1/models
{"object":"list","data":[{"id":"qwen3-asr-1.7b","object":"model","created":1779197999,"owned_by":"vllm","root":"./Qwen3-ASR-1.7B","parent":null,"max_model_len":16384,"permission":[{"id":"modelperm-bd1d075fe6b564de","object":"model_permission","created":1779197999,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}