Use vLLM-Omni to run three separate inference services (text-to-image / single-image editing / multi-image editing), then put a FastAPI gateway in front of them as the single public API (you can model it on the OpenAI-style /v1/images/generations + /v1/images/edits).
Key facts:
• vLLM-Omni exposes an OpenAI DALL·E-compatible /v1/images/generations endpoint for text-to-image, and one server instance serves exactly one model.
• The typical way to do image-to-image editing with vLLM-Omni is to serve Qwen-Image-Edit (or the 2509 multi-image variant) and send text + image(s) to /v1/chat/completions.
• The official vLLM-Omni Docker image is vllm/vllm-omni.
• According to its model card, Z-Image-Turbo runs comfortably on consumer GPUs with 16GB of VRAM.
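Concretely, an editing request sent to /v1/chat/completions looks roughly like this. This is a sketch based on the vLLM-Omni image-to-image example and the request builder in section 4.4; fields like size and num_inference_steps are generation parameters that sit alongside messages, not part of the standard OpenAI chat schema:
python
# Sketch of an image-edit request body for vLLM-Omni's /v1/chat/completions.
# The OpenAI Python SDK would pass the generation parameters via extra_body,
# which merges them into the top level of the JSON body; this shows that shape.
edit_request = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Turn this photo into a watercolor painting"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,...."}},
        ],
    }],
    "size": "1024x1024",
    "num_inference_steps": 50,
}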
⸻
- Target architecture
• vLLM-Omni text-to-image: Tongyi-MAI/Z-Image-Turbo → port 8000 → /v1/images/generations
• vLLM-Omni single-image editing: Qwen/Qwen-Image-Edit → port 8092 → /v1/chat/completions
• vLLM-Omni multi-image editing: Qwen/Qwen-Image-Edit-2509 → port 8093 → /v1/chat/completions (multiple images in messages)
• FastAPI gateway: the single public entry point
• POST /v1/images/generations (JSON) → forwarded to the Z-Image service
• POST /v1/images/edits (multipart, single image) → forwarded to Qwen-Image-Edit
• POST /v1/images/edits/multi (multipart, multiple images) → forwarded to Qwen-Image-Edit-2509
Note: if you don't need multi-image editing, you can deploy just one editing model; you can also deploy only the text-to-image service.
⸻
- Server recommendations (minimum viable / recommended)
Minimum viable (trial / internal testing)
• GPU: 16GB VRAM (the Z-Image-Turbo model card says 16GB VRAM works well)
• CPU: 8+ cores
• RAM: 32GB+
• SSD: NVMe, 200GB+ (HF cache + logs)
Caveat: if you want all three model services running at the same time, plan for at least 2-3 GPUs, or split them across machines / start them only when needed (each vLLM-Omni instance serves a single model).
Recommended (public-facing service)
• GPU: 24GB+ VRAM gives more headroom (concurrency, 1024px and higher resolutions, heavier editing models)
• RAM: 64GB+
• CPU: 16-32 cores
⸻
- vLLM-Omni deployment (Docker Compose recommended)
2.1 Machine prerequisites
- Install the NVIDIA driver + CUDA
- Install Docker + nvidia-container-toolkit (so containers can access the GPU)
2.2 docker-compose.yml (3 vLLM-Omni services + 1 FastAPI gateway)
The official vLLM-Omni image is vllm/vllm-omni.
yaml
version: "3.9"

services:
  vllm_zimage:
    image: vllm/vllm-omni:v0.12.0rc1
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - HF_HOME=/root/.cache/huggingface
      # - HF_TOKEN=xxx  # enable only if the model requires a token
    volumes:
      - ./hf_cache:/root/.cache/huggingface
    ipc: host
    ports:
      - "8000:8000"
    command:
      [
        "vllm", "serve", "Tongyi-MAI/Z-Image-Turbo",
        "--omni",
        "--host", "0.0.0.0",
        "--port", "8000"
      ]

  vllm_edit_single:
    image: vllm/vllm-omni:v0.12.0rc1
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
      - HF_HOME=/root/.cache/huggingface
    volumes:
      - ./hf_cache:/root/.cache/huggingface
    ipc: host
    ports:
      - "8092:8092"
    command:
      [
        "vllm", "serve", "Qwen/Qwen-Image-Edit",
        "--omni",
        "--host", "0.0.0.0",
        "--port", "8092"
      ]

  vllm_edit_multi:
    image: vllm/vllm-omni:v0.12.0rc1
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=2
      - HF_HOME=/root/.cache/huggingface
    volumes:
      - ./hf_cache:/root/.cache/huggingface
    ipc: host
    ports:
      - "8093:8093"
    command:
      [
        "vllm", "serve", "Qwen/Qwen-Image-Edit-2509",
        "--omni",
        "--host", "0.0.0.0",
        "--port", "8093"
      ]

  api_gateway:
    build: ./gateway
    restart: unless-stopped
    environment:
      - VLLM_T2I_BASE=http://vllm_zimage:8000
      - VLLM_EDIT_SINGLE_BASE=http://vllm_edit_single:8092
      - VLLM_EDIT_MULTI_BASE=http://vllm_edit_multi:8093
      - API_AUTH_TOKEN=change-me
      - MAX_UPLOAD_MB=20
      - HTTP_TIMEOUT_S=300
    ports:
      - "9000:9000"
    depends_on:
      - vllm_zimage
      - vllm_edit_single
      - vllm_edit_multi
Start it:
bash
docker compose up -d
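The first start can take a while because model weights are downloaded into ./hf_cache, so it helps to poll for readiness before wiring anything else in. A minimal Python sketch (ports assume the compose file above; /healthz comes from the gateway in section 4.5, and the backend /health path is assumed to match upstream vLLM's OpenAI server):
python
# wait_ready.py - minimal readiness probe (sketch; ports assume the compose file above).
import time
import httpx

# Gateway health endpoint (section 4.5) plus the three vLLM-Omni backends.
# /health on the backends is assumed to behave like upstream vLLM's OpenAI server.
URLS = [
    "http://localhost:9000/healthz",
    "http://localhost:8000/health",
    "http://localhost:8092/health",
    "http://localhost:8093/health",
]

def wait_ready(url: str, attempts: int = 120, delay_s: float = 5.0) -> bool:
    for _ in range(attempts):
        try:
            if httpx.get(url, timeout=5).status_code == 200:
                return True
        except httpx.HTTPError:
            pass  # service still starting / model still loading
        time.sleep(delay_s)
    return False

if __name__ == "__main__":
    for u in URLS:
        print(u, "ready" if wait_ready(u) else "NOT ready")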
⸻
- FastAPI gateway project (complete and runnable)
3.1 Directory layout
text
gateway/
  Dockerfile
  requirements.txt
  app/
    __init__.py
    main.py
    config.py
    schemas.py
    vllm_client.py
    utils_images.py
3.2 requirements.txt
text
fastapi==0.115.6
uvicorn[standard]==0.30.6
httpx==0.27.2
pydantic==2.9.2
python-multipart==0.0.12
3.3 Dockerfile
dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app ./app
EXPOSE 9000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "9000"]
⸻
- FastAPI code
4.1 app/config.py
python
from pydantic import BaseModel
import os


class Settings(BaseModel):
    vllm_t2i_base: str = os.getenv("VLLM_T2I_BASE", "http://localhost:8000")
    vllm_edit_single_base: str = os.getenv("VLLM_EDIT_SINGLE_BASE", "http://localhost:8092")
    vllm_edit_multi_base: str = os.getenv("VLLM_EDIT_MULTI_BASE", "http://localhost:8093")
    api_auth_token: str = os.getenv("API_AUTH_TOKEN", "change-me")
    max_upload_mb: int = int(os.getenv("MAX_UPLOAD_MB", "20"))
    http_timeout_s: int = int(os.getenv("HTTP_TIMEOUT_S", "300"))


settings = Settings()
4.2 app/schemas.py
python
from pydantic import BaseModel
from typing import Optional, Literal, Any, Dict


class ImageGenerateRequest(BaseModel):
    # OpenAI DALL·E compatible fields (pass-through)
    prompt: str
    n: int = 1
    size: str = "1024x1024"
    response_format: Literal["b64_json", "url"] = "b64_json"
    # Optional: vendor-specific extras
    extra_body: Optional[Dict[str, Any]] = None


class OpenAIImageData(BaseModel):
    b64_json: Optional[str] = None
    url: Optional[str] = None


class OpenAIImageResponse(BaseModel):
    created: int
    data: list[OpenAIImageData]
4.3 app/utils_images.py
python
import base64


def bytes_to_data_url(img_bytes: bytes, mime: str) -> str:
    b64 = base64.b64encode(img_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def guess_mime(filename: str, content_type: str | None) -> str:
    if content_type:
        return content_type
    fn = (filename or "").lower()
    if fn.endswith(".png"):
        return "image/png"
    if fn.endswith(".jpg") or fn.endswith(".jpeg"):
        return "image/jpeg"
    if fn.endswith(".webp"):
        return "image/webp"
    return "image/png"


def strip_data_url_prefix(data_url: str) -> str:
    # data:image/png;base64,xxxx
    if "," in data_url:
        return data_url.split(",", 1)[1]
    return data_url
4.4 app/vllm_client.py
python
import httpx
from typing import Any, Dict, List, Optional


class VLLMClient:
    def __init__(self, base_url: str, timeout_s: int = 300):
        self.base_url = base_url.rstrip("/")
        self.timeout = httpx.Timeout(timeout_s)

    async def t2i_generate(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        url = f"{self.base_url}/v1/images/generations"
        async with httpx.AsyncClient(timeout=self.timeout) as client:
            r = await client.post(url, json=payload)
            r.raise_for_status()
            return r.json()

    async def chat_completions(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        url = f"{self.base_url}/v1/chat/completions"
        async with httpx.AsyncClient(timeout=self.timeout) as client:
            r = await client.post(url, json=payload)
            r.raise_for_status()
            return r.json()


def build_edit_chat_payload(
    prompt: str,
    image_data_urls: List[str],
    *,
    size: str = "1024x1024",
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    seed: Optional[int] = None,
    negative_prompt: Optional[str] = None,
    num_outputs_per_prompt: int = 1,
) -> Dict[str, Any]:
    # vLLM-Omni image-to-image pattern: messages.content carries text + one or more image_url parts.
    # See https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/image_to_image/
    content = [{"type": "text", "text": prompt}]
    for u in image_data_urls:
        content.append({"type": "image_url", "image_url": {"url": u}})

    # Diffusion parameters go into the top level of the request body. (The OpenAI SDK
    # sends them via extra_body, which merges into the top-level JSON; since we POST
    # raw JSON with httpx, merge them here ourselves.)
    payload: Dict[str, Any] = {
        "messages": [{"role": "user", "content": content}],
        "size": size,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "num_outputs_per_prompt": num_outputs_per_prompt,
    }
    if seed is not None:
        payload["seed"] = seed
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt
    return payload
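For orientation, the response that the gateway parses in section 4.5 has roughly this shape. This is a sketch inferred from vLLM-Omni's editing examples and the extraction logic below; ordinary OpenAI chat fields are omitted:
python
# Sketch of the chat-completions response returned by the editing backends
# (only the parts the gateway reads; other OpenAI chat fields omitted).
example_edit_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,...."}},
            ],
        },
    }],
}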
4.5 app/main.py
python
import time
from typing import Optional, List

from fastapi import FastAPI, UploadFile, File, Form, HTTPException, Depends, Request
from fastapi.responses import JSONResponse

from .config import settings
from .schemas import ImageGenerateRequest
from .vllm_client import VLLMClient, build_edit_chat_payload
from .utils_images import bytes_to_data_url, guess_mime, strip_data_url_prefix

app = FastAPI(title="vLLM-Omni Image Gateway", version="1.0.0")

t2i = VLLMClient(settings.vllm_t2i_base, settings.http_timeout_s)
edit_single = VLLMClient(settings.vllm_edit_single_base, settings.http_timeout_s)
edit_multi = VLLMClient(settings.vllm_edit_multi_base, settings.http_timeout_s)


def auth(request: Request):
    # Simple Bearer token auth (swap in API keys / JWT / intranet allow-listing as needed)
    got = request.headers.get("Authorization", "")
    if settings.api_auth_token and got != f"Bearer {settings.api_auth_token}":
        raise HTTPException(status_code=401, detail="Unauthorized")


def _enforce_size(upload: UploadFile):
    # Placeholder: python-multipart does not expose the upload size directly, so the
    # endpoints below read the file into memory and enforce MAX_UPLOAD_MB there
    # (simple and reliable). In production, also cap the body at the edge
    # (nginx client_max_body_size) and stream large uploads to a temp file.
    pass


@app.get("/healthz")
async def healthz():
    return {"ok": True}


# 1) Text-to-image: OpenAI DALL·E-style API externally, forwarded to vLLM-Omni's
#    /v1/images/generations
#    (https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/image_generation_api/)
@app.post("/v1/images/generations")
async def images_generations(req: ImageGenerateRequest, _: None = Depends(auth)):
    payload = req.model_dump(exclude_none=True)
    # Merge vendor-specific extras into the top-level body (mirrors the OpenAI SDK's extra_body)
    payload.update(payload.pop("extra_body", None) or {})
    resp = await t2i.t2i_generate(payload)
    return JSONResponse(resp)


def _extract_first_image_data_url(chat_resp: dict) -> str:
    # vLLM-Omni's editing examples return the result as image_url parts (data URLs)
    # inside the assistant message content.
    try:
        content = chat_resp["choices"][0]["message"]["content"]
        # content is a list like [{"type": "image_url", "image_url": {"url": "data:..."}}, ...]
        for item in content:
            if item.get("type") == "image_url":
                return item["image_url"]["url"]
    except Exception:
        pass
    raise HTTPException(502, detail="Upstream response has no image_url")


# 2) Single-image edit: multipart upload of one image + prompt
@app.post("/v1/images/edits")
async def images_edits(
    _: None = Depends(auth),
    prompt: str = Form(...),
    image: UploadFile = File(...),
    size: str = Form("1024x1024"),
    num_inference_steps: int = Form(50),
    guidance_scale: float = Form(7.5),
    seed: Optional[int] = Form(None),
    negative_prompt: Optional[str] = Form(None),
    n: int = Form(1),
):
    img_bytes = await image.read()
    if len(img_bytes) > settings.max_upload_mb * 1024 * 1024:
        raise HTTPException(413, detail="Image too large")

    mime = guess_mime(image.filename, image.content_type)
    data_url = bytes_to_data_url(img_bytes, mime)

    payload = build_edit_chat_payload(
        prompt=prompt,
        image_data_urls=[data_url],
        size=size,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        seed=seed,
        negative_prompt=negative_prompt,
        num_outputs_per_prompt=n,
    )
    chat_resp = await edit_single.chat_completions(payload)
    out_data_url = _extract_first_image_data_url(chat_resp)

    # Return an OpenAI images-style response (b64_json)
    created = int(time.time())
    return {
        "created": created,
        "data": [{"b64_json": strip_data_url_prefix(out_data_url)}],
    }


# 3) Multi-image edit: multipart upload of several images + prompt
#    (e.g. fusing subjects or editing against multiple references)
@app.post("/v1/images/edits/multi")
async def images_edits_multi(
    _: None = Depends(auth),
    prompt: str = Form(...),
    images: List[UploadFile] = File(...),
    size: str = Form("1024x1024"),
    num_inference_steps: int = Form(50),
    guidance_scale: float = Form(7.5),
    seed: Optional[int] = Form(None),
    negative_prompt: Optional[str] = Form(None),
    n: int = Form(1),
):
    if not images:
        raise HTTPException(400, detail="No images uploaded")

    data_urls: List[str] = []
    total = 0
    for img in images:
        b = await img.read()
        total += len(b)
        if total > settings.max_upload_mb * 1024 * 1024:
            raise HTTPException(413, detail="Images too large (total)")
        mime = guess_mime(img.filename, img.content_type)
        data_urls.append(bytes_to_data_url(b, mime))

    payload = build_edit_chat_payload(
        prompt=prompt,
        image_data_urls=data_urls,
        size=size,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        seed=seed,
        negative_prompt=negative_prompt,
        num_outputs_per_prompt=n,
    )
    chat_resp = await edit_multi.chat_completions(payload)
    out_data_url = _extract_first_image_data_url(chat_resp)

    created = int(time.time())
    return {
        "created": created,
        "data": [{"b64_json": strip_data_url_prefix(out_data_url)}],
    }
⸻
- Invocation examples (verifying the pipeline)
5.1 Text-to-image
bash
curl http://localhost:9000/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer change-me" \
  -d '{
    "prompt": "Photorealistic shot of a cup of coffee on a wooden table, with the Chinese text: 早安",
    "n": 1,
    "size": "1024x1024",
    "response_format": "b64_json"
  }'
5.2 Single-image editing
bash
curl http://localhost:9000/v1/images/edits \
  -H "Authorization: Bearer change-me" \
  -F 'prompt=Turn this image into a watercolor painting while keeping the subject and composition unchanged' \
  -F 'size=1024x1024' \
  -F 'image=@./input.png'
5.3 Multi-image editing (upload multiple reference images)
bash
curl http://localhost:9000/v1/images/edits/multi \
  -H "Authorization: Bearer change-me" \
  -F 'prompt=Using the subjects and color palette of these reference images, generate a poster in a unified style' \
  -F 'size=1024x1024' \
  -F 'images=@./a.png' \
  -F 'images=@./b.png' \
  -F 'images=@./c.png'
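The same calls from Python, for convenience. A minimal client sketch using httpx (assumes the gateway at localhost:9000 with the default token from the compose file; file names are placeholders):
python
# client_example.py - minimal sketch of calling the gateway from Python.
import base64
import httpx

BASE = "http://localhost:9000"
HEADERS = {"Authorization": "Bearer change-me"}

def save_b64_image(resp_json: dict, out_path: str) -> None:
    # The gateway returns OpenAI images-style JSON: {"created": ..., "data": [{"b64_json": ...}]}
    b64 = resp_json["data"][0]["b64_json"]
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(b64))

with httpx.Client(base_url=BASE, headers=HEADERS, timeout=300) as client:
    # Text-to-image
    gen = client.post("/v1/images/generations", json={
        "prompt": "A cup of coffee on a wooden table, photorealistic",
        "n": 1,
        "size": "1024x1024",
        "response_format": "b64_json",
    })
    gen.raise_for_status()
    save_b64_image(gen.json(), "generated.png")

    # Single-image edit (multipart upload, mirrors the curl call above)
    with open("input.png", "rb") as img:
        edit = client.post(
            "/v1/images/edits",
            data={"prompt": "Turn this image into a watercolor painting", "size": "1024x1024"},
            files={"image": ("input.png", img, "image/png")},
        )
    edit.raise_for_status()
    save_b64_image(edit.json(), "edited.png")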
⸻
- Production hardening (you will need these sooner or later)
- Queueing: route generation/editing jobs through a queue (Celery / RQ / your own) so concurrent requests don't overwhelm the GPU; a minimal in-process sketch follows this list
- Rate limiting: weight requests by their cost (size, steps, n) and throttle dynamically
- Gateway timeout: HTTP_TIMEOUT_S of 120-300s is a reasonable range (depends on resolution / steps)
- Request body limits: nginx client_max_body_size plus the gateway's MAX_UPLOAD_MB as a double safeguard
- Multi-model scheduling: since vLLM-Omni runs one model per instance, you can:
• Multiple GPUs: pin one model per GPU (NVIDIA_VISIBLE_DEVICES in the compose file)
• Single GPU: start only the service you currently need, or split across machines
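As a stopgap before a real queue, you can cap in-flight GPU work directly in the gateway. A minimal sketch (assumption: a per-backend asyncio.Semaphore attached to the heavy endpoints as a FastAPI dependency; a real deployment would use a persistent queue with retries):
python
# Minimal in-process concurrency cap (sketch; not a substitute for a real queue).
import asyncio
from fastapi import Depends, HTTPException

# Allow at most 2 concurrent jobs per backend (tune to your GPU / model).
_t2i_slots = asyncio.Semaphore(2)
_edit_slots = asyncio.Semaphore(2)

def limit(sem: asyncio.Semaphore, max_wait_s: float = 30.0):
    async def _dep():
        try:
            # Wait briefly for a free slot; reject with 503 if the backend stays busy.
            await asyncio.wait_for(sem.acquire(), timeout=max_wait_s)
        except asyncio.TimeoutError:
            raise HTTPException(503, detail="Server busy, try again later")
        try:
            yield
        finally:
            sem.release()
    return _dep

# Usage on an endpoint, e.g.:
# @app.post("/v1/images/generations")
# async def images_generations(req: ImageGenerateRequest,
#                              _: None = Depends(auth),
#                              __: None = Depends(limit(_t2i_slots))):
#     ...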