Deploying Qwen-VL with TensorRT-LLM

First, set up the environment (omitted here) and download the Qwen2-VL-2B-Instruct model from Hugging Face.
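One possible way to fetch the checkpoint and sanity-check the directory before loading it is sketched below. It assumes the `huggingface_hub` package is installed and that the model is published under the repo id `Qwen/Qwen2-VL-2B-Instruct`. The helper names `download_model` and `missing_model_files`, and the list of expected files, are illustrative choices and not part of TensorRT-LLM.

```python
from pathlib import Path

# Files this sketch expects in a Hugging Face checkpoint directory.
# The list is an assumption for illustration, not an official manifest.
REQUIRED_FILES = ("config.json", "tokenizer_config.json", "preprocessor_config.json")


def download_model(repo_id: str, local_dir: str) -> str:
    """Download a model snapshot from the Hugging Face Hub into local_dir."""
    # Imported lazily so the checker below also works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def missing_model_files(model_dir: str) -> list[str]:
    """Return the names of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # At least one weight shard in safetensors format is expected as well.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

A typical use would be to call `download_model("Qwen/Qwen2-VL-2B-Instruct", "/docker_share/TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct")` once, then check that `missing_model_files(...)` returns an empty list before running the test script.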

Test code (the default prompt, 图片中有什么?, means "What is in the picture?"):

python
'''
Author: taifyang
Date: 2026-04-09 16:54:52
LastEditTime: 2026-05-12 16:47:14
Description: 
'''
import argparse
from pathlib import Path

from PIL import Image
from transformers import AutoProcessor

from tensorrt_llm import LLM
from tensorrt_llm.sampling_params import SamplingParams


def load_image(image_path: str) -> Image.Image:
    path = Path(image_path)
    if not path.exists():
        raise FileNotFoundError(f"图片不存在: {image_path}")
    img = Image.open(path).convert("RGB")
    print(f"成功加载图片: {path.name},尺寸: {img.size}")
    return img


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, default="/docker_share/TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct")
    parser.add_argument("--image_path", type=str, default="/docker_share/TensorRT-LLM-1.2.0/bus.jpg")
    parser.add_argument("--prompt", type=str, default="图片中有什么?")
    parser.add_argument("--max_tokens", type=int, default=512)
    parser.add_argument("--temperature", type=float, default=0.5)
    args = parser.parse_args()

    # 1. Load the processor (used to build the prompt)
    processor = AutoProcessor.from_pretrained(args.model_path, trust_remote_code=True)
    print(f"成功加载 processor: {args.model_path}")

    # 2. Load the TensorRT-LLM model
    print("正在加载 TensorRT-LLM 引擎...")
    llm = LLM(
        model=args.model_path,
        trust_remote_code=True,
        tokenizer=processor.tokenizer,
    )
    tokenizer = llm.tokenizer

    # 3. Load the raw image
    image = load_image(args.image_path)

    # 4. Build a prompt that contains the vision placeholder
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": args.prompt}
            ]
        }
    ]
    prompt_text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    print(f"构造的 prompt (前200字符): {prompt_text[:200]}...")

    # 5. Build the input: the image must be placed in a list
    inputs = [
        {
            "prompt": prompt_text,
            "multi_modal_data": {
                "image": [image]   # key fix: wrap the image in a list
            }
        }
    ]

    # 6. Generation parameters
    sampling_params = SamplingParams(
        max_tokens=args.max_tokens,
        temperature=args.temperature,
        top_p=0.9,
        stop_token_ids=[tokenizer.eos_token_id],
        stop=["<|im_end|>"],
    )

    # 7. Run inference
    print("\n开始生成...")
    outputs = llm.generate(inputs, sampling_params=sampling_params)

    # 8. Print the result
    print("\n===== 模型回答 =====")
    for output in outputs:
        print(output.outputs[0].text.strip())
    print("====================")


if __name__ == "__main__":
    main()
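The list wrapping in step 5 of the script is easy to forget once requests are built in more than one place. A small helper can normalize the input so that a single image and a list of images are both accepted. `build_inputs` is a hypothetical name, and the dictionary layout simply mirrors the one used in the script above rather than any documented schema.

```python
from typing import Any


def build_inputs(prompt_text: str, images: Any) -> list[dict]:
    """Wrap one prompt and its image(s) in the request layout used above."""
    # Accept either a single image or a list/tuple of images, and always
    # emit a list, because the script passes the image inside a list.
    if isinstance(images, (list, tuple)):
        image_list = list(images)
    else:
        image_list = [images]
    if not image_list:
        raise ValueError("at least one image is required")
    return [{"prompt": prompt_text, "multi_modal_data": {"image": image_list}}]
```

With this in place, `inputs = build_inputs(prompt_text, image)` would produce the same structure as the hand-written literal in the script.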

Test image: bus.jpg (810 × 1080 pixels).
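The single `<|image_pad|>` placeholder in the constructed prompt stands for many visual tokens once the image is preprocessed, so image size affects prompt length. The sketch below estimates that count for the 810 × 1080 test image. It assumes that Qwen2-VL rounds each side to a multiple of 28 pixels (14-pixel patches merged 2 × 2) and keeps the total area inside a pixel budget. The default limits used here are assumptions; the values that actually apply come from the checkpoint's preprocessor configuration.

```python
import math


def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 56 * 56,
                 max_pixels: int = 14 * 14 * 4 * 1280) -> tuple[int, int]:
    """Round both sides to multiples of `factor` within a pixel budget."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def image_token_count(height: int, width: int, factor: int = 28) -> int:
    """Estimate visual tokens: one per factor x factor pixel cell."""
    h_bar, w_bar = smart_resize(height, width, factor)
    return (h_bar // factor) * (w_bar // factor)


# bus.jpg is 810 pixels wide and 1080 pixels high.
print(smart_resize(1080, 810), image_token_count(1080, 810))
```

Under these assumptions the image would be resized to 1092 × 812 (height × width) and occupy 39 × 29 = 1131 visual tokens, which is comfortably inside the 32769-token maximum sequence length reported in the log.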

Output:

bash
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
    registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
  dispatch key: ADInplaceOrView
  previous kernel: no debug info
       new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
  self.m.impl(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.3 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.2.0
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
成功加载 processor: /docker_share/TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct
正在加载 TensorRT-LLM 引擎...
[05/12/2026-08:47:28] [TRT-LLM] [I] Using LLM with PyTorch backend
[05/12/2026-08:47:28] [TRT-LLM] [W] Using default gpus_per_node: 1
[05/12/2026-08:47:28] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen2_vl to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
`torch_dtype` is deprecated! Use `dtype` instead!
rank 0 using MpiPoolSession to spawn MPI processes
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
    registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
  dispatch key: ADInplaceOrView
  previous kernel: no debug info
       new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
  self.m.impl(
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package modelopt. Picked distribution: nvidia-modelopt
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.3 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.2.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:104: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  class ResponseFormat(OpenAIBaseModel):
`torch_dtype` is deprecated! Use `dtype` instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00,  1.63it/s]
Loading weights concurrently: 100%|██████████| 568/568 [00:20<00:00, 28.29it/s]
Model init total -- 65.20s
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 1025 [window size=32769], tokens per block=32, primary blocks=2048, secondary blocks=0, max sequence length=32769
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.75 GiB for max tokens in paged KV cache (65536).
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 59335680 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 42006528 bytes
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2048
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2052
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2049
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2050
[TensorRT-LLM][WARNING] [kv cache manager] storeContextBlocks: Can not find sequence for request 2051
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 1025 [window size=32769], tokens per block=32, primary blocks=5127, secondary blocks=0, max sequence length=32769
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.38 GiB for max tokens in paged KV cache (164064).
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 59335680 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 42006528 bytes
成功加载图片: bus.jpg,尺寸: (810, 1080)
构造的 prompt (前200字符): <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>图片中有什么?<|im_end|>
<|im_start|>assistant
...

开始生成...
/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/reductions.py:598: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if storage.is_cuda:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Processed requests: 100%|███████████████████████████| 1/1 [00:00<00:00,  1.13it/s]

===== 模型回答 =====
图片中有一辆蓝色的电动公交车,上面有"cero emisiones"(零排放)的标志。公交车停在街道上,周围有行人。背景是一栋黄色的建筑,窗户上有绿色的栏杆。
====================
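The prompt printed in the log shows what `apply_chat_template` produced for this single-image request. The same string can be rebuilt by hand, which may help when checking a template change or debugging without loading the processor. `build_qwen2vl_prompt` is a hypothetical helper that hard-codes the layout seen in the log. It is not the template's real implementation and covers only the one-image, single-turn case.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."
IMAGE_BLOCK = "<|vision_start|><|image_pad|><|vision_end|>"


def build_qwen2vl_prompt(user_text: str, system: str = DEFAULT_SYSTEM) -> str:
    """Rebuild the single-image chat prompt in the layout shown in the log."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        # The image placeholder comes directly before the user's text.
        f"<|im_start|>user\n{IMAGE_BLOCK}{user_text}<|im_end|>\n"
        # A trailing assistant header asks the model to start its answer.
        "<|im_start|>assistant\n"
    )


print(build_qwen2vl_prompt("图片中有什么?"))
```

Comparing this string with the output of `processor.apply_chat_template(...)` is a quick way to notice if a different checkpoint or library version changes the template.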

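The model's answer translates to: "The picture shows a blue electric bus with a 'cero emisiones' (zero emissions) sign on it. The bus is parked on the street, with pedestrians around it. In the background is a yellow building with green railings on its windows."

GPU memory usage is about 10 GB.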
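A sizeable share of that footprint is the paged KV cache: the log reports 32 tokens per block, an initial allocation of 2048 blocks (1.75 GiB) and a final allocation of 5127 blocks (4.38 GiB). The arithmetic below reproduces those figures. The model shape used (28 layers, 2 key/value heads, head dimension 128, 16-bit cache entries) is an assumption about Qwen2-VL-2B's language model and should be checked against the checkpoint's `config.json`; the function names are illustrative.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Keys and values each store num_kv_heads * head_dim numbers per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(num_blocks: int, tokens_per_block: int,
                 bytes_per_token: int) -> float:
    """Total size in GiB of a paged KV cache with the given block count."""
    return num_blocks * tokens_per_block * bytes_per_token / 2**30


# Assumed shape: 28 layers, 2 KV heads, head_dim 128, 2 bytes per value.
BYTES_PER_TOKEN = kv_bytes_per_token(28, 2, 128, 2)

print(2048 * 32, kv_cache_gib(2048, 32, BYTES_PER_TOKEN))
print(5127 * 32, round(kv_cache_gib(5127, 32, BYTES_PER_TOKEN), 2))
# Blocks needed to hold one sequence of the maximum length from the log.
print(math.ceil(32769 / 32))
```

Under these assumptions the token counts (65536 and 164064), the cache sizes (1.75 GiB and 4.38 GiB) and the 1025 blocks per sequence all agree with the log lines.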
