vLLM Notes (1): The Most Basic Offline Inference

vLLM is a well-known inference framework, and this post records my first attempt at getting to know it.

The code comes from https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic/basic.py

# LLM is the main class for running offline inference with the vLLM engine.
# SamplingParams specifies the parameters of the sampling process.
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


def main():
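    # Create the language model object llm.
    llm = LLM(model="facebook/opt-125m")
    # Generate text from the prompts.
    # The output is a list of RequestOutput objects that contain the prompt,
    # the generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.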
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()

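Note that by default the model is loaded from Hugging Face. If your network cannot reach Hugging Face, download the model to local disk first and then change model= to the local model path (vLLM supports models in safetensors or GGUF format).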
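One way to handle the offline case is to prefer a local directory when it exists and fall back to the Hub ID otherwise. This is only a minimal sketch: the helper name resolve_model and the example directory are hypothetical and not part of vLLM.

```python
import os
from typing import Optional


def resolve_model(hub_id: str, local_dir: Optional[str]) -> str:
    """Return local_dir if it is an existing directory, else the Hub model ID."""
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    return hub_id


# Hypothetical local path; replace it with wherever the model was downloaded.
model_arg = resolve_model("facebook/opt-125m", "/path/to/local/opt-125m")
print(model_arg)
```

The returned string can then be passed as LLM(model=model_arg).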

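Results (I used a local model that happened to be available, so the generated text is not entirely coherent; the outputs also stop mid-sentence because max_tokens defaults to 16)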

INFO 01-06 14:12:05 [utils.py:253] non-default args: {'disable_log_stats': True, 'model': '/home/huangxy/models/SmolLM3-3B'}
INFO 01-06 14:12:25 [model.py:514] Resolved architecture: SmolLM3ForCausalLM
INFO 01-06 14:12:25 [model.py:1661] Using max model len 65536
INFO 01-06 14:12:27 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=1351534) INFO 01-06 14:12:30 [core.py:93] Initializing a V1 LLM engine (v0.13.0) with config: model='/home/huangxy/models/SmolLM3-3B', speculative_config=None, tokenizer='/home/huangxy/models/SmolLM3-3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=/home/huangxy/models/SmolLM3-3B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 
'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=1351534) INFO 01-06 14:12:32 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.187:40795 backend=nccl
(EngineCore_DP0 pid=1351534) INFO 01-06 14:12:32 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1351534) INFO 01-06 14:12:34 [gpu_model_runner.py:3562] Starting to load model /home/huangxy/models/SmolLM3-3B...
(EngineCore_DP0 pid=1351534) INFO 01-06 14:12:35 [base.py:131] Using Transformers modeling backend.
(EngineCore_DP0 pid=1351534) INFO 01-06 14:13:45 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:05<00:05,  5.81s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:07<00:00,  3.50s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:07<00:00,  3.84s/it]
(EngineCore_DP0 pid=1351534) 
(EngineCore_DP0 pid=1351534) INFO 01-06 14:13:53 [default_loader.py:308] Loading weights took 7.82 seconds
(EngineCore_DP0 pid=1351534) INFO 01-06 14:13:55 [gpu_model_runner.py:3659] Model loading took 5.8246 GiB memory and 78.636008 seconds
(EngineCore_DP0 pid=1351534) INFO 01-06 14:14:25 [backends.py:643] Using cache directory: /home/huangxy/.cache/vllm/torch_compile_cache/6ad522e62d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1351534) INFO 01-06 14:14:25 [backends.py:703] Dynamo bytecode transform time: 29.10 s
(EngineCore_DP0 pid=1351534) /home/huangxy/Programs/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=1351534)   return torch._C._get_cublas_allow_tf32()
(EngineCore_DP0 pid=1351534) /home/huangxy/Programs/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:312: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
(EngineCore_DP0 pid=1351534)   warnings.warn(
(EngineCore_DP0 pid=1351534) INFO 01-06 14:14:56 [backends.py:261] Cache the graph of compile range (1, 8192) for later use
(EngineCore_DP0 pid=1351534) INFO 01-06 14:15:15 [backends.py:278] Compiling a graph for compile range (1, 8192) takes 37.05 s
(EngineCore_DP0 pid=1351534) INFO 01-06 14:15:15 [monitor.py:34] torch.compile takes 66.15 s in total
(EngineCore_DP0 pid=1351534) INFO 01-06 14:15:18 [gpu_worker.py:375] Available KV cache memory: 14.25 GiB
(EngineCore_DP0 pid=1351534) INFO 01-06 14:15:20 [kv_cache_utils.py:1291] GPU KV cache size: 207,536 tokens
(EngineCore_DP0 pid=1351534) INFO 01-06 14:15:20 [kv_cache_utils.py:1296] Maximum concurrency for 65,536 tokens per request: 3.17x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████| 51/51 [00:07<00:00,  6.68it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████████████████████████| 35/35 [00:05<00:00,  6.69it/s]
(EngineCore_DP0 pid=1351534) INFO 01-06 14:15:36 [gpu_model_runner.py:4587] Graph capturing finished in 16 secs, took 0.72 GiB
(EngineCore_DP0 pid=1351534) INFO 01-06 14:15:36 [core.py:259] init engine (profile, create kv cache, warmup model) took 100.49 seconds
INFO 01-06 14:15:38 [llm.py:360] Supported tasks: ['generate']
Adding requests: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 358.67it/s]
Processed prompts: 100%|███████| 4/4 [00:00<00:00, 11.01it/s, est. speed input: 60.64 toks/s, output: 176.39 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    ' Helen and I am a dorky southern girl with an undying love for'
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' elected to serve for four years, and the vice president is elected to serve for'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' Paris, located on the banks of the Seine River. Paris is a global'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' here now. Artificial Intelligence and machine learning are increasingly being used in business operations,'
------------------------------------------------------------

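1️⃣ Model loading

Model type: SmolLM3ForCausalLM (from the log line Resolved architecture: SmolLM3ForCausalLM)
Source: local directory /home/huangxy/models/SmolLM3-3B
Weight format: sharded safetensors plus the shard index model.safetensors.index.json
Load time: about 78.6 seconds in total (reading the weights took 7.82 seconds), using about 5.8 GiB of GPU memory
Tokenizer: initialized automatically from the tokenizer files in the same directory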

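2️⃣ vLLM internal optimizations

Chunked prefill: enabled, with max_num_batched_tokens=8192 → the prefill of a long prompt is split into chunks so that a single scheduling step processes at most 8192 tokens, which bounds per-step memory use
CUDA graph capture:
- Capturing CUDA graphs (mixed prefill-decode, PIECEWISE)
- Capturing CUDA graphs (decode, FULL)

  This shows that vLLM uses CUDA Graphs to speed up inference (per the log, capturing took 16 seconds and 0.72 GiB)
KV Cache:
- Available KV cache memory: 14.25 GiB
- GPU KV cache size: 207,536 tokens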
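The log also reports "Maximum concurrency for 65,536 tokens per request: 3.17x". That figure can be reproduced from the numbers above; the small calculation below uses only values copied from the log, and the per-token estimate is approximate because 14.25 GiB is already rounded.

```python
# Numbers copied from the log above.
kv_cache_tokens = 207_536   # "GPU KV cache size: 207,536 tokens"
max_model_len = 65_536      # "Using max model len 65536"
kv_cache_gib = 14.25        # "Available KV cache memory: 14.25 GiB"

# How many maximum-length requests fit in the KV cache at the same time.
max_concurrency = kv_cache_tokens / max_model_len

# Approximate KV cache cost of one token, in KiB.
kib_per_token = kv_cache_gib * 1024 * 1024 / kv_cache_tokens

print(f"max concurrency: {max_concurrency:.2f}x")
print(f"KV cache per token: about {kib_per_token:.0f} KiB")
```

Shorter requests occupy fewer cache slots, so in practice many more than three requests can run concurrently.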

Constructing SamplingParams

class SamplingParams(
    n: int = 1,
    presence_penalty: float = 0,
    frequency_penalty: float = 0,
    repetition_penalty: float = 1,
    temperature: float = 1,
    top_p: float = 1,
    top_k: int = 0,
    min_p: float = 0,
    seed: int | None = None,
    stop: str | list[str] | None = None,
    stop_token_ids: list[int] | None = None,
    ignore_eos: bool = False,
    max_tokens: int | None = 16,
    min_tokens: int = 0,
    logprobs: int | None = None,
    prompt_logprobs: int | None = None,
    flat_logprobs: bool = False,
    detokenize: bool = True,
    skip_special_tokens: bool = True,
    spaces_between_special_tokens: bool = True,
    logits_processors: Any | None = None,
    include_stop_str_in_output: bool = False,
    truncate_prompt_tokens: int | None = None,
    output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE,
    output_text_buffer_length: int = 0,
    _all_stop_token_ids: set[int] = set,
    structured_outputs: StructuredOutputsParams | None = None,
    logit_bias: dict[int, float] | None = None,
    allowed_token_ids: list[int] | None = None,
    extra_args: dict[str, Any] | None = None,
    bad_words: list[str] | None = None,
    _bad_words_token_ids: list[list[int]] | None = None,
    skip_reading_prefix_cache: bool | None = None
)

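Number of outputs

n: int = 1

The number of outputs to generate for each prompt.
Notes:

- AsyncLLM streams its outputs by default.
- When n > 1, all n outputs are generated and streamed back cumulatively.
- To see all n complete outputs at once, set output_kind=RequestOutputKind.FINAL_ONLY.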

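Repetition and penalties

presence_penalty: float = 0.0

Penalizes a new token based on whether it has already appeared in the generated text.
- Greater than 0 → encourages new tokens
- Less than 0 → encourages repeated tokens

frequency_penalty: float = 0.0

Penalizes a new token based on how often it has appeared in the generated text so far.
- Greater than 0 → encourages new tokens
- Less than 0 → encourages repeated tokens

repetition_penalty: float = 1.0

Penalizes a new token based on whether it has appeared in the prompt or in the generated text.
- Greater than 1 → encourages new tokens
- Less than 1 → encourages repeated tokens

Summary: these three parameters control repetition and diversity, which keeps the model from getting stuck in a loop or producing overly repetitive output.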
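To make the difference between the three penalties concrete, here is a pure-Python sketch of the commonly used formulation: presence and frequency penalties are subtracted from the logits of tokens in the generated text, while the repetition penalty divides positive logits (and multiplies negative ones) for tokens seen in the prompt or the output. This is an illustration under those assumptions, not vLLM's actual implementation; apply_penalties is a hypothetical name.

```python
from collections import Counter
from typing import List


def apply_penalties(
    logits: List[float],
    prompt_ids: List[int],
    output_ids: List[int],
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> List[float]:
    """Return a penalized copy of logits (list index = token id)."""
    out = list(logits)
    # repetition_penalty considers tokens seen in the prompt OR the output.
    for tok in set(prompt_ids) | set(output_ids):
        if out[tok] > 0:
            out[tok] /= repetition_penalty
        else:
            out[tok] *= repetition_penalty
    # presence/frequency penalties only consider the generated tokens.
    for tok, count in Counter(output_ids).items():
        out[tok] -= frequency_penalty * count  # grows with each repetition
        out[tok] -= presence_penalty           # flat, applied once
    return out


logits = [2.0, 1.0, -1.0, 0.5]
print(apply_penalties(logits, prompt_ids=[0], output_ids=[1, 1, 2],
                      presence_penalty=0.5, frequency_penalty=0.25,
                      repetition_penalty=2.0))
```

Token 3 never appeared, so its logit is untouched; token 0 appeared only in the prompt, so only the repetition penalty applies to it.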

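Sampling randomness

temperature: float = 1.0

Controls the randomness of generation.
- Lower values → more deterministic output
- Higher values → more random output
- 0 → greedy sampling

top_p: float = 1.0

Nucleus sampling: only the most likely tokens whose cumulative probability reaches top_p are considered. Must be in (0, 1].
- Set to 1 → consider all tokens

top_k: int = 0

Top-k sampling: only the k most likely tokens are considered.
- 0 or -1 → consider all tokens

min_p: float = 0.0

The minimum probability a token needs in order to be considered, relative to the probability of the most likely token.
- 0 → disables the minimum probability filter

seed: int | None = None

Random seed, used to make generation reproducible.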
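The interaction of temperature, top_k, top_p, min_p, and seed can be sketched in plain Python. The order in which the filters are applied and the renormalization details below are assumptions chosen for readability; vLLM's sampler may combine them differently. The function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional


def softmax(logits: List[float], temperature: float) -> List[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_probs(probs: List[float], top_k: int = 0, top_p: float = 1.0,
                 min_p: float = 0.0) -> Dict[int, float]:
    """Return {token_id: renormalized probability} for surviving tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:                      # top-k: keep the k most likely tokens
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:                    # top-p: smallest prefix reaching top_p
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    threshold = min_p * probs[order[0]]  # min-p: relative to the best token
    kept = [i for i in kept if probs[i] >= threshold]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample(logits: List[float], temperature: float = 1.0, top_k: int = 0,
           top_p: float = 1.0, min_p: float = 0.0,
           seed: Optional[int] = None) -> int:
    if temperature == 0:               # greedy: always pick the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    candidates = filter_probs(softmax(logits, temperature), top_k, top_p, min_p)
    ids = list(candidates)
    rng = random.Random(seed)          # a fixed seed gives a repeatable draw
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]


print(sample([1.0, 3.0, 0.5], temperature=0.8, top_p=0.95, seed=0))
```

With temperature=0.8 and top_p=0.95, as in the example script, low-probability tail tokens are cut off and the remaining distribution is slightly sharper than the raw one.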

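Stop conditions

stop: str | list[str] | None = None

A string or list of strings that stop generation. Generation ends when a stop string is produced, and by default the output does not contain the stop string.

stop_token_ids: list[int] | None = None

Token IDs that stop generation. The output contains the stop token unless it is a special token.

ignore_eos: bool = False

Whether to ignore the EOS token and keep generating.

include_stop_str_in_output: bool = False

Whether to include the stop string in the output.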
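The effect of stop and include_stop_str_in_output on the returned text can be sketched as a post-processing step on a finished string. This is a simplification: a real engine checks for stop strings incrementally during decoding. apply_stop is a hypothetical helper name.

```python
from typing import List, Optional, Tuple, Union


def apply_stop(text: str, stop: Union[str, List[str], None],
               include_stop_str_in_output: bool = False) -> Tuple[str, bool]:
    """Cut text at the earliest stop string; return (text, stopped)."""
    stops = [stop] if isinstance(stop, str) else list(stop or [])
    earliest: Optional[Tuple[int, str]] = None
    for s in stops:
        idx = text.find(s)
        if idx != -1 and (earliest is None or idx < earliest[0]):
            earliest = (idx, s)
    if earliest is None:
        return text, False
    idx, s = earliest
    end = idx + len(s) if include_stop_str_in_output else idx
    return text[:end], True


print(apply_stop(" Paris.\nThe capital of Italy", stop="\n"))
print(apply_stop(" Paris.\nThe capital of Italy", stop="\n",
                 include_stop_str_in_output=True))
```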

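Generation length

max_tokens: int | None = 16

The maximum number of tokens to generate per output.

min_tokens: int = 0

Until at least min_tokens tokens have been generated, generation does not stop even if EOS or one of stop_token_ids is encountered.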
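The following toy loop shows how max_tokens, min_tokens, and ignore_eos interact. It replays a scripted token stream instead of running a model, and it simply skips an early EOS, whereas a real engine would prevent EOS from being sampled in the first place. The function name and the scripted stream are hypothetical.

```python
from typing import Iterable, List, Tuple


def generate_length(token_stream: Iterable[int], eos_id: int,
                    max_tokens: int = 16, min_tokens: int = 0,
                    ignore_eos: bool = False) -> Tuple[List[int], str]:
    """Collect tokens until EOS (after min_tokens) or max_tokens is hit."""
    out: List[int] = []
    for tok in token_stream:
        if tok == eos_id and not ignore_eos:
            if len(out) >= min_tokens:
                return out, "stop"      # EOS honored
            continue                    # too early: EOS is suppressed
        out.append(tok)
        if len(out) >= max_tokens:
            return out, "length"        # hit the max_tokens limit
    return out, "exhausted"             # scripted stream ran out


stream = [5, 0, 6, 7, 0, 8]             # token id 0 plays the role of EOS
print(generate_length(stream, eos_id=0, min_tokens=2))
print(generate_length(stream, eos_id=0, max_tokens=3, ignore_eos=True))
```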

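Probabilities / output information

logprobs: int | None = None

The number of log probabilities to return for each generated token.
- None → nothing is returned
- Non-None → returns the log probabilities of the specified number of most likely tokens, plus the log probability of the generated token
- -1 → returns the log probabilities of the entire vocabulary

prompt_logprobs: int | None = None

The number of log probabilities to return for each prompt token. -1 → returns all vocab_size log probabilities.

flat_logprobs: bool = False

Whether to return logprobs in flattened form (FlatLogprob), which performs better and costs less GC time.

detokenize: bool = True

Whether to decode the tokens into an output string.

skip_special_tokens: bool = True

Whether to skip special tokens (such as <pad> and <eos>) in the output.

spaces_between_special_tokens: bool = True

Whether to add spaces between special tokens in the output.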
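Because the generated token's log probability is always included, a request with logprobs=k can yield up to k+1 entries per position. The sketch below illustrates that rule on a toy logit vector; token_logprobs is a hypothetical helper, and the returned dict is a simplified stand-in for vLLM's actual logprob objects.

```python
import math
from typing import Dict, List


def token_logprobs(logits: List[float], sampled_id: int,
                   logprobs: int) -> Dict[int, float]:
    """Top-`logprobs` log probabilities plus the sampled token's own."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    all_lp = [x - log_z for x in logits]          # log-softmax
    k = len(logits) if logprobs == -1 else logprobs
    top = sorted(range(len(logits)), key=lambda i: all_lp[i], reverse=True)[:k]
    result = {i: all_lp[i] for i in top}
    result[sampled_id] = all_lp[sampled_id]       # always reported
    return result


logits = [2.0, 1.0, 0.0, -1.0]
print(len(token_logprobs(logits, sampled_id=3, logprobs=2)))  # sampled token outside the top 2
print(len(token_logprobs(logits, sampled_id=0, logprobs=2)))  # sampled token inside the top 2
```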

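Logits / structured output

logits_processors: Any | None = None

A list of functions that modify the logits based on the tokens generated so far; useful for custom sampling strategies.

structured_outputs: StructuredOutputsParams | None = None

Configures structured output, that is, output constrained to a parseable format such as JSON.

logit_bias: dict[int, float] | None = None

Adds a bias to the logits of the specified tokens (to encourage or discourage generating them).

allowed_token_ids: list[int] | None = None

Only the tokens in this list may be generated; all other tokens are masked out.

bad_words: list[str] | None = None

A list of words that must not be generated (more precisely, only the last token that would complete such a word is blocked).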
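Both logit_bias and allowed_token_ids act directly on the logit vector before sampling. A minimal sketch of the idea, assuming a plain Python list indexed by token id; constrain_logits is a hypothetical name and not a vLLM function:

```python
import math
from typing import Dict, List, Optional


def constrain_logits(logits: List[float],
                     logit_bias: Optional[Dict[int, float]] = None,
                     allowed_token_ids: Optional[List[int]] = None) -> List[float]:
    """Apply an additive bias, then mask every token not in the allow-list."""
    out = list(logits)
    for tok, bias in (logit_bias or {}).items():
        out[tok] += bias                       # positive bias encourages a token
    if allowed_token_ids is not None:
        allowed = set(allowed_token_ids)
        # A logit of -inf gives the token zero probability after softmax.
        out = [x if i in allowed else -math.inf for i, x in enumerate(out)]
    return out


print(constrain_logits([1.0, 2.0, 3.0], logit_bias={0: 5.0},
                       allowed_token_ids=[0, 1]))
```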

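Prompt / prefix cache

truncate_prompt_tokens: int | None = None

How to truncate a prompt that is too long.
- -1 → use the truncation size supported by the model
- k → keep only the last k tokens
- None → no truncation

skip_reading_prefix_cache: bool | None = None

Whether to skip reading from the prefix cache for this request (the prefix cache is a performance optimization that reuses KV cache entries).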
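The three truncate_prompt_tokens cases amount to left truncation of the token list. A minimal sketch, where truncate_prompt is a hypothetical helper and max_model_len stands in for "the truncation size supported by the model":

```python
from typing import List, Optional


def truncate_prompt(token_ids: List[int],
                    truncate_prompt_tokens: Optional[int],
                    max_model_len: int) -> List[int]:
    """Keep only the last k prompt tokens (left truncation)."""
    if truncate_prompt_tokens is None:
        return list(token_ids)                 # truncation disabled
    k = max_model_len if truncate_prompt_tokens == -1 else truncate_prompt_tokens
    if k < 1:
        raise ValueError("truncate_prompt_tokens must be -1 or >= 1")
    return list(token_ids[-k:])                # drop tokens from the front


prompt = [11, 12, 13, 14, 15]
print(truncate_prompt(prompt, 2, max_model_len=3))
print(truncate_prompt(prompt, -1, max_model_len=3))
```

Keeping the tail rather than the head preserves the part of the prompt closest to where generation begins.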

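Output kind

output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE

Form of the output:
- CUMULATIVE → each update returns all text generated so far
- DELTA → each update returns only the newly generated text
- FINAL_ONLY → only the final result is returned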
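What a streaming consumer sees under each output kind can be sketched with a toy stream of text chunks. Plain strings stand in for the RequestOutputKind enum members here, the DELTA member (incremental updates) is included for comparison, and emit is a hypothetical function rather than vLLM code.

```python
from typing import List


def emit(chunks: List[str], output_kind: str) -> List[str]:
    """Return the sequence of updates a consumer would receive."""
    text = ""
    updates: List[str] = []
    for i, chunk in enumerate(chunks):
        text += chunk
        is_last = i == len(chunks) - 1
        if output_kind == "CUMULATIVE":
            updates.append(text)        # everything generated so far
        elif output_kind == "DELTA":
            updates.append(chunk)       # only the new piece
        elif output_kind == "FINAL_ONLY" and is_last:
            updates.append(text)        # a single update at the end
    return updates


chunks = [" Pa", "ris", "."]
for kind in ("CUMULATIVE", "DELTA", "FINAL_ONLY"):
    print(kind, emit(chunks, kind))
```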

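Constructing LLM

LLM is the core vLLM class for text generation. Its main responsibilities:

Managing the model and tokenizer
- Automatically loads a HuggingFace Transformers model (from a local path or a remote repository)
- Manages the tokenizer, which converts text into tokens (and decodes tokens back into text)
Managing GPU memory and the KV cache
- The KV cache stores the attention keys and values of tokens that were already processed, which speeds up autoregressive generation
- Supports distributed execution across multiple GPUs (tensor parallel, pipeline parallel)
Efficient batching and inference
- Given a batch of prompts and sampling parameters (SamplingParams), generates text using intelligent batching and memory management
- LLM itself targets offline batched generation; streaming generation is handled by AsyncLLM

In other words, LLM is vLLM's "inference engine": you give it prompts plus sampling parameters and it returns the generated text, while the lower layers take care of GPU memory allocation, KV cache management, multi-GPU parallelism, and similar optimizations.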

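| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | str | HuggingFace model name or local path | required |
| `tokenizer` | str \| None | HuggingFace tokenizer name or path | None |
| `tokenizer_mode` | TokenizerMode \| str | "auto" → use the fast tokenizer if available; "slow" → always use the slow tokenizer | 'auto' |
| `skip_tokenizer_init` | bool | Whether to skip tokenizer initialization (inputs must then be passed as token IDs) | False |
| `trust_remote_code` | bool | Whether to trust remote code (for example custom model code in a HuggingFace repo) | False |
| `allowed_local_media_path` | str | Local path from which media files may be read (a security risk) | '' |
| `allowed_media_domains` | list[str] \| None | URL domains from which media may be used | None |
| `tensor_parallel_size` | int | Number of GPUs used for tensor parallelism | 1 |
| `dtype` | ModelDType | dtype of the model weights and activations (float32/float16/bfloat16/auto) | 'auto' |
| `quantization` | QuantizationMethods \| None | Quantization method of the model, such as awq, gptq, fp8 | None |
| `revision` | str \| None | Model version (branch/tag/commit id) | None |
| `tokenizer_revision` | str \| None | Tokenizer version | None |
| `seed` | int | Random seed, for reproducible generation | 0 |
| `gpu_memory_utilization` | float | Fraction of GPU memory to use (weights + activations + KV cache); a higher value gives a larger KV cache but may cause OOM | 0.9 |
| `kv_cache_memory_bytes` | int \| None | KV cache size per GPU in bytes; takes precedence over `gpu_memory_utilization` | None |
| `swap_space` | float | Size of CPU memory used as swap space (GiB), for requests with best_of>1 | 4 |
| `cpu_offload_gb` | float | Size of CPU memory (GiB) used to offload model weights (virtually increases GPU memory) | 0 |
| `enforce_eager` | bool | Whether to force eager execution (no CUDA Graphs) | False |
| `disable_custom_all_reduce` | bool | Distributed setting: whether to disable the custom all-reduce | False |
| `hf_token` | bool \| str \| None | HuggingFace token, used for downloading private models | None |
| `hf_overrides` | HfOverrides \| None | Overrides for the HuggingFace config | None |
| `mm_processor_kwargs` | dict[str, Any] \| None | Arguments for multimodal input processing (such as the image processor) | None |
| `pooler_config` | PoolerConfig \| None | Pooler configuration | None |
| `compilation_config` | int \| dict \| CompilationConfig \| None | torch.compile optimization configuration | None |
| `attention_config` | dict \| AttentionConfig \| None | Attention configuration (for example FLASH/FLASHINFER/FLEX) | None |
| `**kwargs` | Any | Other engine arguments | {} |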
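The gpu_memory_utilization row can be read as a simple budget: the fraction of total GPU memory that vLLM may use, minus weights and peak activations, is what remains for the KV cache. The arithmetic below is a rough sketch of that idea. The 24 GiB total and the 1.5 GiB activation figure are hypothetical values chosen for illustration (the log does not state them); only the weight size comes from the log.

```python
def kv_cache_budget_gib(total_gpu_gib: float, gpu_memory_utilization: float,
                        weights_gib: float, activation_peak_gib: float) -> float:
    """Rough estimate of the memory left over for the KV cache, in GiB."""
    budget = total_gpu_gib * gpu_memory_utilization
    return budget - weights_gib - activation_peak_gib


# Hypothetical GPU size and activation peak; weights rounded from the log.
estimate = kv_cache_budget_gib(total_gpu_gib=24.0, gpu_memory_utilization=0.9,
                               weights_gib=5.82, activation_peak_gib=1.5)
print(f"estimated KV cache budget: {estimate:.2f} GiB")
```

Raising gpu_memory_utilization enlarges this budget directly, which is why it trades KV cache capacity against the risk of running out of memory.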

The generate method of LLM

def generate(
        self,
        prompts: PromptType | Sequence[PromptType],
        sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
        *,
        use_tqdm: bool | Callable[..., tqdm] = True,
        lora_request: list[LoRARequest] | LoRARequest | None = None,
        priority: list[int] | None = None,
    ) -> list[RequestOutput]:

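| Parameter | Type | Description |
| --- | --- | --- |
| `prompts` | PromptType or a sequence | The input text or sequence of prompts. Pass a list for batched generation; results come back in the same order as the inputs. |
| `sampling_params` | SamplingParams or a sequence | Sampling parameters for generation. None → use the default parameters; a single value → all prompts share it; a list → one entry per prompt. |
| `use_tqdm` | bool or callable | Whether to show a progress bar, or a custom function that creates one |
| `lora_request` | LoRARequest or a list | LoRA adapter request(s) to apply during generation, allowing a lightweight adjustment of the model weights at inference time |
| `priority` | list[int] | Priority of each prompt; only takes effect when priority scheduling is enabled |
| Return value | `list[RequestOutput]` | One result object per prompt, containing the generated text, token information, logprobs, and so on |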
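The three accepted forms of sampling_params (None, a single object, or one per prompt) can be sketched as a small pairing step. Plain dicts stand in for SamplingParams objects, and pair_params is a hypothetical helper; requiring equal lengths for the list form is an assumption of this sketch.

```python
from typing import Any, List, Optional, Sequence, Tuple, Union


def pair_params(prompts: Sequence[str],
                sampling_params: Union[Any, Sequence[Any], None],
                default: Any) -> List[Tuple[str, Any]]:
    """Match every prompt with the sampling parameters it will use."""
    if sampling_params is None:
        per_prompt = [default] * len(prompts)          # fall back to defaults
    elif isinstance(sampling_params, (list, tuple)):
        if len(sampling_params) != len(prompts):
            raise ValueError("need exactly one sampling_params entry per prompt")
        per_prompt = list(sampling_params)             # one entry per prompt
    else:
        per_prompt = [sampling_params] * len(prompts)  # shared by all prompts
    return list(zip(prompts, per_prompt))


prompts = ["Hello, my name is", "The capital of France is"]
shared = {"temperature": 0.8, "top_p": 0.95}
for prompt, params in pair_params(prompts, shared, default={}):
    print(repr(prompt), params)
```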