Table of contents
- 1 Overall structure
  - 1.1 Modules
  - 1.2 Surrounding components
  - 1.3 Optimizations
- 2 Modules
  - 2.1 Entrypoint
  - 2.2 Engine
  - 2.3 Scheduler
  - 2.4 KV Cache manager
  - 2.5 Evictor
  - 2.6 Worker
  - 2.7 Model executor
  - 2.8 Modelling
  - 2.9 Attention backend
- References
This post is a set of quick notes taken while watching the tutorial video [EP01][精剪版] vllm源码讲解,基本代码结构 ("vLLM source code walkthrough: basic code structure"); if you are interested, go watch the original video directly.
1 Overall structure
1.1 Modules
- Entrypoint (LLM, API server)
- Engine
- Scheduler
- KV cache manager
- Worker
- Model executor (Model runner)
- Modelling
- Attention backend
1.2 Surrounding components
- Preprocessing / Postprocessing (tokenizer, detokenizer, sampler, multimodal processor)
- Distributed
- torch.compile
- Observability
- Config
- Testing
- CI / CD
- Formatting
1.3 Optimizations
- Speculative decoding
- Disaggregated prefilling
- Chunked prefill
- Cascade inference
2 Modules
2.1 Entrypoint
For newcomers, it is helpful to start by reading the examples under the examples folder.
./vllm/examples/offline_inference/basic/basic.py
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
    "Hello, my name is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)


def main():
    # Create an LLM.
    llm = LLM(
        model="/volume/vllm_20260124/models/Qwen/Qwen3-8B/",
        tensor_parallel_size=1,
        dtype="float16",
        trust_remote_code=True,
        enforce_eager=True,
        block_size=16,
        enable_prefix_caching=False,
    )
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Output: {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()
```
The LLM class can be found in vllm/vllm/entrypoints/llm.py.
The API server can be found in vllm/vllm/entrypoints/api_server.py.
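As a quick illustration of the API server entrypoint, here is a minimal client sketch. It assumes the demo server from vllm/vllm/entrypoints/api_server.py has already been started locally (for example with `python -m vllm.entrypoints.api_server --model <path>`) and exposes the /generate endpoint defined in that file; the exact request fields are assumptions and may vary by version.

```python
# Minimal client sketch for the demo API server (assumption: the server from
# vllm/vllm/entrypoints/api_server.py is running on localhost:8000 and its
# /generate endpoint accepts a prompt plus sampling parameters).
import requests

payload = {
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.8,
    "top_p": 0.95,
}
resp = requests.post("http://localhost:8000/generate", json=payload)
resp.raise_for_status()
print(resp.json())  # a dict containing the generated text(s)
```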
2.2 Engine
The engine is the part that does the actual work; it lives in vllm/vllm/engine.
vllm/vllm/engine/llm_engine.py is where the real work happens, while vllm/vllm/engine/async_llm_engine.py wraps it in an asynchronous shell.
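To make this concrete, here is a minimal sketch of driving LLMEngine directly instead of going through the LLM wrapper, in the spirit of vLLM's llm_engine_example.py; the model path is a placeholder, and the exact interfaces may differ between vLLM versions.

```python
# Sketch: using LLMEngine directly (based on the offline llm_engine example;
# the model path is a placeholder).
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="/path/to/model"))
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)

# Queue a couple of requests, then step the engine until everything finishes.
engine.add_request("req-0", "Hello, my name is", sampling_params)
engine.add_request("req-1", "The capital of France is", sampling_params)

while engine.has_unfinished_requests():
    # Each step() call runs one scheduling + inference iteration and
    # returns the outputs produced in that iteration.
    for output in engine.step():
        if output.finished:
            print(output.request_id, output.outputs[0].text)
```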
2.3 Scheduler
The scheduler lives in vllm/vllm/core.
One pass of model inference is called a step. A single LLM request goes through many inference passes; producing one token corresponds to one such pass.
The scheduler's job is to decide which requests are put into each step.
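The following is only a toy illustration of that idea (my own hypothetical code, not vLLM's Scheduler): each step admits as many requests as fit, runs one inference over them, and a request keeps coming back until it has produced all of its tokens.

```python
# Toy illustration of the "step" idea (hypothetical code, not vLLM's real Scheduler).
from collections import deque

# Each request needs a different number of decode steps (i.e. output tokens).
waiting = deque([("req-0", 3), ("req-1", 5), ("req-2", 2)])
running = []          # requests included in the current step
MAX_BATCH = 2         # pretend a step can only hold two requests

step_id = 0
while waiting or running:
    # Admit as many waiting requests as fit into this step.
    while waiting and len(running) < MAX_BATCH:
        running.append(list(waiting.popleft()))
    # One step = one model inference over the batch, producing one token
    # per running request (simulated here by decrementing a counter).
    for req in running:
        req[1] -= 1
    print(f"step {step_id}: ran {[r[0] for r in running]}")
    running = [r for r in running if r[1] > 0]
    step_id += 1
```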
2.4 KV Cache manager
The code is in vllm/vllm/core/block_manager.py.
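As a rough illustration of the paged KV-cache idea that the block manager implements, here is a toy sketch (hypothetical code, not the real BlockSpaceManager): KV entries are stored in fixed-size blocks, matching the block_size=16 used in the example above, and a request only holds as many blocks as it currently needs.

```python
# Toy illustration of block-based KV cache allocation (hypothetical, not the
# real vllm/vllm/core/block_manager.py).
BLOCK_SIZE = 16          # tokens per KV cache block, matching block_size=16 above
NUM_BLOCKS = 8           # pretend the GPU only has room for 8 blocks

free_blocks = list(range(NUM_BLOCKS))
block_tables = {}        # request id -> list of physical block ids

def allocate(request_id: str, num_tokens: int) -> None:
    """Reserve enough blocks to hold num_tokens KV entries for this request."""
    num_needed = -(-num_tokens // BLOCK_SIZE)   # ceiling division
    if num_needed > len(free_blocks):
        raise RuntimeError("out of KV cache blocks: request must wait or be preempted")
    block_tables[request_id] = [free_blocks.pop() for _ in range(num_needed)]

allocate("req-0", 40)    # 40 tokens -> 3 blocks of 16
print(block_tables, "free:", free_blocks)
```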
2.5 Evictor
vllm/vllm/core/evictor.py
The eviction policy currently used is LRU: when there is not enough space to store something new, the blocks that were used longest ago are evicted.
This is used by prefix caching.
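Below is a toy LRU evictor in the same spirit as evictor.py, again just an illustration (hypothetical code, not the real implementation): the entry touched least recently is the first one thrown away when space runs out.

```python
# Toy LRU evictor (hypothetical illustration, not vllm/vllm/core/evictor.py).
from collections import OrderedDict

class LRUEvictor:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()          # block hash -> cached block

    def add(self, block_hash, block) -> None:
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)                # refresh recency
        elif len(self.blocks) >= self.capacity:
            evicted_hash, _ = self.blocks.popitem(last=False)  # evict the oldest
            print("evicted", evicted_hash)
        self.blocks[block_hash] = block

evictor = LRUEvictor(capacity=2)
evictor.add("prefix-a", object())
evictor.add("prefix-b", object())
evictor.add("prefix-c", object())   # "prefix-a" gets evicted
```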
2.6 Worker
If the scheduler is the advisor, then the workers are the PhD students under it: each worker carries out the scheduler's commands and hands the results back to the scheduler.
In other words, the workers are the ones doing the grunt work.
vllm/vllm/worker
2.7 Model executor
The worker initializes the environment, and then the model executor is what actually runs the model.
vllm/vllm/model_executor
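To tie 2.6 and 2.7 together, here is a purely conceptual sketch (hypothetical class names, not the real vLLM worker or model runner classes) of the split: the worker owns device setup and delegates the actual forward pass to a model runner.

```python
# Conceptual sketch of the worker / model-runner split (hypothetical classes,
# not the real vllm/vllm/worker or vllm/vllm/model_executor code).
class ToyModelRunner:
    def __init__(self, model):
        self.model = model

    def execute_model(self, batch):
        # In vLLM this is the real forward pass; here we just echo the batch.
        return [f"output for {req}" for req in batch]

class ToyWorker:
    def __init__(self, model):
        # The worker is responsible for initializing the device / distributed
        # environment before any model execution happens.
        self.device_ready = True
        self.runner = ToyModelRunner(model)

    def execute(self, batch):
        assert self.device_ready
        return self.runner.execute_model(batch)

worker = ToyWorker(model="toy-model")
print(worker.execute(["req-0", "req-1"]))   # results go back to the scheduler
```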
2.8 Modelling
2.9 Attention backend
vllm/vllm/attention/backends
vllm/vllm/attention/backends/flash_attn.py
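As a small usage note, vLLM can be told which attention backend to use through the VLLM_ATTENTION_BACKEND environment variable (for example FLASH_ATTN for the flash_attn.py backend shown above); the sketch below assumes that variable name and value plus a placeholder model path, and the details may differ across vLLM versions.

```python
# Sketch: forcing the FlashAttention backend via an environment variable
# (the value "FLASH_ATTN" and the model path are assumptions that may differ
# across vLLM versions). Set the variable before importing vllm.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```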