vLLM Source Code Analysis 5 (serve launch mode: headless): A Beginner's Tutorial

Disclaimer: this is purely a hobby project; please excuse any oversights.

Source version: v0.21.0

Relevant source file:

vllm/vllm/entrypoints/cli/serve.py at v0.21.0 · vllm-project/vllm · GitHub

Overview:

Today we take a detailed look at headless, one of the three launch modes of serve:

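  1. run_headless:

Entry function for headless mode (used when api_server_count is less than 1)

Launch command: vllm serve --headless

However, this fails with RuntimeError: Did not receive response from front-end process within 5 minutes. I haven't found the cause yet; if anyone knows, I'd be grateful for a pointer.

  2. run_multi_api_server:

Entry point that launches multiple API server processes (used when api_server_count is greater than 1)

Launch command: vllm serve --api-server-count=2

  3. uvloop.run(run_server(args))

Top-level entry point for a single-instance HTTP API server (used when api_server_count equals 1)

Launch command: vllm serve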
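The dispatch between the three modes can be summarized in a few lines. This is a simplified sketch, not the actual serve.py code: choose_entrypoint is a hypothetical helper, and we assume the --headless flag also routes to run_headless, as the launch command above suggests.

```python
def choose_entrypoint(api_server_count: int, headless: bool = False) -> str:
    """Return which serve entry point would run (hypothetical helper)."""
    # ASSUMPTION: passing --headless selects run_headless regardless of the count.
    if headless or api_server_count < 1:
        return "run_headless"
    if api_server_count > 1:
        return "run_multi_api_server"
    # Exactly one API server: it runs in the current process.
    return "uvloop.run(run_server(args))"


print(choose_entrypoint(0))                 # run_headless
print(choose_entrypoint(1, headless=True))  # run_headless
print(choose_entrypoint(2))                 # run_multi_api_server
print(choose_entrypoint(1))                 # uvloop.run(run_server(args))
```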

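First, a quick recap of the previous article (see the figure below). The launch mode is determined by the following parameters:

data_parallel_external_lb and data_parallel_rank determine is_external_lb

data_parallel_hybrid_lb and data_parallel_start_rank determine is_hybrid_lb

(Note: is_external_lb and is_hybrid_lb are mutually exclusive; if both are true, the program aborts.)

enable_elastic_ep: adjusts api_server_count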
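The mutual-exclusion rule can be illustrated with a small helper. Treat this as a hypothetical sketch: resolve_lb_mode is our own name, the way each flag is combined with its rank ("flag or rank is set") is an assumption rather than a copy of serve.py, "internal" is simply our label for the case where neither mode is on, and the enable_elastic_ep adjustment is left out.

```python
from typing import Optional


def resolve_lb_mode(
    data_parallel_external_lb: bool,
    data_parallel_rank: Optional[int],
    data_parallel_hybrid_lb: bool,
    data_parallel_start_rank: Optional[int],
) -> str:
    """Hypothetical helper: derive the load-balancing mode and enforce exclusivity."""
    # ASSUMPTION: either the explicit flag or an explicitly set rank enables the mode.
    is_external_lb = data_parallel_external_lb or data_parallel_rank is not None
    is_hybrid_lb = data_parallel_hybrid_lb or data_parallel_start_rank is not None
    if is_external_lb and is_hybrid_lb:
        # The two modes are mutually exclusive, so startup is aborted.
        raise ValueError("external LB and hybrid LB cannot both be enabled")
    if is_external_lb:
        return "external"
    if is_hybrid_lb:
        return "hybrid"
    return "internal"


print(resolve_lb_mode(False, None, False, None))  # internal
print(resolve_lb_mode(False, 0, False, None))     # external
print(resolve_lb_mode(False, None, True, None))   # hybrid
```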

Walkthrough of the headless function (run_headless)

  1. Validate api_server_count

```python
if args.api_server_count > 1:
    raise ValueError("api_server_count can't be set in headless mode")
```

  2. Create the runtime configuration vllm_config

```python
# Create the EngineConfig.
engine_args = vllm.AsyncEngineArgs.from_cli_args(args)
usage_context = UsageContext.OPENAI_API_SERVER
vllm_config = engine_args.create_engine_config(
    usage_context=usage_context, headless=True
)
```
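AsyncEngineArgs.from_cli_args turns the parsed argparse namespace into an engine-args object, which create_engine_config then expands into the full vllm_config. The first half of that pattern can be sketched with a dataclass that copies the matching attributes out of the namespace. MiniEngineArgs and its three fields are hypothetical; we believe vLLM matches CLI attributes to dataclass fields in a similar way, but treat that as an assumption.

```python
import argparse
import dataclasses


@dataclasses.dataclass
class MiniEngineArgs:
    """Hypothetical, tiny stand-in for vllm.AsyncEngineArgs (not the real class)."""

    model: str = "demo-model"
    data_parallel_size: int = 1
    disable_log_stats: bool = False

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> "MiniEngineArgs":
        # Copy only the CLI attributes whose names match a dataclass field;
        # serve-only flags such as --headless are simply ignored.
        names = [f.name for f in dataclasses.fields(cls)]
        return cls(**{n: getattr(args, n) for n in names if hasattr(args, n)})


parser = argparse.ArgumentParser()
parser.add_argument("--model", default="demo-model")
parser.add_argument("--data-parallel-size", type=int, default=1)
parser.add_argument("--disable-log-stats", action="store_true")
parser.add_argument("--headless", action="store_true")

args = parser.parse_args(["--data-parallel-size", "2", "--headless"])
engine_args = MiniEngineArgs.from_cli_args(args)
print(engine_args)
# MiniEngineArgs(model='demo-model', data_parallel_size=2, disable_log_stats=False)
```

argparse converts --data-parallel-size into the attribute data_parallel_size, which is what lets the field names line up.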

  3. Validate data_parallel_hybrid_lb

```python
if engine_args.data_parallel_hybrid_lb:
    raise ValueError("data_parallel_hybrid_lb is not applicable in headless mode")
```

  4. Get the parallel config and validate local_engine_count

```python
parallel_config = vllm_config.parallel_config
local_engine_count = parallel_config.data_parallel_size_local

if local_engine_count <= 0:
    raise ValueError("data_parallel_size_local must be > 0 in headless mode")
```
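Steps 1, 3 and 4 are plain guard clauses, so they can be exercised without vLLM. In this sketch HeadlessSettings is a hypothetical flat stand-in for the values the real code reads from args, engine_args and vllm_config; the checks and error messages are copied from the excerpts above, while the fact that the real code runs step 1 before building the config is ignored here.

```python
from dataclasses import dataclass


@dataclass
class HeadlessSettings:
    """Hypothetical flat stand-in for the values the checks read from args/config."""

    api_server_count: int
    data_parallel_hybrid_lb: bool
    data_parallel_size_local: int


def validate_headless(s: HeadlessSettings) -> int:
    """Apply steps 1, 3 and 4 in order; return the local engine count."""
    if s.api_server_count > 1:
        raise ValueError("api_server_count can't be set in headless mode")
    if s.data_parallel_hybrid_lb:
        raise ValueError("data_parallel_hybrid_lb is not applicable in headless mode")
    if s.data_parallel_size_local <= 0:
        raise ValueError("data_parallel_size_local must be > 0 in headless mode")
    return s.data_parallel_size_local


print(validate_headless(HeadlessSettings(0, False, 2)))  # 2
try:
    validate_headless(HeadlessSettings(2, False, 2))
except ValueError as e:
    print(e)  # api_server_count can't be set in headless mode
```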

  5. Register termination signal handlers for graceful shutdown

```python
shutdown_requested = False

def signal_handler(signum, frame):
    nonlocal shutdown_requested
    logger.debug("Received %d signal.", signum)
    if not shutdown_requested:
        shutdown_requested = True
        raise SystemExit

signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)
```
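The handler's behaviour can be reproduced in isolation: the first signal sets the flag and raises SystemExit, which unwinds the main thread into its finally blocks (in run_headless, the finally of step 8 then reads shutdown_requested), while any further signal is only recorded. demo_graceful_signal is a hypothetical name; the sketch assumes a POSIX system, Python 3.8+ for signal.raise_signal, and being called from the main thread, since signal.signal only works there.

```python
import signal


def demo_graceful_signal() -> tuple:
    """Mimic the handler above: the first SIGTERM raises SystemExit, later ones do not."""
    shutdown_requested = False
    received = []  # mutating a list needs no nonlocal; rebinding a name does

    def signal_handler(signum, frame):
        nonlocal shutdown_requested
        received.append(signum)
        if not shutdown_requested:
            shutdown_requested = True
            raise SystemExit  # unwinds the main thread into its finally blocks

    previous = signal.signal(signal.SIGTERM, signal_handler)
    try:
        try:
            signal.raise_signal(signal.SIGTERM)  # simulates `kill <pid>`
        except SystemExit:
            pass
        # A second signal (e.g. while cleaning up) is recorded but does not raise again.
        signal.raise_signal(signal.SIGTERM)
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)  # restore the old handler
    return shutdown_requested, len(received)


print(demo_graceful_signal())  # (True, 2)
```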

  6. Secondary (non-head) node logic in a distributed deployment

This branch is never taken in a single-node deployment.

```python
if parallel_config.node_rank_within_dp > 0:
    from vllm.version import __version__ as VLLM_VERSION

    # Run headless workers (for multi-node PP/TP).
    host = parallel_config.master_addr
    head_node_address = f"{host}:{parallel_config.master_port}"
    logger.info(
        "Launching vLLM (v%s) headless multiproc executor, "
        "with head node address %s for torch.distributed process group.",
        VLLM_VERSION,
        head_node_address,
    )

    executor = MultiprocExecutor(vllm_config, monitor_workers=False)
    executor.start_worker_monitor(inline=True)
    return
```
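To make the branch concrete, here is a hypothetical helper (headless_role is our name, not vLLM's) that returns what a node would start and which address it would use. Per the code comment above, node_rank_within_dp > 0 applies to multi-node PP/TP, where one DP rank spans several machines; a single machine stays at 0, which matches the statement that the branch never fires there. The addresses and ports below are made-up example values.

```python
def headless_role(node_rank_within_dp: int, master_addr: str, master_port: int) -> tuple:
    """Hypothetical helper mirroring the branch above."""
    if node_rank_within_dp > 0:
        # Secondary node of a multi-node PP/TP group: start only the workers
        # (MultiprocExecutor) and join the head node's torch.distributed group.
        return "headless workers", f"{master_addr}:{master_port}"
    # node_rank_within_dp == 0: fall through to launching the DP engine processes.
    return "engine processes", None


print(headless_role(0, "10.0.0.1", 29501))  # ('engine processes', None)
print(headless_role(1, "10.0.0.1", 29501))  # ('headless workers', '10.0.0.1:29501')
```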

  7. Instantiate CoreEngineProcManager, the engine process manager

It is responsible for creating, managing, and monitoring all local data-parallel (DP) engine subprocesses.

```python
host = parallel_config.data_parallel_master_ip
port = parallel_config.data_parallel_rpc_port
handshake_address = get_tcp_uri(host, port)

logger.info(
    "Launching %d data parallel engine(s) in headless mode, "
    "with head node address %s.",
    local_engine_count,
    handshake_address,
)

# Create the engines.
engine_manager = CoreEngineProcManager(
    local_engine_count=local_engine_count,
    start_index=vllm_config.parallel_config.data_parallel_rank,
    local_start_index=0,
    vllm_config=vllm_config,
    local_client=False,
    handshake_address=handshake_address,
    executor_class=Executor.get_class(vllm_config),
    log_stats=not engine_args.disable_log_stats,
)
```
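Two of these arguments encode simple arithmetic worth seeing on its own: the handshake address string and how engines on this node are numbered. Both helpers below are hypothetical. tcp_uri assumes get_tcp_uri yields a tcp://host:port string with IPv6 literals bracketed, and engine_indices assumes CoreEngineProcManager numbers engine i as start_index + i globally and local_start_index + i locally. Neither is copied from vLLM.

```python
import ipaddress


def tcp_uri(host: str, port: int) -> str:
    """Hypothetical re-implementation of what get_tcp_uri is assumed to return."""
    try:
        if ipaddress.ip_address(host).version == 6:
            return f"tcp://[{host}]:{port}"  # IPv6 literals need brackets in a URI
    except ValueError:
        pass  # a hostname rather than an IP literal
    return f"tcp://{host}:{port}"


def engine_indices(start_index: int, local_start_index: int, local_engine_count: int) -> list:
    """(global DP rank, local index) for each engine process this node would start."""
    return [(start_index + i, local_start_index + i) for i in range(local_engine_count)]


print(tcp_uri("127.0.0.1", 29550))  # tcp://127.0.0.1:29550
print(tcp_uri("::1", 29550))        # tcp://[::1]:29550
print(engine_indices(2, 0, 2))      # [(2, 0), (3, 1)]
```

For example, a node whose data_parallel_rank is 2 and which runs two local engines would, under these assumptions, host global DP ranks 2 and 3 as local engines 0 and 1.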

  8. Start the liveness monitor

```python
try:
    engine_manager.monitor_engine_liveness()
finally:
    timeout = None
    if shutdown_requested:
        timeout = vllm_config.shutdown_timeout
        logger.info("Waiting up to %d seconds for processes to exit", timeout)
    engine_manager.shutdown(timeout=timeout)
    logger.info("Shutting down.")
```

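engine_manager.monitor_engine_liveness() is an infinite blocking loop: the main thread stays on this line for the whole run, which is what keeps the program resident. It continuously maintains the TCP heartbeat between the parent and child processes (127.0.0.1:29550) and monitors the liveness of all worker subprocesses in real time.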
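One common way for a parent to block until any child process exits is to wait on the processes' sentinels. The sketch below illustrates that idea with the standard library only; whether vLLM's monitor_engine_liveness relies on sentinels, on the TCP connection mentioned above, or on both is not something verified here. _engine and monitor_liveness are hypothetical names, and the "fork" start method makes the sketch POSIX-only.

```python
import multiprocessing as mp
import time
from multiprocessing.connection import wait


def _engine(seconds: float) -> None:
    time.sleep(seconds)  # stand-in for an engine core's busy loop


def monitor_liveness(lifetimes: list) -> str:
    """Start one process per lifetime and return the name of the first that exits."""
    ctx = mp.get_context("fork")  # POSIX-only; avoids re-importing this module
    procs = [
        ctx.Process(target=_engine, args=(t,), name=f"engine_{i}")
        for i, t in enumerate(lifetimes)
    ]
    for p in procs:
        p.start()
    sentinel_to_proc = {p.sentinel: p for p in procs}
    try:
        # Blocks until at least one sentinel becomes ready, i.e. a process exited.
        ready = wait(list(sentinel_to_proc))
        return sentinel_to_proc[ready[0]].name
    finally:
        # Tear down the survivors, as a supervisor would after losing an engine.
        for p in procs:
            if p.is_alive():
                p.terminate()
            p.join()


if __name__ == "__main__":
    print(monitor_liveness([5.0, 0.1]))  # engine_1
```

The main thread does no polling here: it sleeps inside wait() until the operating system reports that a child has exited, which is the "stays on this line for the whole run" behaviour described above.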

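I originally wanted to cover all three launch modes (headless, multi_api_server, and uvloop.run) in one article, but work on my project has been busy lately, so the other two will be covered in the next installment.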

Key concepts

Python nested functions

```python
def outer():
    # Outer function
    x = 10

    # Inner function, defined inside outer
    def inner():
        print(x)  # can read the outer variable

    # Call the inner function
    inner()

outer()  # prints 10
```

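Pros

Encapsulation: helper logic is visible only inside the current function, so its names don't pollute the global namespace;

Closures can capture variables from the enclosing scope, so you don't have to pass them around as arguments;

Cohesion: related logic lives close together, which improves readability.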

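Cons

Deep nesting hurts readability (more than 2 levels is not recommended);

Tracebacks gain an extra frame, so debugging complex nesting is slightly more tedious.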

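The nonlocal keyword

nonlocal is used inside a nested (inner) function. It declares that a variable belongs neither to the current function's local scope nor to the global scope, but to the local scope of an enclosing (outer) function, so assigning to it inside the inner function rebinds the outer variable instead of creating a new local one.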
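This is why signal_handler declares nonlocal shutdown_requested: without it, the assignment would create a new local variable and the finally block would never see the flag change. The three small functions below (hypothetical names) show the difference, including the UnboundLocalError you get when an inner function reads a name before assigning it without nonlocal.

```python
def without_nonlocal() -> bool:
    flag = False

    def set_flag():
        flag = True  # creates a NEW local variable; the outer flag is untouched

    set_flag()
    return flag


def with_nonlocal() -> bool:
    flag = False

    def set_flag():
        nonlocal flag  # bind to the enclosing function's variable
        flag = True

    set_flag()
    return flag


def read_then_assign() -> str:
    flag = False

    def toggle():
        flag = not flag  # flag is local here (assigned), so reading it first fails

    try:
        toggle()
    except UnboundLocalError:
        return "UnboundLocalError"
    return "no error"


print(without_nonlocal())  # False
print(with_nonlocal())     # True
print(read_then_assign())  # UnboundLocalError
```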