【LLM Inference】Using the vLLM Engine

Relationship Between the EngineCore_DP0 and RayWorkerWrapper Processes

Process Relationship Overview

EngineCore_DP0 and RayWorkerWrapper have a master-worker relationship:

  • EngineCore_DP0: the main controller process (pid=8336), responsible for coordinating and managing distributed inference [1]
  • RayWorkerWrapper: a worker process (pid=14610) running inside a Ray-managed process, which executes the actual model inference tasks [2]

Creation Sequence

The creation sequence, from the user down to the Ray workers:

  1. User → AsyncLLM: start vLLM
  2. AsyncLLM → EngineCoreClient: make_async_mp_client()
  3. EngineCoreClient → EngineCore_DP0: launch the process (pid=8336)
  4. EngineCore_DP0 → RayDistributedExecutor: create the distributed executor
  5. RayDistributedExecutor → Ray framework: initialize the Ray cluster
  6. RayDistributedExecutor → Ray framework: create the placement group
  7. Ray framework → RayWorkerWrapper: create RayWorkerWrapper (pid=14610)
  8. RayWorkerWrapper → RayDistributedExecutor: register the worker process
  9. EngineCore_DP0 → RayWorkerWrapper: dispatch tasks
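
This sequence is triggered by an ordinary vLLM entry point. A minimal, hedged sketch of that entry point (the model name and prompt are placeholders; tensor_parallel_size and distributed_executor_backend are existing vLLM arguments, but verify them against your vLLM version):

python
# Minimal sketch: starting vLLM with the Ray distributed executor backend.
# "facebook/opt-125m" and the prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,               # one RayWorkerWrapper actor per GPU
    distributed_executor_backend="ray",   # selects RayDistributedExecutor
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)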

Detailed Creation Flow

1. EngineCore Creation

EngineCore starts in its own process and is created through EngineCoreClient.make_async_mp_client() [3].

2. RayDistributedExecutor Initialization

EngineCore creates a RayDistributedExecutor instance [4] and calls its _init_executor() method [5].

3. Ray Worker Creation

The RayWorkerWrapper actors are created inside _init_workers_ray() [6]:

python
worker = ray.remote(
    num_cpus=0,
    num_gpus=num_gpus,
    scheduling_strategy=scheduling_strategy,
)(RayWorkerWrapper).remote(rpc_rank=rank)
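
The two-step pattern above, ray.remote(**options)(cls) to attach resource requirements to a class and .remote(...) to start the actor, is plain Ray API. A self-contained toy illustration (not vLLM code):

python
# Toy illustration of ray.remote(**options)(cls).remote(...): the first call
# attaches resource options to the class, the second spawns the actor process.
import ray

ray.init()

class Wrapper:
    def __init__(self, rpc_rank: int):
        self.rpc_rank = rpc_rank

    def get_rank(self) -> int:
        return self.rpc_rank

actor_cls = ray.remote(num_cpus=0, num_gpus=0)(Wrapper)  # attach resource options
worker = actor_cls.remote(rpc_rank=0)                     # start the actor process
print(ray.get(worker.get_rank.remote()))                  # 0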

4. Inter-Process Communication
  • EngineCore communicates with the frontend over ZMQ [7] (see the toy ZMQ sketch after this list)
  • EngineCore communicates with RayWorkerWrapper via Ray RPC [8]
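
A toy pyzmq sketch of the first leg. This is not the actual vLLM socket topology (vLLM binds ROUTER/PULL sockets over IPC or TCP addresses); PUSH/PULL over inproc is used here only to keep the sketch self-contained:

python
# Toy sketch: the frontend pushes an encoded request to EngineCore over ZMQ,
# and EngineCore pulls it off its input socket.
import zmq

ctx = zmq.Context(io_threads=2)

frontend = ctx.socket(zmq.PUSH)
frontend.bind("inproc://engine_input")

engine_core = ctx.socket(zmq.PULL)
engine_core.connect("inproc://engine_input")

frontend.send(b"encoded EngineCoreRequest")
print(engine_core.recv())  # b'encoded EngineCoreRequest'

frontend.close()
engine_core.close()
ctx.term()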

Notes

  • EngineCore_DP0 runs in its own process, which provides resource isolation
  • RayWorkerWrapper runs inside a Ray-managed process, with one instance per GPU
  • The two processes coordinate through Ray's RPC mechanism to carry out distributed inference [9] (see the sketch after this list)
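
The coordination mentioned in the last note boils down to ordinary Ray actor calls: the executor holds actor handles and dispatches the same method call to every worker, collecting ObjectRef futures. A toy sketch (not vLLM code):

python
# Toy sketch: dispatching one call to several Ray actors and gathering results.
import ray

ray.init()

@ray.remote
class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def execute_model(self, scheduler_output: str) -> str:
        return f"rank {self.rank} ran {scheduler_output}"  # stand-in for inference

workers = [Worker.remote(rank) for rank in range(2)]
refs = [w.execute_model.remote("batch-0") for w in workers]  # asynchronous dispatch
print(ray.get(refs))  # ['rank 0 ran batch-0', 'rank 1 ran batch-0']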

Citations

File: vllm/v1/engine/core.py (L79-109)

python
class EngineCore:
    """Inner loop of vLLM's Engine."""

    def __init__(
        self,
        vllm_config: VllmConfig,
        executor_class: type[Executor],
        log_stats: bool,
        executor_fail_callback: Callable | None = None,
        include_finished_set: bool = False,
    ):
        # plugins need to be loaded at the engine/scheduler level too
        from vllm.plugins import load_general_plugins

        load_general_plugins()

        self.vllm_config = vllm_config
        if not vllm_config.parallel_config.data_parallel_rank_local:
            logger.info(
                "Initializing a V1 LLM engine (v%s) with config: %s",
                VLLM_VERSION,
                vllm_config,
            )

        self.log_stats = log_stats

        # Setup Model.
        self.model_executor = executor_class(vllm_config)
        if executor_fail_callback is not None:
            self.model_executor.register_failure_callback(executor_fail_callback)

File: vllm/v1/engine/core.py (L1457-1485)

python
class EngineCoreActorMixin:
    """
    Ray actor for running EngineCore in a data parallel context
    """

    def __init__(
        self,
        vllm_config: VllmConfig,
        addresses: EngineZmqAddresses,
        dp_rank: int = 0,
        local_dp_rank: int = 0,
    ):
        self.addresses = addresses
        vllm_config.parallel_config.data_parallel_index = dp_rank
        vllm_config.parallel_config.data_parallel_rank_local = local_dp_rank

        # Set CUDA_VISIBLE_DEVICES as early as possible in actor life cycle
        # NOTE: in MP we set CUDA_VISIBLE_DEVICES at process creation time,
        # and this cannot be done in the same way for Ray because:
        # 1) Ray manages life cycle of all ray workers (including
        # DPEngineCoreActor)
        # 2) Ray sets CUDA_VISIBLE_DEVICES based on num_gpus configuration
        # To bypass 2, we need to also set
        # RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, but vLLM workers created
        # thereafter would have CUDA_VISIBLE_DEVICES set, which is sticky:
        # https://github.com/ray-project/ray/blob/e752fc319ddedd9779a0989b6d3613909bad75c9/python/ray/_private/worker.py#L456 # noqa: E501
        # This is problematic because when the vLLM worker (a Ray actor)
        # executes a task, it indexes into the sticky CUDA_VISIBLE_DEVICES
        # rather than directly using the GPU ID, potentially resulting in

File: vllm/v1/executor/ray_utils.py (L41-66)

python
    class RayWorkerWrapper(WorkerWrapperBase):
        """Ray wrapper for vllm.worker.Worker, allowing Worker to be
        lazily initialized after Ray sets CUDA_VISIBLE_DEVICES."""

        def __init__(self, *args, **kwargs) -> None:
            super().__init__(*args, **kwargs)
            # Since the compiled DAG runs a main execution
            # in a different thread that calls cuda.set_device.
            # The flag indicates is set_device is called on
            # that thread.
            self.compiled_dag_cuda_device_set = False

        def get_node_ip(self) -> str:
            return get_ip()

        def get_node_and_gpu_ids(self) -> tuple[str, list[int]]:
            node_id = ray.get_runtime_context().get_node_id()
            device_key = vllm.platforms.current_platform.ray_device_key
            if not device_key:
                raise RuntimeError(
                    "current platform %s does not support ray.",
                    vllm.platforms.current_platform.device_name,
                )
            gpu_ids = ray.get_runtime_context().get_accelerator_ids()[device_key]
            return node_id, gpu_ids

File: vllm/v1/executor/ray_utils.py (L83-123)

python
        def execute_model_ray(
            self,
            execute_model_input: tuple["SchedulerOutput", "GrammarOutput"]
            | tuple["SchedulerOutput", "GrammarOutput", "IntermediateTensors"],
        ) -> Union[
            "ModelRunnerOutput",
            tuple["SchedulerOutput", "GrammarOutput", "IntermediateTensors"],
        ]:
            # This method is used by Ray Compiled Graph to execute the model,
            # and it needs a special logic of self.setup_device_if_necessary()
            self.setup_device_if_necessary()
            assert self.worker is not None, "Worker is not initialized"
            if len(execute_model_input) == 3:
                scheduler_output, grammar_output, intermediate_tensors = (
                    execute_model_input
                )
            else:
                scheduler_output, grammar_output = execute_model_input
                intermediate_tensors = None
            assert self.worker.model_runner is not None
            output = self.worker.model_runner.execute_model(
                scheduler_output, intermediate_tensors
            )
            if isinstance(output, IntermediateTensors):
                return scheduler_output, grammar_output, output

            if isinstance(output, AsyncModelRunnerOutput):
                output = output.get_output()
            if not get_pp_group().is_last_rank:
                # Case where there are no scheduled requests
                # but may still be finished requests.
                assert not output or not output.req_ids
                output = scheduler_output, grammar_output, None
            elif output is None:
                output = self.worker.model_runner.sample_tokens(grammar_output)
                # Ensure outputs crossing Ray compiled DAG are serializable.
                # AsyncModelRunnerOutput holds CUDA events and cannot be
                # pickled.
                if isinstance(output, AsyncModelRunnerOutput):
                    output = output.get_output()
            return output

File: vllm/v1/executor/ray_executor.py (L80-97)

python
    def _init_executor(self) -> None:
        self.forward_dag: ray.dag.CompiledDAG | None = None

        # For TPU or XPU, avoid compiling NVIDIA's NCCL
        if current_platform.is_tpu() or current_platform.is_xpu():
            os.environ["VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE"] = "shm"

        assert self.uses_ray
        initialize_ray_cluster(self.parallel_config)
        placement_group = self.parallel_config.placement_group

        # Disable Ray usage stats collection.
        ray_usage = os.environ.get("RAY_USAGE_STATS_ENABLED", "0")
        if ray_usage != "1":
            os.environ["RAY_USAGE_STATS_ENABLED"] = "0"

        # Create the parallel GPU workers.
        self._init_workers_ray(placement_group)

File: vllm/v1/executor/ray_executor.py (L154-221)

python
    def _init_workers_ray(self, placement_group: "PlacementGroup", **ray_remote_kwargs):
        num_gpus = envs.VLLM_RAY_PER_WORKER_GPUS

        # The driver dummy worker does not actually use any resources.
        # It holds the resource for the driver worker.
        self.driver_dummy_worker: RayWorkerWrapper | None = None
        # The remaining workers are the actual ray actors.
        self.workers: list[RayWorkerWrapper] = []

        # Used in ray compiled DAG: indexed first by PP rank,
        # and then TP rank. In other words, the inner list is
        # the TP group of workers for a PP rank.
        self.pp_tp_workers: list[list[RayWorkerWrapper]] = []

        if self.parallel_config.ray_workers_use_nsight:
            ray_remote_kwargs = self._configure_ray_workers_use_nsight(
                ray_remote_kwargs
            )

        # Create the workers.
        bundle_indices: list[int]
        if envs.VLLM_RAY_BUNDLE_INDICES:
            # Use the bundle indices specified by the user.
            bundle_indices = list(map(int, envs.VLLM_RAY_BUNDLE_INDICES.split(",")))
            assert len(bundle_indices) == self.parallel_config.world_size, (
                "VLLM_RAY_BUNDLE_INDICES must have the same size"
                f" as the world size, but got {bundle_indices=} "
                f"and {self.parallel_config.world_size=}"
            )
            assert len(set(bundle_indices)) == len(bundle_indices), (
                "VLLM_RAY_BUNDLE_INDICES cannot have duplicate values,"
                f" but got {bundle_indices=}"
            )
        else:
            # use the first N bundles that have GPU resources.
            bundle_indices = []
            for bundle_id, bundle in enumerate(placement_group.bundle_specs):
                if bundle.get(current_platform.ray_device_key, 0):
                    bundle_indices.append(bundle_id)
            bundle_indices = bundle_indices[: self.parallel_config.world_size]

        worker_metadata: list[RayWorkerMetaData] = []
        driver_ip = get_ip()
        for rank, bundle_id in enumerate(bundle_indices):
            scheduling_strategy = PlacementGroupSchedulingStrategy(
                placement_group=placement_group,
                placement_group_capture_child_tasks=True,
                placement_group_bundle_index=bundle_id,
            )

            if current_platform.ray_device_key == "GPU":
                # NV+AMD GPUs, and Intel XPUs
                worker = ray.remote(
                    num_cpus=0,
                    num_gpus=num_gpus,
                    scheduling_strategy=scheduling_strategy,
                    **ray_remote_kwargs,
                )(RayWorkerWrapper).remote(rpc_rank=rank)
            else:
                worker = ray.remote(
                    num_cpus=0,
                    num_gpus=0,
                    resources={current_platform.ray_device_key: num_gpus},
                    scheduling_strategy=scheduling_strategy,
                    **ray_remote_kwargs,
                )(RayWorkerWrapper).remote(rpc_rank=rank)

            worker_metadata.append(RayWorkerMetaData(worker=worker, created_rank=rank))

File: vllm/v1/executor/ray_executor.py (L517-517)

python
                "cupy is not installed but required since "

How are CoreEngineProcManager, EngineCoreClient, SyncMPClient, and MPClient related?




Now it's clear. Let me explain why SyncMPClient needs to call the parent class MPClient's __init__.

Why SyncMPClient Needs to Call the Parent Class's __init__

  1. What the parent MPClient.__init__ does

MPClient.__init__ (vllm/v1/engine/core_client.py:423-520) performs a substantial amount of initialization:

python
def __init__(
    self,
    asyncio_mode: bool,
    vllm_config: VllmConfig,
    executor_class: type[Executor],
    log_stats: bool,
    client_addresses: dict[str, str] | None = None,
):
    # 1. Store configuration
    self.vllm_config = vllm_config

    # 2. Set up serializers
    self.encoder = MsgpackEncoder()
    self.decoder = MsgpackDecoder(EngineCoreOutputs)

    # 3. Create the ZMQ context
    sync_ctx = zmq.Context(io_threads=2)
    self.ctx = zmq.asyncio.Context(sync_ctx) if asyncio_mode else sync_ctx

    # 4. Create the resource manager (used for cleanup)
    self.resources = BackgroundResources(ctx=sync_ctx)
    self._finalizer = weakref.finalize(self, self.resources)

    # 5. Launch the EngineCore process(es) via launch_core_engines
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
        engine_manager,
        coordinator,
        addresses,
    ):
        self.resources.coordinator = coordinator
        self.resources.engine_manager = engine_manager

    # 6. Create ZMQ sockets
    self.input_socket = make_zmq_socket(self.ctx, input_address, zmq.ROUTER, bind=True)
    self.resources.output_socket = make_zmq_socket(self.ctx, output_address, zmq.PULL)

    # 7. Set up data-parallel related state
    self.engine_ranks_managed = [...]
    self.core_engines = [...]
    self.utility_results = {}
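
Step 4 above registers a cleanup callback with weakref.finalize. A generic illustration of that pattern (plain Python, not vLLM code):

python
# Generic illustration of the weakref.finalize cleanup pattern: the registered
# callable runs when the owning object is garbage collected or at interpreter exit.
import weakref

class Resources:
    """Stands in for BackgroundResources: a callable that closes things."""
    def __call__(self):
        print("closing ZMQ sockets and shutting down engines")

class Client:
    def __init__(self):
        self.resources = Resources()
        self._finalizer = weakref.finalize(self, self.resources)

client = Client()
del client  # in CPython this triggers the finalizer and prints the message
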
  2. The subclass SyncMPClient.__init__ depends on the parent's initialization

SyncMPClient.__init__ (vllm/v1/engine/core_client.py:642-690) relies on attributes set up by the parent class:

python
def __init__(
    self, vllm_config: VllmConfig, executor_class: type[Executor], log_stats: bool
):
    # Call the parent __init__ with asyncio_mode=False (synchronous mode)
    super().__init__(
        asyncio_mode=False,
        vllm_config=vllm_config,
        executor_class=executor_class,
        log_stats=log_stats,
    )

    # Use attributes initialized by the parent class
    self.is_dp = self.vllm_config.parallel_config.data_parallel_size > 1  # uses self.vllm_config
    self.outputs_queue = queue.Queue[EngineCoreOutputs | Exception]()

    # These all rely on attributes the parent created
    ctx = self.ctx                              # ZMQ context created by the parent
    out_socket = self.resources.output_socket   # output socket created by the parent
    decoder = self.decoder                      # decoder created by the parent
    utility_results = self.utility_results      # dict created by the parent

    # Background thread that processes outputs
    def process_outputs_socket():
        # uses out_socket, decoder, utility_results
        ...
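
The elided process_outputs_socket() follows a standard producer-thread pattern: a daemon thread drains the output socket, decodes frames, and puts the results (or any exception) on outputs_queue for the synchronous caller. A generic sketch of that pattern (not the actual vLLM implementation):

python
# Generic sketch: a daemon thread feeds a queue that the sync client reads from.
import queue
import threading

outputs_queue: queue.Queue = queue.Queue()

def process_outputs_socket(recv, decode):
    try:
        while True:
            frame = recv()                   # blocking read (a ZMQ socket in vLLM)
            outputs_queue.put(decode(frame))
    except Exception as e:
        outputs_queue.put(e)                 # surface errors instead of losing them

frames = iter([b"hello", b"world"])
thread = threading.Thread(
    target=process_outputs_socket,
    args=(lambda: next(frames), bytes.decode),
    daemon=True,
)
thread.start()
print(outputs_queue.get())  # "hello"
print(outputs_queue.get())  # "world"
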
  3. Consequences of not calling the parent's __init__

If super().__init__() is never called:

  1. self.vllm_config is undefined → accessing it raises AttributeError (a minimal illustration follows this list)

  2. self.ctx is undefined → ZMQ sockets cannot be created

  3. self.resources is undefined → output_socket cannot be accessed

  4. self.decoder is undefined → output messages cannot be decoded

  5. self.utility_results is undefined → utility-call results cannot be stored

  6. the EngineCore process is never launched → there is no backend process to handle requests
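
A minimal, generic illustration of the first failure mode (plain Python, not vLLM code):

python
# Skipping super().__init__() leaves parent-defined attributes missing.
class Base:
    def __init__(self):
        self.ctx = "zmq-context"

class GoodChild(Base):
    def __init__(self):
        super().__init__()

class BadChild(Base):
    def __init__(self):
        pass  # forgot super().__init__()

print(GoodChild().ctx)  # "zmq-context"
try:
    BadChild().ctx
except AttributeError as e:
    print(e)  # 'BadChild' object has no attribute 'ctx'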

  4. Why use super().__init__()

This is standard practice in object-oriented Python:

python
class SyncMPClient(MPClient):
    def __init__(self, vllm_config, executor_class, log_stats):
        # Parameter adaptation: SyncMPClient does not take an asyncio_mode
        # argument, but the parent class does, so it is hard-coded to False here.
        super().__init__(
            asyncio_mode=False,  # explicitly synchronous mode
            vllm_config=vllm_config,
            executor_class=executor_class,
            log_stats=log_stats,
        )

        # Subclass-specific initialization
        self.outputs_queue = queue.Queue()
        # ... other synchronous-mode-specific setup
  5. Design advantages

The benefits of this design:

  1. Code reuse: MPClient's initialization logic is shared by SyncMPClient and AsyncMPClient
  2. Separation of responsibilities:
    • MPClient: generic multiprocess-client initialization (ZMQ, launching EngineCore)
    • SyncMPClient: synchronous-mode-specific initialization (output queue, background thread)
  3. Easier maintenance: changes to the shared logic only need to be made once, in MPClient
  4. Explicit mode selection: the asyncio_mode flag controls whether a synchronous or asynchronous ZMQ context is created (see the sketch below)
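
A small sketch of point 4: one flag selecting a synchronous or asyncio-aware ZMQ context, mirroring the MPClient lines quoted earlier (the make_ctx helper name is illustrative, not part of vLLM):

python
# The async context shadows the sync one, so both share the same underlying
# ZMQ state while exposing different (blocking vs. awaitable) socket APIs.
import zmq
import zmq.asyncio

def make_ctx(asyncio_mode: bool):
    sync_ctx = zmq.Context(io_threads=2)
    return zmq.asyncio.Context(sync_ctx) if asyncio_mode else sync_ctx

sync_client_ctx = make_ctx(asyncio_mode=False)   # SyncMPClient path
async_client_ctx = make_ctx(asyncio_mode=True)   # AsyncMPClient path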

Summary

SyncMPClient calls the parent MPClient's __init__ because:

  1. The parent performs the core initialization: creating the ZMQ context, launching the EngineCore process(es), and creating the sockets
  2. The subclass depends on attributes the parent sets up: ctx, resources, decoder, utility_results, and so on
  3. It follows the inheritance contract: a subclass must make sure its parent class is properly initialized, which is a basic requirement of object-oriented programming

If the parent's __init__ is not called, SyncMPClient cannot work at all, because the required initialization state is missing.
