# Separated Deployment and Colocated Mode Explained

## Core Concepts

**Separated mode** and **colocated mode** are the two resource-deployment strategies in the ROLL framework. The key difference is whether different roles share GPU resources.
## Mode Definition and Detection

### 1. Colocation check logic

The framework uses the `is_colocated` function to decide whether two workers are colocated [1]:
```python
def is_colocated(actor_train: WorkerConfig, actor_infer: WorkerConfig):
    train_devices = set(actor_train.device_mapping or [])
    infer_devices = set(actor_infer.device_mapping or [])
    if train_devices.issuperset(infer_devices):
        return True
    # Partial overlap also returns False
    return False
```
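To make the possible outcomes concrete, here is a minimal sketch that exercises the same check with lightweight stand-in objects; the real framework passes `WorkerConfig` instances, so the stand-ins below are purely illustrative:

```python
# Minimal sketch (not framework code): the stand-ins carry only the
# device_mapping attribute that is_colocated reads; the check is re-stated
# here so the example runs standalone.
from types import SimpleNamespace


def is_colocated(actor_train, actor_infer):
    train_devices = set(actor_train.device_mapping or [])
    infer_devices = set(actor_infer.device_mapping or [])
    # Disjoint or partially overlapping device sets both yield False.
    return train_devices.issuperset(infer_devices)


train = SimpleNamespace(device_mapping=list(range(0, 8)))

# Infer devices are a subset of train devices -> colocated
print(is_colocated(train, SimpleNamespace(device_mapping=list(range(0, 8)))))   # True

# Disjoint device sets -> separated
print(is_colocated(train, SimpleNamespace(device_mapping=list(range(8, 16)))))  # False

# Partial overlap -> also reported as not colocated (see the TODO in the source)
print(is_colocated(train, SimpleNamespace(device_mapping=list(range(4, 12)))))  # False
```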
### 2. Pipeline-level check

`PPOConfig` exposes a convenience property [2]:
```python
@property
def is_train_infer_colocated(self) -> bool:
    """Whether actor_train and actor_infer are colocated."""
    return is_colocated(self.actor_train, self.actor_infer)
```
## Configuration Comparison

### Colocated mode configuration

Multiple roles share the same GPU resources [3]:
```yaml
actor_train:
  device_mapping: list(range(0,8))
actor_infer:
  device_mapping: list(range(0,8))  # shared with actor_train
reference:
  device_mapping: list(range(0,8))  # shared with actor_train
```
### Separated mode configuration

Each role uses its own independent GPU resources [4]:
```yaml
actor_train:
  device_mapping: list(range(0,8))    # GPUs 0-7
actor_infer:
  device_mapping: list(range(8,16))   # GPUs 8-15
reference:
  device_mapping: list(range(16,24))  # GPUs 16-23
```
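As a sanity check on a separated-mode layout like the one above, the following hypothetical helper (not part of the framework) verifies that no two roles share a GPU; the `eval` on the mapping strings mirrors the parsing behavior described in the worker-count section below:

```python
# Illustrative sketch: verify that every role in a separated-mode config
# owns a disjoint set of GPUs. The mapping strings use the same format as
# the YAML above.
from itertools import combinations

role_mappings = {
    "actor_train": "list(range(0,8))",
    "actor_infer": "list(range(8,16))",
    "reference": "list(range(16,24))",
}

devices = {role: set(eval(mapping)) for role, mapping in role_mappings.items()}

for (role_a, devs_a), (role_b, devs_b) in combinations(devices.items(), 2):
    overlap = devs_a & devs_b
    assert not overlap, f"{role_a} and {role_b} share GPUs: {sorted(overlap)}"

print("separated mode: all roles use disjoint GPUs")
```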
## Implementation Differences

### 1. State management strategy

**State offloading in colocated mode:**

In the vLLM strategy, colocated mode automatically offloads model states to avoid memory conflicts [5]:
```python
def offload_states(self, include=None, non_blocking=False):
    if include is None or OffloadStateType.model_params in include:
        if self.is_model_in_gpu and self.worker.pipeline_config.is_train_infer_colocated:
            self.model.offload_states(self.sleep_level)
            self.is_model_in_gpu = False
```
**No state offloading needed in separated mode:**

Because each role has independent GPU resources, all roles can keep their model states resident on the GPU at the same time.
### 2. Asynchronous training support

Separated mode is the key to enabling asynchronous training [6]:
```yaml
# Example asynchronous-training configuration
actor_train:
  device_mapping: list(range(0,4))  # training uses GPUs 0-3
actor_infer:
  device_mapping: list(range(4,8))  # inference uses GPUs 4-7
  strategy_args:
    strategy_name: sglang
```
## Execution Flow Comparison

### Colocated mode execution flow

Time-shared execution: the Pipeline calls `load_states()` on actor_infer, runs `generate()`, then `offload_states()`; it then calls `load_states()` on actor_train, runs `train_step()`, and finishes with `offload_states()`.
### Separated mode execution flow

Parallel execution: `generate()` on actor_infer and `train_step()` on actor_train proceed at the same time, with no resource contention and no offload/reload between them.
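The difference between the two flows can be summarized in a short, self-contained sketch; the `DummyWorker` class below only mimics the calls the flows make and is not the framework's worker API:

```python
# Simplified sketch of the two execution orders. DummyWorker is a stand-in
# that only prints what a real worker would do.
class DummyWorker:
    def __init__(self, name):
        self.name = name

    def reload_states(self):
        print(f"{self.name}: reload states to GPU")

    def offload_states(self):
        print(f"{self.name}: offload states to CPU")

    def generate(self, batch):
        print(f"{self.name}: generate()")
        return batch

    def train_step(self, batch):
        print(f"{self.name}: train_step()")
        return {"loss": 0.0}


def colocated_step(actor_infer, actor_train, batch):
    # Time-sharing the same GPUs: reload -> run -> offload for each role in turn.
    actor_infer.reload_states()
    rollouts = actor_infer.generate(batch)
    actor_infer.offload_states()

    actor_train.reload_states()
    metrics = actor_train.train_step(rollouts)
    actor_train.offload_states()
    return metrics


def separated_step(actor_infer, actor_train, batch, prev_rollouts):
    # Independent GPUs: generation and training could run concurrently
    # (shown sequentially here for simplicity); no offload/reload is needed.
    rollouts = actor_infer.generate(batch)
    metrics = actor_train.train_step(prev_rollouts)
    return metrics, rollouts


colocated_step(DummyWorker("actor_infer"), DummyWorker("actor_train"), batch=[])
```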
## Key Technical Details

### 1. GPU time-sharing

In colocated mode, the framework time-shares the GPUs by offloading and reloading model states [7]:
- **Automatic control**: states are reloaded and offloaded automatically around RPC calls
- **Manual control**: via `batch.meta_info["is_offload_states"]`
- **Context manager**: `state_offload_manager` simplifies the bookkeeping (see the sketch after this list)
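For intuition, here is an illustrative reload-run-offload context manager; it is a sketch of the pattern only, not the framework's actual `state_offload_manager` implementation, whose signature may differ:

```python
# Sketch of the reload -> run -> offload pattern used for GPU time-sharing.
# The worker argument is any object exposing reload_states()/offload_states().
from contextlib import contextmanager


@contextmanager
def state_offload_scope(worker, offload_after=True):
    worker.reload_states()  # bring model states back onto the GPU
    try:
        yield worker        # run the RPC / computation inside the scope
    finally:
        if offload_after:
            worker.offload_states()  # free the GPUs for the next colocated role


# Usage sketch:
# with state_offload_scope(actor_infer):
#     rollouts = actor_infer.generate(batch)
```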
### 2. Resource allocation calculation

The number of workers is computed automatically from `device_mapping` and `num_gpus_per_worker` [8]:
```python
world_size = len(device_mapping) // num_gpus_per_worker
```
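A minimal sketch of this calculation, assuming the string form of `device_mapping` used in the YAML examples above (parsed with `eval`, as the `device_mapping` documentation describes); `compute_world_size` is a hypothetical helper, not framework code:

```python
# Hypothetical helper illustrating the worker-count calculation.
def compute_world_size(device_mapping: str, num_gpus_per_worker: int = 1) -> int:
    devices = eval(device_mapping)  # e.g. "list(range(0,8))" -> [0, 1, ..., 7]
    assert len(devices) % num_gpus_per_worker == 0, (
        "len(device_mapping) must be divisible by num_gpus_per_worker"
    )
    return len(devices) // num_gpus_per_worker


print(compute_world_size("list(range(0,8))", num_gpus_per_worker=1))   # 8 workers
print(compute_world_size("list(range(8,16))", num_gpus_per_worker=2))  # 4 workers
```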
## Usage Recommendations

### Colocated mode is suited to

- **Resource-constrained environments**: maximize utilization when GPUs are scarce
- **Small-scale experiments**: quick prototype validation
- **Cost-sensitive scenarios**: reduce hardware investment

### Separated mode is suited to

- **Large-scale production**: high throughput and low latency requirements
- **Asynchronous training**: inference and training run in parallel
- **Multi-tenant environments**: isolated deployment of different services
## Notes

- Colocated mode raises resource utilization through time-sharing, at the cost of state-switching overhead
- Separated mode offers better performance and isolation, but requires more GPUs
- The framework detects the deployment mode automatically and applies the corresponding optimizations
- Asynchronous training requires separated deployment
## Citations
**File:** roll/configs/worker_config.py (L196-207)

```python
def is_colocated(actor_train: WorkerConfig, actor_infer: WorkerConfig):
    train_devices = set(actor_train.device_mapping or [])
    infer_devices = set(actor_infer.device_mapping or [])
    if train_devices.issuperset(infer_devices):
        return True
    if train_devices.intersection(infer_devices):
        # TODO: raise here
        # raise ValueError(
        #     f"train and infer share some devices, but train not cover infer. {train_devices=} {infer_devices=}"
        # )
        return False
    return False
```
**File:** roll/configs/base_config.py (L507-509)

```python
def is_train_infer_colocated(self) -> bool:
    """Whether actor_train and actor_infer are colocated."""
    return is_colocated(self.actor_train, self.actor_infer)
```
**File:** docs_roll/docs/Getting Started/FAQ/qa_issues.md (L17-30)

````markdown
### What is colocate mode?
In colocate mode, multiple roles (such as `actor_train`, `actor_infer`, `reference`) can reuse the same GPU devices in their `device_mapping`. For example:
```yaml
actor_train:
  device_mapping: list(range(0,8))
actor_infer:
  device_mapping: list(range(0,8))
reference:
  device_mapping: list(range(0,8))
```
The framework's underlying resource management mechanism ensures GPU reuse between multiple roles, improving resource utilization.
````
**File:** docs_roll/docs/Getting Started/FAQ/qa_issues.md (L32-45)

````markdown
### What is separate mode?
In separate mode, there is no intersection between different roles' `device_mapping`, and each role holds a set of independent GPU device resources. For example:
```yaml
actor_train:
  device_mapping: list(range(0,8))
actor_infer:
  device_mapping: list(range(8,16))
reference:
  device_mapping: list(range(16,24))
```
This approach can avoid resource competition between roles and improve system stability.
````
**File:** roll/distributed/strategy/vllm_strategy.py (L422-428)

```python
def offload_states(self, include=None, non_blocking=False):
    if include is None or OffloadStateType.model_params in include:
        if self.is_model_in_gpu and self.worker.pipeline_config.is_train_infer_colocated:
            self.model.offload_states(self.sleep_level)
            self.is_model_in_gpu = False
    gc.collect()
    current_platform.empty_cache()
```
**File:** docs_roll/docs/User Guides/Advanced Features/async_training.md (L129-168)

```yaml
actor_train:
  model_args:
    dtype: bf16
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 64
    warmup_steps: 1
  data_args:
    template: qwen2_5
    file_name:
      - data/math_deepmath_deal.jsonl
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 2
      pipeline_model_parallel_size: 1
      sequence_parallel: true
      use_distributed_optimizer: true
  device_mapping: list(range(0,16))
  infer_batch_size: 2
actor_infer:
  model_args:
    dtype: fp16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.85
      load_format: dummy
  device_mapping: list(range(16,24))
  infer_batch_size: 1
```
**File:** docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Advanced Features/offload_reload_control.md (L5-33)

````markdown
## Time-sharing overview
In the ROLL framework, different roles (such as actor_train, actor_infer, critic, reference, and rewards) may need to use the same GPU resources. To improve resource utilization, the framework implements GPU time-sharing, which allows model states to be switched between GPU and CPU at different points in time.
## Offload/Reload control mechanisms
### Automatic control
Taking RLVRPipeline as an example, the framework automatically manages model-state offload and reload:
```python
# Example from rlvr_pipeline.py
ref_log_probs = self.reference.compute_log_probs(batch, blocking=True)
```
By default, when an RPC call is issued to a worker, the framework first reloads that worker's GPU-related states onto the GPU and, after the call finishes, offloads the states back to host memory.
### Manual control
You can also intervene manually by setting `batch.meta_info["is_offload_states"]`:
```python
# Example from rlvr_pipeline.py
self.actor_train.offload_states(blocking=True)
```
When `is_offload_states` is set to False, the model states are not automatically offloaded to CPU after the RPC call completes, and the model stays on the GPU.
You can also call `worker.offload_states()` and `worker.reload_states()` directly to control exactly when offload and reload happen.
````
**File:** docs_roll/docs/User Guides/Configuration/device_mapping.md (L78-90)

````markdown
## Worker Count Calculation
The number of workers (`world_size`) is automatically calculated based on the `device_mapping` and `num_gpus_per_worker` parameters:
```python
world_size = len(device_mapping) // num_gpus_per_worker
```
In the `WorkerConfig.__post_init__()` method, if `device_mapping` is not None, the following logic is executed:
- Parse the string into a list through `eval(device_mapping)`
- Verify that `len(device_mapping)` is divisible by `num_gpus_per_worker`
- Calculate `world_size = len(device_mapping) // num_gpus_per_worker`
````