# Separated Deployment and Colocated Mode Explained

## Core Concepts

**Separated mode** and **colocated mode** are the two resource-deployment strategies in the ROLL framework. The key difference is whether different roles share GPU resources.
## Mode Definition and Detection

### 1. Colocation check logic

The framework uses the `is_colocated` function to decide whether two workers are colocated [1]:
```python
def is_colocated(actor_train: WorkerConfig, actor_infer: WorkerConfig):
    train_devices = set(actor_train.device_mapping or [])
    infer_devices = set(actor_infer.device_mapping or [])
    if train_devices.issuperset(infer_devices):
        return True
    # Partial overlap also returns False
    return False
```
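To make the possible outcomes concrete, here is a minimal sketch that exercises the same check with lightweight stand-in objects; the real framework passes `WorkerConfig` instances, so the stand-ins below are purely illustrative:

```python
# Minimal sketch (not framework code): the stand-ins carry only the
# device_mapping attribute that is_colocated reads; the check is re-stated
# here so the example runs standalone.
from types import SimpleNamespace


def is_colocated(actor_train, actor_infer):
    train_devices = set(actor_train.device_mapping or [])
    infer_devices = set(actor_infer.device_mapping or [])
    # Disjoint or partially overlapping device sets both yield False.
    return train_devices.issuperset(infer_devices)


train = SimpleNamespace(device_mapping=list(range(0, 8)))

# Infer devices are a subset of train devices -> colocated
print(is_colocated(train, SimpleNamespace(device_mapping=list(range(0, 8)))))   # True

# Disjoint device sets -> separated
print(is_colocated(train, SimpleNamespace(device_mapping=list(range(8, 16)))))  # False

# Partial overlap -> also reported as not colocated (see the TODO in the source)
print(is_colocated(train, SimpleNamespace(device_mapping=list(range(4, 12)))))  # False
```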
### 2. Pipeline-level check

`PPOConfig` exposes a convenience property [2]:
```python
@property
def is_train_infer_colocated(self) -> bool:
    """Whether actor_train and actor_infer are colocated."""
    return is_colocated(self.actor_train, self.actor_infer)
```
## Configuration Comparison

### Colocated mode configuration

Multiple roles share the same GPU resources [3]:
```yaml
actor_train:
  device_mapping: list(range(0,8))
actor_infer:
  device_mapping: list(range(0,8))  # shared with actor_train
reference:
  device_mapping: list(range(0,8))  # shared with actor_train
```
### Separated mode configuration

Each role uses its own independent GPU resources [4]:
```yaml
actor_train:
  device_mapping: list(range(0,8))    # GPUs 0-7
actor_infer:
  device_mapping: list(range(8,16))   # GPUs 8-15
reference:
  device_mapping: list(range(16,24))  # GPUs 16-23
```
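As a sanity check on a separated-mode layout like the one above, the following hypothetical helper (not part of the framework) verifies that no two roles share a GPU; the `eval` on the mapping strings mirrors the parsing behavior described in the worker-count section below:

```python
# Illustrative sketch: verify that every role in a separated-mode config
# owns a disjoint set of GPUs. The mapping strings use the same format as
# the YAML above.
from itertools import combinations

role_mappings = {
    "actor_train": "list(range(0,8))",
    "actor_infer": "list(range(8,16))",
    "reference": "list(range(16,24))",
}

devices = {role: set(eval(mapping)) for role, mapping in role_mappings.items()}

for (role_a, devs_a), (role_b, devs_b) in combinations(devices.items(), 2):
    overlap = devs_a & devs_b
    assert not overlap, f"{role_a} and {role_b} share GPUs: {sorted(overlap)}"

print("separated mode: all roles use disjoint GPUs")
```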
## Implementation Differences

### 1. State management strategy

**State offloading in colocated mode:**

In the vLLM strategy, colocated mode automatically offloads model states to avoid memory conflicts [5]:
```python
def offload_states(self, include=None, non_blocking=False):
    if include is None or OffloadStateType.model_params in include:
        if self.is_model_in_gpu and self.worker.pipeline_config.is_train_infer_colocated:
            self.model.offload_states(self.sleep_level)
            self.is_model_in_gpu = False
```
**No state offloading needed in separated mode:**

Because each role has independent GPU resources, all roles can keep their model states resident on the GPU at the same time.
### 2. Asynchronous training support

Separated mode is the key to enabling asynchronous training [6]:
```yaml
# Example asynchronous-training configuration
actor_train:
  device_mapping: list(range(0,4))  # training uses GPUs 0-3
actor_infer:
  device_mapping: list(range(4,8))  # inference uses GPUs 4-7
  strategy_args:
    strategy_name: sglang
```
## Execution Flow Comparison

### Colocated mode execution flow

Time-shared execution: the Pipeline calls `load_states()` on actor_infer, runs `generate()`, then `offload_states()`; it then calls `load_states()` on actor_train, runs `train_step()`, and finishes with `offload_states()`.
### Separated mode execution flow

Parallel execution: `generate()` on actor_infer and `train_step()` on actor_train proceed at the same time, with no resource contention and no offload/reload between them.
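The difference between the two flows can be summarized in a short, self-contained sketch; the `DummyWorker` class below only mimics the calls the flows make and is not the framework's worker API:

```python
# Simplified sketch of the two execution orders. DummyWorker is a stand-in
# that only prints what a real worker would do.
class DummyWorker:
    def __init__(self, name):
        self.name = name

    def reload_states(self):
        print(f"{self.name}: reload states to GPU")

    def offload_states(self):
        print(f"{self.name}: offload states to CPU")

    def generate(self, batch):
        print(f"{self.name}: generate()")
        return batch

    def train_step(self, batch):
        print(f"{self.name}: train_step()")
        return {"loss": 0.0}


def colocated_step(actor_infer, actor_train, batch):
    # Time-sharing the same GPUs: reload -> run -> offload for each role in turn.
    actor_infer.reload_states()
    rollouts = actor_infer.generate(batch)
    actor_infer.offload_states()

    actor_train.reload_states()
    metrics = actor_train.train_step(rollouts)
    actor_train.offload_states()
    return metrics


def separated_step(actor_infer, actor_train, batch, prev_rollouts):
    # Independent GPUs: generation and training could run concurrently
    # (shown sequentially here for simplicity); no offload/reload is needed.
    rollouts = actor_infer.generate(batch)
    metrics = actor_train.train_step(prev_rollouts)
    return metrics, rollouts


colocated_step(DummyWorker("actor_infer"), DummyWorker("actor_train"), batch=[])
```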
## Key Technical Details

### 1. GPU time-sharing

In colocated mode, the framework time-shares the GPUs by offloading and reloading model states [7]:
- **Automatic control**: states are reloaded and offloaded automatically around RPC calls
- **Manual control**: via `batch.meta_info["is_offload_states"]`
- **Context manager**: `state_offload_manager` simplifies the bookkeeping (see the sketch after this list)
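For intuition, here is an illustrative reload-run-offload context manager; it is a sketch of the pattern only, not the framework's actual `state_offload_manager` implementation, whose signature may differ:

```python
# Sketch of the reload -> run -> offload pattern used for GPU time-sharing.
# The worker argument is any object exposing reload_states()/offload_states().
from contextlib import contextmanager


@contextmanager
def state_offload_scope(worker, offload_after=True):
    worker.reload_states()  # bring model states back onto the GPU
    try:
        yield worker        # run the RPC / computation inside the scope
    finally:
        if offload_after:
            worker.offload_states()  # free the GPUs for the next colocated role


# Usage sketch:
# with state_offload_scope(actor_infer):
#     rollouts = actor_infer.generate(batch)
```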
### 2. Resource allocation calculation

The number of workers is computed automatically from `device_mapping` and `num_gpus_per_worker` [8]:
```python
world_size = len(device_mapping) // num_gpus_per_worker
```
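A minimal sketch of this calculation, assuming the string form of `device_mapping` used in the YAML examples above (parsed with `eval`, as the `device_mapping` documentation describes); `compute_world_size` is a hypothetical helper, not framework code:

```python
# Hypothetical helper illustrating the worker-count calculation.
def compute_world_size(device_mapping: str, num_gpus_per_worker: int = 1) -> int:
    devices = eval(device_mapping)  # e.g. "list(range(0,8))" -> [0, 1, ..., 7]
    assert len(devices) % num_gpus_per_worker == 0, (
        "len(device_mapping) must be divisible by num_gpus_per_worker"
    )
    return len(devices) // num_gpus_per_worker


print(compute_world_size("list(range(0,8))", num_gpus_per_worker=1))   # 8 workers
print(compute_world_size("list(range(8,16))", num_gpus_per_worker=2))  # 4 workers
```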
## Usage Recommendations

### Colocated mode is suited to

- **Resource-constrained environments**: maximize utilization when GPUs are scarce
- **Small-scale experiments**: quick prototype validation
- **Cost-sensitive scenarios**: reduce hardware investment

### Separated mode is suited to

- **Large-scale production**: high throughput and low latency requirements
- **Asynchronous training**: inference and training run in parallel
- **Multi-tenant environments**: isolated deployment of different services
## Notes

- Colocated mode raises resource utilization through time-sharing, at the cost of state-switching overhead
- Separated mode offers better performance and isolation, but requires more GPUs
- The framework detects the deployment mode automatically and applies the corresponding optimizations
- Asynchronous training requires separated deployment
## Citations
**File:** roll/configs/worker_config.py (L196-207)

```python
def is_colocated(actor_train: WorkerConfig, actor_infer: WorkerConfig):
    train_devices = set(actor_train.device_mapping or [])
    infer_devices = set(actor_infer.device_mapping or [])
    if train_devices.issuperset(infer_devices):
        return True
    if train_devices.intersection(infer_devices):
        # TODO: raise here
        # raise ValueError(
        #     f"train and infer share some devices, but train not cover infer. {train_devices=} {infer_devices=}"
        # )
        return False
    return False
```
**File:** roll/configs/base_config.py (L507-509)

```python
def is_train_infer_colocated(self) -> bool:
    """Whether actor_train and actor_infer are colocated."""
    return is_colocated(self.actor_train, self.actor_infer)
```
**File:** docs_roll/docs/Getting Started/FAQ/qa_issues.md (L17-30)

````markdown
### What is colocate mode?
In colocate mode, multiple roles (such as `actor_train`, `actor_infer`, `reference`) can reuse the same GPU devices in their `device_mapping`. For example:
```yaml
actor_train:
  device_mapping: list(range(0,8))
actor_infer:
  device_mapping: list(range(0,8))
reference:
  device_mapping: list(range(0,8))
```
The framework's underlying resource management mechanism ensures GPU reuse between multiple roles, improving resource utilization.
````
**File:** docs_roll/docs/Getting Started/FAQ/qa_issues.md (L32-45)

````markdown
### What is separate mode?
In separate mode, there is no intersection between different roles' `device_mapping`, and each role holds a set of independent GPU device resources. For example:
```yaml
actor_train:
  device_mapping: list(range(0,8))
actor_infer:
  device_mapping: list(range(8,16))
reference:
  device_mapping: list(range(16,24))
```
This approach can avoid resource competition between roles and improve system stability.
````
**File:** roll/distributed/strategy/vllm_strategy.py (L422-428)

```python
def offload_states(self, include=None, non_blocking=False):
    if include is None or OffloadStateType.model_params in include:
        if self.is_model_in_gpu and self.worker.pipeline_config.is_train_infer_colocated:
            self.model.offload_states(self.sleep_level)
            self.is_model_in_gpu = False
    gc.collect()
    current_platform.empty_cache()
```
**File:** docs_roll/docs/User Guides/Advanced Features/async_training.md (L129-168)

```yaml
actor_train:
  model_args:
    dtype: bf16
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 64
    warmup_steps: 1
  data_args:
    template: qwen2_5
    file_name:
      - data/math_deepmath_deal.jsonl
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 2
      pipeline_model_parallel_size: 1
      sequence_parallel: true
      use_distributed_optimizer: true
  device_mapping: list(range(0,16))
  infer_batch_size: 2
actor_infer:
  model_args:
    dtype: fp16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.85
      load_format: dummy
  device_mapping: list(range(16,24))
  infer_batch_size: 1
```
**File:** docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Advanced Features/offload_reload_control.md (L5-33)

````markdown
## Time-sharing overview
In the ROLL framework, different roles (such as actor_train, actor_infer, critic, reference, and rewards) may need to use the same GPU resources. To improve resource utilization, the framework implements GPU time-sharing, which allows model states to be switched between GPU and CPU at different points in time.
## Offload/Reload control mechanisms
### Automatic control
Taking RLVRPipeline as an example, the framework automatically manages model-state offload and reload:
```python
# Example from rlvr_pipeline.py
ref_log_probs = self.reference.compute_log_probs(batch, blocking=True)
```
By default, when an RPC call is issued to a worker, the framework first reloads that worker's GPU-related states onto the GPU and, after the call finishes, offloads the states back to host memory.
### Manual control
You can also intervene manually by setting `batch.meta_info["is_offload_states"]`:
```python
# Example from rlvr_pipeline.py
self.actor_train.offload_states(blocking=True)
```
When `is_offload_states` is set to False, the model states are not automatically offloaded to CPU after the RPC call completes, and the model stays on the GPU.
You can also call `worker.offload_states()` and `worker.reload_states()` directly to control exactly when offload and reload happen.
````
**File:** docs_roll/docs/User Guides/Configuration/device_mapping.md (L78-90)

````markdown
## Worker Count Calculation
The number of workers (`world_size`) is automatically calculated based on the `device_mapping` and `num_gpus_per_worker` parameters:
```python
world_size = len(device_mapping) // num_gpus_per_worker
```
In the `WorkerConfig.__post_init__()` method, if `device_mapping` is not None, the following logic is executed:
- Parse the string into a list through `eval(device_mapping)`
- Verify that `len(device_mapping)` is divisible by `num_gpus_per_worker`
- Calculate `world_size = len(device_mapping) // num_gpus_per_worker`
````