[Recommender Systems] Deep Learning Training Frameworks (Part 2): DDP Network Configuration in Spark Cluster Mode

DDP Network Configuration in Spark Cluster Mode

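The core problem

In Spark cluster mode, executors are allocated dynamically, which raises three questions:

DDP requires master_addr and master_port
How do we find out the executor's IP?
Will the port conflict with something else?

Key insight: all DDP processes of a training job run on the same executor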

Spark executor architecture

Spark Cluster
├── Executor 1 (assigned at random, IP unknown)
│   ├── Spark Task 1 → runs spark_train_ddp_wrapper.py
│   │   ├── Process 0 (DDP rank 0)
│   │   ├── Process 1 (DDP rank 1)
│   │   ├── Process 2 (DDP rank 2)
│   │   └── Process 3 (DDP rank 3)
│   └── all processes live on the same executor
│
├── Executor 2 (assigned at random, IP unknown)
│   └── Spark Task 2 → runs spark_train_ddp_wrapper.py
│       ├── Process 0 (DDP rank 0)
│       ├── Process 1 (DDP rank 1)
│       ├── Process 2 (DDP rank 2)
│       └── Process 3 (DDP rank 3)
│
└── Executor 3 ...

Key point: the DDP processes on each executor form an independent training instance, so executors never need to talk to each other.

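Why can we use localhost?

DDP communication inside a single executor

Inside a single executor, all DDP processes:

Run on the same machine (the same executor)
Communicate over the loopback interface (127.0.0.1 / localhost)
Do not need the executor's external IP

Inside the executor (IP = 10.0.0.5, but we do not need to know it)
├── Process 0 → connects to localhost:23456
├── Process 1 → connects to localhost:23456
├── Process 2 → connects to localhost:23456
└── Process 3 → connects to localhost:23456
↑
communication over the loopback interface
(127.0.0.1)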
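
To make the loopback idea concrete, here is a minimal sketch using only the standard library. The helper name `loopback_echo` is made up for illustration, and a plain TCP echo stands in for the DDP rendezvous. The only point is that two endpoints on one machine can talk over 127.0.0.1 without knowing the machine's external IP.

```python
import socket
import threading


def loopback_echo(message: bytes) -> bytes:
    """Send `message` to a server on 127.0.0.1 and return the echoed reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve_once():
        # Accept one connection and echo back whatever arrives
        conn, _ = server.accept()
        with conn:
            conn.sendall(conn.recv(1024))

    worker = threading.Thread(target=serve_once, daemon=True)
    worker.start()
    try:
        # The client only needs "127.0.0.1"; no external IP is involved
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(message)
            reply = client.recv(1024)
    finally:
        worker.join(timeout=5)
        server.close()
    return reply
```

Calling `loopback_echo(b"rank 1 ready")` returns the same bytes it was given, regardless of which host the code runs on.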

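Port selection strategy

Although executors are allocated dynamically:

Low probability of conflict
- An uncommon port (23456) is used
- Executors run in isolated environments
Each executor trains independently
- Executor 1 runs training A (port 23456)
- Executor 2 runs training B (port 23456)
- They do not interfere with each other (different containers)
Isolation guarantee
- When each executor runs in its own container, it has its own network namespace (for example one pod per executor on Kubernetes)
- localhost:23456 is then only valid inside that executor
- No conflict occurs in that setup; executors that share a host's network namespace (typical on YARN or standalone clusters) also share one loopback interface, so a fixed port can collide there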

How it works in detail

Startup flow

python

import subprocess
import sys

# spark_train_ddp_wrapper.py runs on the executor
def main():
    # 1. Spark ships this script to some executor
    # 2. What is the executor's IP? We do not know, and we do not need to

    torchrun_cmd = [
        sys.executable,
        '-m', 'torch.distributed.run',
        '--nproc_per_node', '4',        # start 4 processes on the same executor
        '--nnodes', '1',                 # only 1 node (this executor)
        '--node_rank', '0',              # node rank = 0
        '--master_addr', 'localhost',    # loopback interface
        '--master_port', '23456',       # fixed port
        'spark_train.py'                 # the actual training script
    ]

    # 3. torchrun runs on the executor
    subprocess.run(torchrun_cmd)
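
One gap in the wrapper above is that `subprocess.run` does not raise when the child exits with an error, so a failed training run could still look like a successful Spark task. The sketch below shows one way to surface the failure. The helper name `run_torchrun` is hypothetical and not part of the original wrapper.

```python
import subprocess
import sys


def run_torchrun(cmd, timeout=None):
    """Run the launcher command and raise if it exits with a non-zero code."""
    completed = subprocess.run(cmd, timeout=timeout)
    if completed.returncode != 0:
        # Raising here makes the surrounding Spark task fail visibly
        raise RuntimeError(
            f"torchrun exited with code {completed.returncode}: {' '.join(cmd)}"
        )
    return completed.returncode
```

In the wrapper, `run_torchrun(torchrun_cmd)` would take the place of the bare `subprocess.run(torchrun_cmd)` call.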

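How torchrun works

When torchrun starts:

bash

# torchrun runs inside the executor
# Executor IP = 10.0.0.5 (hypothetical; we do not need to know it)

# Step 1: torchrun starts the master side
# Rank 0's rendezvous store listens on localhost:23456

# Step 2: torchrun starts the remaining processes
# Process 1 (rank 1) connects to localhost:23456
# Process 2 (rank 2) connects to localhost:23456
# Process 3 (rank 3) connects to localhost:23456

# All processes find each other through localhost
# ✅ No need to know the executor's external IP
# ✅ The port is only used inside the executor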
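
On the training-script side, torchrun passes this configuration to every worker through the environment variables `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT`. The sketch below shows how a script such as spark_train.py might read them. The `DdpEnv` dataclass and `read_ddp_env` helper are illustrative names, not taken from the original code.

```python
import os
from dataclasses import dataclass


@dataclass
class DdpEnv:
    rank: int          # global rank of this process
    local_rank: int    # rank within this node (here: this executor)
    world_size: int    # total number of processes in the job
    master_addr: str   # rendezvous host, "localhost" in this setup
    master_port: int   # rendezvous port


def read_ddp_env(env=None):
    """Parse the variables that torchrun sets for each worker process."""
    env = os.environ if env is None else env
    return DdpEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise outside a real torchrun launch.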

Actual network topology

┌─────────────────────────────────────────┐
│ Executor container (dynamically placed) │
│ IP: 10.0.0.5 (we do not need to know)   │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │ localhost:23456                   │  │
│  │                                   │  │
│  │  Process 0 (rank 0) ←─┐           │  │
│  │  Process 1 (rank 1) ←─┼──→ comms  │  │
│  │  Process 2 (rank 2) ←─┤           │  │
│  │  Process 3 (rank 3) ←─┘           │  │
│  └───────────────────────────────────┘  │
│                                         │
│  All communication stays inside the     │
│  container; no external network used    │
└─────────────────────────────────────────┘

Isolation between executors

Scenario: three executors run training at the same time

Spark Cluster
│
├─ Executor 1 (random IP, e.g. 10.0.0.5)
│  └─ Training A
│     ├─ Process 0 connects to localhost:23456
│     ├─ Process 1 connects to localhost:23456
│     ├─ Process 2 connects to localhost:23456
│     └─ Process 3 connects to localhost:23456
│     ✅ Port 23456 is only used inside Executor 1
│
├─ Executor 2 (random IP, e.g. 10.0.0.6)
│  └─ Training B
│     ├─ Process 0 connects to localhost:23456
│     ├─ Process 1 connects to localhost:23456
│     ├─ Process 2 connects to localhost:23456
│     └─ Process 3 connects to localhost:23456
│     ✅ Port 23456 is only used inside Executor 2
│
└─ Executor 3 (random IP, e.g. 10.0.0.7)
   └─ Training C
      ├─ Process 0 connects to localhost:23456
      ├─ Process 1 connects to localhost:23456
      ├─ Process 2 connects to localhost:23456
      └─ Process 3 connects to localhost:23456
      ✅ Port 23456 is only used inside Executor 3

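Why is there no conflict?

Network isolation: each executor has its own network namespace, provided it runs in its own container (for example one pod per executor on Kubernetes)
Scope of localhost: localhost is only valid inside that executor
Port independence: port 23456 on one executor does not interfere with port 23456 on another
Caveat: executors that share a host without network isolation (typical on YARN and standalone clusters) share one loopback interface, so a fixed port can collide there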
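Difference from multi-node training

Multi-node training (the master IP must be known)

python

# Node 0: master node
torchrun_cmd = [
    '--nnodes', '2',
    '--node_rank', '0',
    '--master_addr', '10.0.0.100',  # the master node's real IP
    '--master_port', '23456',
]

# Node 1: worker node
torchrun_cmd = [
    '--nnodes', '2',
    '--node_rank', '1',
    '--master_addr', '10.0.0.100',  # connect to the master node
    '--master_port', '23456',
]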

Why is the IP needed?

The nodes are on different machines
They must connect over the network
Every node must know the master's IP address
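
For contrast with the single-executor case, a complete multi-node command could be assembled as in the sketch below. The function name `build_multinode_cmd` and its parameters are illustrative assumptions. Only the flags themselves (`--nnodes`, `--node_rank`, `--master_addr`, `--master_port`, `--nproc_per_node`) are the ones already used in this article.

```python
import sys


def build_multinode_cmd(script, master_ip, node_rank, nnodes,
                        nproc_per_node, master_port=23456):
    """Build the torchrun command for one node of a multi-node job."""
    return [
        sys.executable, "-m", "torch.distributed.run",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),    # differs on every node
        "--master_addr", master_ip,       # identical on every node
        "--master_port", str(master_port),
        script,
    ]
```

Every node runs the same command except for `--node_rank`, which is why the master's address has to be known up front in this mode.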

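Single node (our scenario)

python

torchrun_cmd = [
    '--master_addr', 'localhost',  # loopback
    '--master_port', '23456',
]

Why is the IP not needed?

All processes are on the same machine (the executor)
They communicate over the loopback interface
No external IP address is required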

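Port conflicts in practice

What can happen

There is a theoretical risk of conflict, but in practice:

Case 1: inside a single executor

bash

# No conflict: one training job, so only one listener on the port
python train.py  # uses port 23456

Case 2: different executors

bash

# No conflict: different containers
Executor A: port 23456  # inside container A
Executor B: port 23456  # inside container B, no interference

Case 3: different processes on the same machine

bash

# Possible conflict: separate processes sharing one machine's network stack
Process A: uses port 23456
Process B: uses port 23456  # ❌ conflict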
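
Case 3 is easy to reproduce locally. The sketch below (the helper name `second_bind_fails` is made up for illustration) opens a listener on a loopback port and then tries to bind a second socket to the same port. This is essentially what happens when two training jobs sharing a network namespace both ask for port 23456.

```python
import errno
import socket


def second_bind_fails(host="127.0.0.1"):
    """Return True if binding an already-used port raises EADDRINUSE."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))      # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same port, same network namespace
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()
```

Inside separate containers the two sockets would live in different network namespaces, and the second bind would succeed.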

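Solution: let torchrun pick the port

python

# Do not pin a port. Omitting --master_port alone is not enough, because
# torchrun then falls back to its default port 29500. A c10d rendezvous
# endpoint with port 0 asks for a free port instead (recent PyTorch releases).
torchrun_cmd = [
    sys.executable,
    '-m', 'torch.distributed.run',
    '--nproc_per_node', str(num_processes),
    '--nnodes', '1',
    '--rdzv_backend', 'c10d',
    '--rdzv_endpoint', 'localhost:0',  # port 0 = any free port
    spark_train_script
]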

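Best practices

Option 1: fixed port (current implementation)

python

'--master_addr', 'localhost',
'--master_port', '23456',

Pros:

Simple and explicit
Easy to debug
Clear logs

Cons:

The port can collide when several jobs share a host
You must make sure the port is available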

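Option 2: probe for a free port (recommended)

python

import socket

def find_available_port(start=23456, attempts=100):
    """Return the first bindable port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        # Use a fresh socket per attempt; the with-block always closes it
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(('', port))
                return port
            except OSError:
                continue  # port is taken, try the next one
    return None

# Usage. The port is released before torchrun binds it, so a small race remains.
port = find_available_port(23456)
if port is None:
    raise RuntimeError('no free port found in the probed range')
torchrun_cmd = [
    '--master_addr', 'localhost',
    '--master_port', str(port),
]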
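
A shorter alternative to scanning a range is to bind to port 0 and let the operating system choose. This is a sketch under the same caveat as above: the port is released before torchrun claims it, so another process could in principle grab it in between. The helper name `pick_free_port` is hypothetical.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))            # port 0: the OS assigns a free port
        return sock.getsockname()[1]    # read back the assigned port
```

The result can be passed on as `'--master_port', str(pick_free_port())`.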

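Option 3: let torchrun handle it (simplest)

python

# Let torchrun choose a free port through the c10d rendezvous backend.
# Merely dropping --master_port would fall back to the default port 29500.
torchrun_cmd = [
    '--rdzv_backend', 'c10d',
    '--rdzv_endpoint', 'localhost:0',  # port 0 = any free port
    spark_train_script
]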

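Summary

You do not need to know the executor's IP!

Reasons:

✅ All DDP processes run on the same executor
✅ They communicate over localhost (loopback)
✅ The executor's IP is irrelevant
✅ Each executor's localhost is independent when executors are network-isolated

Port selection

Current configuration:

python

'--master_addr', 'localhost',  # ✅ correct
'--master_port', '23456',      # ✅ usually available

Why it works:

localhost resolves inside the executor
Port 23456 is used inside the executor
Different executors do not interfere with each other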

If the port conflicts

How to handle it:

python

# Change the port
'--master_port', '23457'

# Or let torchrun pick a free port
'--rdzv_backend', 'c10d', '--rdzv_endpoint', 'localhost:0'

In practice

Current code (spark_train_ddp_wrapper.py)

python

torchrun_cmd = [
    sys.executable,
    '-m', 'torch.distributed.run',
    '--nproc_per_node', str(num_processes),
    '--nnodes', '1',
    '--node_rank', '0',
    '--master_addr', 'localhost',    # ✅ keep this
    '--master_port', '23456',       # ✅ keep this (usually available)
    spark_train_script
]

This configuration is correct because:

✅ All processes are on the same executor
✅ They communicate over localhost
✅ The executor's IP is not needed
✅ The port is used inside the executor and does not conflict across isolated executors

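If you do hit a port conflict

Change it to:

python

'--master_port', '23457',  # or another port

Or let the system assign one:

python

# Replace --master_addr/--master_port with a port-0 rendezvous endpoint;
# dropping --master_port alone would fall back to the default port 29500
torchrun_cmd = [
    sys.executable,
    '-m', 'torch.distributed.run',
    '--nproc_per_node', str(num_processes),
    '--nnodes', '1',
    '--rdzv_backend', 'c10d',
    '--rdzv_endpoint', 'localhost:0',  # port 0 = any free port
    spark_train_script
]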
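
The fixed-port and automatic-port variants can be folded into one helper, sketched below. The name `build_single_node_cmd` is an assumption made for illustration. The automatic branch relies on the `--rdzv_backend c10d --rdzv_endpoint localhost:0` form described in the torchrun documentation, so it is worth confirming that the installed PyTorch version supports it.

```python
import sys


def build_single_node_cmd(script, nproc_per_node, master_port=None):
    """Build a single-node torchrun command.

    With master_port=None, torchrun is asked to pick a free port itself;
    otherwise the given fixed port is used on localhost.
    """
    cmd = [
        sys.executable, "-m", "torch.distributed.run",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", "1",
    ]
    if master_port is None:
        # Port 0 in the rendezvous endpoint means "any free port"
        cmd += ["--rdzv_backend", "c10d", "--rdzv_endpoint", "localhost:0"]
    else:
        cmd += [
            "--node_rank", "0",
            "--master_addr", "localhost",
            "--master_port", str(master_port),
        ]
    cmd.append(script)
    return cmd
```

For example, `build_single_node_cmd('spark_train.py', 4)` yields the automatic-port command, while passing `master_port=23456` reproduces the current fixed-port configuration.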

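Key takeaways

No executor IP needed: localhost is enough
Port independence: ports on different, network-isolated executors do not interfere
Local communication: all DDP communication stays inside the executor
Simple configuration: localhost plus a fixed port works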