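Artificial Intelligence [Part 36] Introduction to Multi-Agent Reinforcement Learning: Cooperation and Competition Among Multiple AIs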

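Author's note: In the previous articles we studied single-agent reinforcement learning, in which one AI learns an optimal policy in one environment. Many real-world problems, however, involve multiple agents that make decisions and interact at the same time: autonomous vehicle fleets, cooperating robots, competitive games, economic markets, and so on. This is the domain of **Multi-Agent Reinforcement Learning (MARL)**. This article walks through the core concepts of MARL and its classic algorithms, then implements a cooperative multi-agent scenario.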

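1. Why Do We Need Multi-Agent Reinforcement Learning?

1.1 Limitations of the Single-Agent Setting

Recall the assumptions behind single-agent RL:

The environment's dynamics are stationary and predictable
The agent is the only decision-maker in the environment
Changes in the environment are driven entirely by the agent's own actions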

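Real-world problems rarely fit these assumptions:

Application	Participants	Interaction type
Autonomous driving	Multiple vehicles, pedestrians	Cooperative + competitive
Robot soccer	Two teams of robots	Competitive between teams, cooperative within a team
Resource allocation	Multiple users	Competitive
Drone formation flight	Multiple drones	Cooperative
Game matches	Multiple players	Cooperative / competitive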

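1.2 Characteristics of Multi-Agent Systems

Property	Single-agent	Multi-agent
Environment dynamics	Stationary	Non-stationary
State space	One agent's state	Joint state space (grows exponentially with the number of agents)
Action space	One agent's actions	Joint action space
Learning difficulty	Relatively easy	Very challenging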

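1.3 Types of Multi-Agent Systems

By interaction type:

Fully cooperative: all agents share the same reward function. Examples: drone formation flight, cooperative transport
Fully competitive: a zero-sum game. Examples: Go, poker, soccer matches
Mixed: both cooperation and competition. Examples: autonomous driving, market economies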

2. Core Concepts of MARL

2.1 Stochastic Games

A stochastic game extends the single-agent MDP to multiple agents:

G = (N, S, {A^i}, P, {R^i}, γ)

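where:
- N = {1, 2, ..., n}: the set of agents
- S: the global state space
- A^i: the action space of agent i
- P(s'|s, a): the joint transition probability, with joint action a = (a^1, ..., a^n)
- R^i(s, a): the reward function of agent i
- γ: the discount factor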
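
To make the tuple concrete, here is a minimal sketch of a stochastic game as a Python data structure. The `StochasticGame` class, its field names, and the one-state, two-agent coordination game used to fill it are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[int, ...]


@dataclass
class StochasticGame:
    """Hypothetical container for G = (N, S, {A^i}, P, {R^i}, gamma)."""
    n_agents: int                                                # |N|
    states: List[int]                                            # S
    action_spaces: List[List[int]]                               # A^i for each agent i
    transition: Callable[[int, JointAction], Dict[int, float]]   # P(s'|s, a)
    rewards: Callable[[int, JointAction], List[float]]           # R^i(s, a) for all i
    gamma: float                                                 # discount factor


def coordination_transition(state: int, joint_action: JointAction) -> Dict[int, float]:
    # Single-state game: the next state is always state 0.
    return {0: 1.0}


def coordination_rewards(state: int, joint_action: JointAction) -> List[float]:
    # Both agents receive 1 when they pick the same action, otherwise 0.
    r = 1.0 if joint_action[0] == joint_action[1] else 0.0
    return [r, r]


game = StochasticGame(
    n_agents=2,
    states=[0],
    action_spaces=[[0, 1], [0, 1]],
    transition=coordination_transition,
    rewards=coordination_rewards,
    gamma=0.95,
)

print(game.rewards(0, (1, 1)))  # matching actions
print(game.rewards(0, (0, 1)))  # mismatched actions
```

Note that the reward of each agent depends on the joint action, not only on its own action; this is the main difference from a single-agent MDP.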

2.2 Nash Equilibrium

Definition: a strategy profile in which no agent can obtain a higher payoff by unilaterally changing its own strategy.
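
The definition can be checked by brute force in a small two-player matrix game. The sketch below is an illustration only: the function name `pure_nash_equilibria` and the prisoner's dilemma payoffs are assumptions chosen for the example, and it considers pure strategies only.

```python
from itertools import product


def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all pure-strategy Nash equilibria (row, col) of a two-player game."""
    n_rows, n_cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for r, c in product(range(n_rows), range(n_cols)):
        # Player A cannot gain by switching rows while B keeps column c.
        a_best = all(payoff_a[r][c] >= payoff_a[r2][c] for r2 in range(n_rows))
        # Player B cannot gain by switching columns while A keeps row r.
        b_best = all(payoff_b[r][c] >= payoff_b[r][c2] for c2 in range(n_cols))
        if a_best and b_best:
            equilibria.append((r, c))
    return equilibria


# Prisoner's dilemma: action 0 = cooperate, action 1 = defect.
pd_a = [[-1, -3], [0, -2]]   # payoffs of the row player
pd_b = [[-1, 0], [-3, -2]]   # payoffs of the column player
print(pure_nash_equilibria(pd_a, pd_b))  # → [(1, 1)]
```

Mutual defection is the only equilibrium here even though mutual cooperation pays both players more, which is why reaching an equilibrium and reaching a good joint outcome are different goals in MARL.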

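2.3 Three Learning Paradigms

Paradigm	Description	Trade-offs
Independent learning	Each agent runs a single-agent algorithm on its own	Simple, but non-stationary
Joint learning	All agents are treated as one "super agent"	Optimal in principle, but suffers from the curse of dimensionality
CTDE	Centralized training with decentralized execution	Balances performance and scalability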
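
The cost of joint learning is easy to quantify for tabular methods. The following sketch counts Q-table entries under two simplifying assumptions: all agents share one state space of 100 states, and each agent has 5 actions. The helper names are hypothetical.

```python
def independent_q_entries(n_agents, n_states, n_actions):
    """Total entries when each agent keeps its own |S| x |A| table."""
    return n_agents * n_states * n_actions


def joint_q_entries(n_agents, n_states, n_actions):
    """Entries of one table over the joint action space, which has |A|^n actions."""
    return n_states * n_actions ** n_agents


for n in (1, 2, 3, 5):
    print(n, independent_q_entries(n, 100, 5), joint_q_entries(n, 100, 5))
```

With three agents the joint table already needs 12,500 entries against 1,500 for independent tables, and the gap widens by a factor of 5 for every agent added. In practice the joint state space grows as well, so this count understates the problem.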

3. Classic MARL Algorithms

3.1 Independent Q-Learning (IQL)

import numpy as np


class IndependentQLearning:
    """Independent Q-learning: every agent learns on its own."""
    def __init__(self, n_agents, n_states, n_actions, alpha=0.1,
                 gamma=0.95, epsilon=0.1):
        self.n_agents = n_agents
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability

        # Each agent has its own Q-table
        self.q_tables = [
            np.zeros((n_states, n_actions))
            for _ in range(n_agents)
        ]

    def select_actions(self, states):
        """Select an epsilon-greedy action for every agent."""
        actions = []
        for i, state in enumerate(states):
            if np.random.random() < self.epsilon:
                # Explore: pick a random action
                action = np.random.randint(self.q_tables[i].shape[1])
            else:
                # Exploit: pick the action with the highest Q-value
                action = np.argmax(self.q_tables[i][state])
            actions.append(action)
        return actions

    def update(self, states, actions, rewards, next_states, dones):
        """Apply the Q-learning update to every agent's Q-table."""
        for i in range(self.n_agents):
            s, a, r, s_next = states[i], actions[i], rewards[i], next_states[i]

            current_q = self.q_tables[i][s, a]
            # No bootstrapping from terminal states
            max_next_q = np.max(self.q_tables[i][s_next]) if not dones[i] else 0.0

            self.q_tables[i][s, a] = current_q + self.alpha * (
                r + self.gamma * max_next_q - current_q
            )

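Limitations of IQL:

Non-stationarity: as the other agents' policies change, the environment each agent faces changes too
Unstable learning: convergence to an equilibrium is hard to achieve
Weak cooperation: agents struggle to learn coordinated strategies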
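
Non-stationarity can be seen without any learning at all. In the sketch below, which assumes a 2x2 coordination game with made-up payoffs and hypothetical helper names, the value of each of agent 1's actions depends on what agent 2 is currently doing, so the "environment" agent 1 learns from shifts whenever agent 2 updates its policy.

```python
# Payoff to agent 1: rows are agent 1's actions, columns are agent 2's actions.
PAYOFF = [[1.0, 0.0],
          [0.0, 1.0]]


def expected_rewards(other_policy):
    """Expected reward of each of agent 1's actions given agent 2's action probabilities."""
    return [sum(p * PAYOFF[a][b] for b, p in enumerate(other_policy))
            for a in range(len(PAYOFF))]


def best_action(other_policy):
    """Agent 1's best response to agent 2's current policy."""
    values = expected_rewards(other_policy)
    return max(range(len(values)), key=values.__getitem__)


early = [0.9, 0.1]  # agent 2 mostly plays action 0 early in training
late = [0.1, 0.9]   # later, agent 2 has drifted towards action 1

print(expected_rewards(early), best_action(early))
print(expected_rewards(late), best_action(late))
```

Agent 1's best response flips from action 0 to action 1 although the game itself never changed. Q-values learned against the early policy are therefore stale by the time the other agent has moved on.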

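3.2 MADDPG: Multi-Agent DDPG

Source: OpenAI, 2017

Core idea: extend the actor-critic architecture to multiple agents

Key innovations:

The critic can use global information during training (the observations and actions of the other agents)
The actor uses only its own local information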

```python
import torch
import torch.nn as nn
import torch.optim as optim


class Actor(nn.Module):
    """Actor network: uses only the agent's local observation."""
    def __init__(self, obs_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh()  # keeps actions in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)


class Critic(nn.Module):
    """Critic network: uses global information from all agents."""
    def __init__(self, total_obs_dim, total_action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # scalar Q-value
        )

    def forward(self, all_obs, all_actions):
        # Concatenate every agent's observation and action along the feature axis
        x = torch.cat([all_obs, all_actions], dim=-1)
        return self.net(x)
```
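
The difference between the two networks is only in what they are fed. As a rough illustration in plain Python (no PyTorch needed), the sketch below assembles both inputs for three agents; the helper names and the sizes (4-dimensional observations, 2-dimensional actions) are assumptions made for the example.

```python
def actor_input(observations, agent_id):
    """Decentralized actor: sees only its own observation."""
    return list(observations[agent_id])


def critic_input(observations, actions):
    """Centralized critic: concatenates every agent's observation and action."""
    flat = []
    for obs in observations:
        flat.extend(obs)
    for act in actions:
        flat.extend(act)
    return flat


# Three agents, each with a 4-dim observation and a 2-dim continuous action (assumed sizes).
observations = [[0.1 * i] * 4 for i in range(3)]
actions = [[0.5, -0.5] for _ in range(3)]

print(len(actor_input(observations, 0)))         # 4, matches obs_dim
print(len(critic_input(observations, actions)))  # 18, matches total_obs_dim + total_action_dim
```

Because the critic is needed only during training, its larger input does not affect execution: at run time each agent needs nothing beyond its own observation.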

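3.3 QMIX: Value Decomposition

Source: Rashid et al. (University of Oxford), 2018

Use case: fully cooperative tasks (all agents share one reward)

Core idea:

Learn a joint action-value function Q_tot

Decompose Q_tot into a monotonic combination of the per-agent Q-values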

class QMIXMixer(nn.Module):
    """QMIX mixing network."""
    def __init__(self, n_agents, state_dim, hidden_dim=32):
        super().__init__()
        self.n_agents = n_agents

        # Hypernetworks generate the mixing weights from the global state
        self.hyper_w1 = nn.Linear(state_dim, n_agents * hidden_dim)
        self.hyper_b1 = nn.Linear(state_dim, hidden_dim)
        self.hyper_w2 = nn.Linear(state_dim, hidden_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, agent_qs, states):
        """agent_qs: [batch, n_agents], states: [batch, state_dim]"""
        batch_size = agent_qs.size(0)

        # First layer (absolute value keeps the weights non-negative, hence monotonic)
        w1 = torch.abs(self.hyper_w1(states))
        w1 = w1.view(batch_size, self.n_agents, -1)
        b1 = self.hyper_b1(states).view(batch_size, 1, -1)

        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)

        # Second layer
        w2 = torch.abs(self.hyper_w2(states)).view(batch_size, -1, 1)
        b2 = self.hyper_b2(states).view(batch_size, 1, 1)

        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.squeeze(1)  # shape: [batch, 1]
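
Why monotonicity matters: if Q_tot never decreases when any single agent's Q-value increases, then each agent picking its own greedy action also maximizes Q_tot, so execution can stay decentralized. The pure-Python sketch below checks this by brute force. Everything in it is an assumption for illustration: the fixed weights, the per-agent Q-values, and the use of ELU in place of the ReLU in the listing above so that the mixing is strictly increasing and ties cannot occur.

```python
import math
from itertools import product


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Two-layer mixer; abs() keeps all weights non-negative, so Q_tot rises with every agent Q."""
    hidden = [
        elu(sum(abs(w1[i][j]) * agent_qs[i] for i in range(len(agent_qs))) + b1[j])
        for j in range(len(b1))
    ]
    return sum(abs(w2[j]) * hidden[j] for j in range(len(hidden))) + b2


# Made-up mixer parameters (in QMIX these come from hypernetworks conditioned on the state).
W1 = [[0.5, -1.2], [-0.8, 0.3]]   # shape: n_agents x hidden
B1 = [0.1, -0.2]
W2 = [1.0, -0.6]
B2 = 0.05

# Made-up per-agent Q-values: q_tables[i][a] is agent i's value for action a.
q_tables = [[0.2, 1.5, -0.3], [0.7, -1.0, 0.1]]

# Decentralized choice: every agent maximizes its own Q-value.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)

# Centralized choice: search all joint actions for the best Q_tot.
joint_best = max(
    product(*(range(len(q)) for q in q_tables)),
    key=lambda joint: monotonic_mix(
        [q_tables[i][a] for i, a in enumerate(joint)], W1, B1, W2, B2
    ),
)

print(greedy, joint_best)  # both are (1, 0)
```

The search over joint actions grows exponentially with the number of agents, while the greedy choice costs one argmax per agent; the monotonic constraint is what makes the cheap choice agree with the expensive one.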

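4. Comparing MARL Algorithms

Algorithm	Learning paradigm	Use case	Notes
IQL	Independent learning	Simple scenarios	Simple, but non-stationary
MADDPG	CTDE	Continuous actions	Works for mixed games
QMIX	CTDE	Discrete actions, cooperative	Value decomposition, scalable
COMA	CTDE	Cooperative	Addresses credit assignment
MAPPO	CTDE	Large-scale	Multi-agent version of PPO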

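5. Hands-On Project: Multi-Agent Predator-Prey

5.1 Environment Design

Scenario:

Prey: 1 agent, fast; the episode ends when it is captured
Predators: 3 agents, slow; they must cooperate to surround the prey

Rules:

At least two predators are adjacent to the prey at the same time → successful capture, positive reward

Time-step penalty: encourages fast capture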

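import numpy as np
import matplotlib.pyplot as plt


class PredatorPreyEnv:
    """Multi-agent predator-prey environment."""
    def __init__(self, grid_size=10, n_predators=3, max_steps=50):
        self.grid_size = grid_size
        self.n_predators = n_predators
        self.n_agents = n_predators + 1  # predators first, prey last
        self.max_steps = max_steps
        self.n_actions = 5  # 0=up, 1=down, 2=left, 3=right, 4=stay
        self.reset()

    def reset(self):
        self.steps = 0
        self.positions = np.random.randint(0, self.grid_size,
                                           size=(self.n_agents, 2))
        return self.get_observations()

    def get_observations(self):
        # Assumed helper: every agent observes all positions (full observability).
        return [self.positions.flatten().copy() for _ in range(self.n_agents)]

    def compute_rewards(self):
        # Assumed values: shared +10 for a capture, -0.1 per step otherwise;
        # the prey receives the negated predator reward.
        r = 10.0 if self.check_capture() else -0.1
        return [r] * self.n_predators + [-r]

    def step(self, actions):
        self.steps += 1

        # Map action indices to grid moves
        moves = {
            0: np.array([-1, 0]), 1: np.array([1, 0]),
            2: np.array([0, -1]), 3: np.array([0, 1]),
            4: np.array([0, 0])
        }

        # Predators are slow: each move succeeds with probability 0.8
        for i in range(self.n_predators):
            if np.random.random() < 0.8:
                self.positions[i] = np.clip(
                    self.positions[i] + moves[actions[i]],
                    0, self.grid_size - 1
                )

        # The prey moves randomly every step (the last entry of `actions` is not used)
        prey_action = np.random.randint(0, 5)
        self.positions[-1] = np.clip(
            self.positions[-1] + moves[prey_action],
            0, self.grid_size - 1
        )

        rewards = self.compute_rewards()
        caught = self.check_capture()
        done = caught or self.steps >= self.max_steps

        return self.get_observations(), rewards, done, {'caught': caught}

    def check_capture(self):
        # Captured when at least two predators are within distance 1.5 of the prey
        prey_pos = self.positions[-1]
        predators_near = sum([
            np.linalg.norm(self.positions[i] - prey_pos) < 1.5
            for i in range(self.n_predators)
        ])
        return predators_near >= 2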
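
The capture test uses a Euclidean distance threshold of 1.5, which on an integer grid means "in the prey's cell or one of its eight neighbours". The standalone sketch below spells that out; the function names are hypothetical and the logic mirrors `check_capture` without depending on the environment class.

```python
import math


def capture_offsets(radius=1.5):
    """Grid offsets (dx, dy) whose Euclidean norm is below the capture radius."""
    return [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)
            if math.hypot(dx, dy) < radius]


def is_captured(predators, prey, radius=1.5, required=2):
    """True when at least `required` predators are within `radius` of the prey."""
    near = sum(math.hypot(px - prey[0], py - prey[1]) < radius
               for px, py in predators)
    return near >= required


print(len(capture_offsets()))                         # 9: the prey's cell plus its 8 neighbours
print(is_captured([(4, 5), (6, 6), (0, 0)], (5, 5)))  # True: two predators are adjacent
print(is_captured([(4, 5), (7, 7), (0, 0)], (5, 5)))  # False: only one predator is adjacent
```

Diagonal neighbours count because their distance is about 1.41, just under the threshold. Requiring two predators is what makes the task cooperative: a single fast chaser can never end the episode alone.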

5.2 Training with IQL

class PredatorPreyTrainer:
    def __init__(self):
        self.env = PredatorPreyEnv(grid_size=10, n_predators=3)
        self.agents = IndependentQLearning(
            n_agents=self.env.n_predators,
            n_states=100, n_actions=5,  # 10 x 10 grid cells, 5 moves
            alpha=0.1, gamma=0.95, epsilon=0.3
        )
        self.episode_rewards = []
        self.capture_rates = []

    def pos_to_state(self, pos):
        """Encode a (row, col) position as a Q-table row index (assumed row-major encoding)."""
        return int(pos[0]) * self.env.grid_size + int(pos[1])

    def train(self, n_episodes=10000):
        n_captures = 0

        for episode in range(n_episodes):
            observations = self.env.reset()
            episode_reward = np.zeros(self.env.n_predators)
            done = False

            while not done:
                # Select actions
                predator_states = [self.pos_to_state(self.env.positions[i])
                                   for i in range(self.env.n_predators)]
                predator_actions = self.agents.select_actions(predator_states)
                prey_action = np.random.randint(0, 5)
                actions = predator_actions + [prey_action]

                # Step the environment
                next_obs, rewards, done, info = self.env.step(actions)

                # Update the Q-tables
                next_states = [self.pos_to_state(self.env.positions[i])
                               for i in range(self.env.n_predators)]
                predator_rewards = [rewards[i] for i in range(self.env.n_predators)]

                self.agents.update(predator_states, predator_actions,
                                   predator_rewards, next_states,
                                   [done] * self.env.n_predators)

                episode_reward += np.array(predator_rewards)
                if done and info['caught']:
                    n_captures += 1

            self.episode_rewards.append(episode_reward.mean())
            # Decay exploration; the 0.9995 factor and 0.05 floor are assumed values
            self.agents.epsilon = max(0.05, self.agents.epsilon * 0.9995)

            # Report every 500 episodes
            if (episode + 1) % 500 == 0:
                capture_rate = n_captures / 500
                self.capture_rates.append(capture_rate)
                avg_reward = np.mean(self.episode_rewards[-500:])
                print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}, "
                      f"Capture Rate: {capture_rate:.2%}")
                n_captures = 0
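
The trainer maps each predator's grid position to one of `n_states=100` table rows via `pos_to_state`. Below is a minimal sketch of such a row-major encoding, assuming a 10x10 grid, together with a hypothetical `joint_state` variant that also encodes the prey's cell. The variant is not part of the original design; it is shown only to make the size trade-off visible.

```python
def pos_to_state(pos, grid_size=10):
    """Encode one (row, col) cell as a single index in [0, grid_size**2)."""
    row, col = pos
    return row * grid_size + col


def joint_state(predator_pos, prey_pos, grid_size=10):
    """Hypothetical richer encoding: the predator's cell combined with the prey's cell."""
    n_cells = grid_size * grid_size
    return pos_to_state(predator_pos, grid_size) * n_cells + pos_to_state(prey_pos, grid_size)


print(pos_to_state((3, 7)))          # 37
print(joint_state((3, 7), (0, 2)))   # 37 * 100 + 2 = 3702
print(10 ** 2, 10 ** 4)              # Q-table rows needed: 100 vs 10000
```

With the position-only encoding a tabular agent cannot distinguish where the prey is, so it can at best learn which cells tend to be rewarding on average. Including the prey's position makes chasing learnable at the cost of a table 100 times larger per agent.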

5.3 Expected Training Results

An illustrative log; exact numbers vary from run to run.

Episode 500, Avg Reward: -12.34, Capture Rate: 8.20%
Episode 1000, Avg Reward: -8.56, Capture Rate: 15.40%
Episode 3000, Avg Reward: -2.12, Capture Rate: 32.80%
Episode 5000, Avg Reward: 1.45, Capture Rate: 48.60%
Episode 8000, Avg Reward: 3.78, Capture Rate: 55.20%
Episode 10000, Avg Reward: 4.23, Capture Rate: 58.40%

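6. Frontiers and Challenges of MARL

6.1 Current Challenges

Challenge	Description	Research directions
Scalability	More agents lead to the curse of dimensionality	Mean-field approximation, hierarchical MARL
Learned communication	How agents can communicate effectively	Differentiable communication
Opponent modeling	Modeling the other agents	Theory of mind
Safe MARL	Ensuring cooperative behavior is safe	Extensions of safe reinforcement learning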

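6.2 Real-World Examples

OpenAI Hide and Seek: agents spontaneously learned to use tools, showing emergent behaviors (teamwork, blocking exits)

AlphaStar (StarCraft II): competes against human players; league training against a population of agents avoids strategy degeneration

Autonomous driving (e.g. Waymo): decision-making among multiple vehicles, including predicting the behavior of other vehicles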

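7. Summary

7.1 Key Takeaways

Core concepts: stochastic games, non-stationarity, credit assignment
Three learning paradigms: independent learning, joint learning, CTDE
Classic algorithms: IQL, MADDPG, QMIX

7.2 Learning Path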

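Part 33: Q-Learning & DQN (single-agent foundations)
  ↓
Part 34: Actor-Critic (continuous actions)
  ↓
Part 35: PPO (policy optimization)
  ↓
Part 36: MARL (this article)
  ↓
Next steps: Meta-RL / imitation learning / frontier research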

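Next up: [Part 37] Model Predictive Control (MPC): AI's Ability to Predict and Plan

We will move from learning-based methods to model-based methods and explore how to use a model of the system dynamics for prediction and optimization.

This is Part 36 of the series, covering the core concepts and algorithms of multi-agent reinforcement learning. Questions are welcome in the comments.

Tags: multi-agent reinforcement learning, MARL, MADDPG, QMIX, cooperative learning, emergent behavior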