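Author's note: In the earlier reinforcement learning installments of this series, we saw how an agent can learn an optimal policy by interacting with its environment through trial and error. That approach has clear limitations: it needs a large amount of interaction data, and exploration can be dangerous. In the real world, however, we often have expert demonstrations: driving logs from human drivers, gameplay recordings from expert players, operating demonstrations from workers, and so on. **Imitation Learning** is the family of techniques that lets an agent learn from such experts. This article walks through the principles of imitation learning and its classic algorithms, and implements a complete imitation learning system.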
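1. Why Do We Need Imitation Learning?
1.1 Limitations of Reinforcement Learning
The core problems of RL:
| Problem | Description | Consequence |
|---|---|---|
| Low sample efficiency | Often needs millions of interactions | Long training time, high compute cost |
| Exploration risk | Random exploration can lead to dangerous actions | Hard to use in safety-critical systems |
| Reward design is hard | The reward function has to be hand-crafted | The agent easily learns to exploit shortcuts in the reward |
| Cold start | The initial policy is completely random | Very low learning efficiency early on |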
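1.2 Advantages of Imitation Learning
The core idea of imitation learning:
Expert demonstrations → supervised learning → agent policy
In its simplest form (behavior cloning), the agent learns directly from data and does not need to interact with the environment.
Advantages:
| Advantage | Description |
|---|---|
| High sample efficiency | Learns from expert data, no exploration needed |
| Good safety | No trial and error, so dangerous behavior is avoided |
| Reward-free | No reward function has to be designed |
| Cold-start friendly | Quickly yields a reasonable initial policy |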
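1.3 Application Scenarios
| Domain | Application | Source of expert data |
|---|---|---|
| Autonomous driving | End-to-end driving | Human driver data |
| Robotics | Grasping, manipulation | Human demonstration, teleoperation |
| Game AI | Game characters | Recordings of human players |
| Dialogue systems | Customer service bots | Conversations of human agents |
| Healthcare | Diagnostic decisions | Physicians' diagnostic records |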
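2. A Taxonomy of Imitation Learning
2.1 By Learning Method
Imitation learning
├── Behavior Cloning (BC)
│ └── Direct supervised learning: state → action
│ └── The simplest approach, but suffers from compounding errors
│
├── Inverse Reinforcement Learning (Inverse RL)
│ └── Infers a reward function from expert behavior
│ └── Then runs RL on that reward function
│
└── Interactive imitation learning
├── DAgger (Dataset Aggregation): queries the expert during training
└── GAIL (Generative Adversarial Imitation Learning): interacts with the environment, not with the expert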
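2.2 By Degree of Expert Involvement
| Type | Expert involvement | Representative algorithms | Characteristics |
|---|---|---|---|
| Offline imitation | Provides a dataset only | BC, GAIL | The expert takes no part in training (GAIL still needs environment rollouts) |
| Online interaction | Can be queried | DAgger | The agent can ask the expert for labels |
| Active learning | Queried selectively | Active DAgger | The expert is queried only when needed |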
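3. Behavior Cloning
3.1 Core Idea
Treat imitation learning as a supervised learning problem:
min_θ E_{(s,a)~D} [ L(π_θ(s), a) ]
where:
- D = {(s1,a1), (s2,a2), ..., (sN,aN)}: the expert demonstration dataset
- π_θ: the policy network with parameters θ
- L: the loss function (e.g. MSE for continuous actions, cross-entropy for discrete actions)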
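Written out over the N demonstration pairs, the objective takes the two familiar forms below. This is a sketch in standard notation rather than a formula from the original text: π_θ(a | s) denotes the probability the policy assigns to a discrete action a, and the continuous case may differ from a framework's MSE implementation by a constant factor.

```latex
\min_\theta \; \frac{1}{N}\sum_{i=1}^{N} \bigl\lVert \pi_\theta(s_i) - a_i \bigr\rVert^2
\qquad \text{(continuous actions, MSE)}

\min_\theta \; -\frac{1}{N}\sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i)
\qquad \text{(discrete actions, cross-entropy)}
```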
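3.2 Complete Implementation
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

class ExpertDataset(Dataset):
    """Dataset of expert demonstrations (state-action pairs)."""

    def __init__(self, states, actions):
        actions = np.asarray(actions)
        self.states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
        if actions.ndim == 1:
            # 1-D action array: treated as discrete action indices
            self.actions = torch.as_tensor(actions, dtype=torch.long)
            self.discrete = True
        else:
            # 2-D action array: treated as continuous action vectors
            self.actions = torch.as_tensor(actions, dtype=torch.float32)
            self.discrete = False

    def __len__(self):
        return len(self.states)

    def __getitem__(self, idx):
        return self.states[idx], self.actions[idx]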
class BCAgent(nn.Module):
    """Behavior cloning agent: an MLP that maps states to actions."""

    def __init__(self, state_dim, action_dim, hidden_dims=(256, 256), discrete=False):
        super().__init__()
        self.discrete = discrete
        self.action_dim = action_dim
        # Build the network
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.2),
            ])
            prev_dim = hidden_dim
        self.feature_extractor = nn.Sequential(*layers)
        self.output_layer = nn.Linear(prev_dim, action_dim)
        self.tanh = nn.Tanh()

    def forward(self, state):
        features = self.feature_extractor(state)
        if self.discrete:
            # Raw logits; CrossEntropyLoss applies log-softmax internally
            return self.output_layer(features)
        # Continuous actions are squashed into (-1, 1); assumes normalized actions
        return self.tanh(self.output_layer(features))

    def predict(self, state):
        """Return the greedy action(s) for a state or a batch of states as a NumPy array."""
        if isinstance(state, np.ndarray):
            state = torch.as_tensor(state, dtype=torch.float32)
        if state.dim() == 1:
            state = state.unsqueeze(0)
        self.eval()  # make sure dropout is disabled at inference time
        with torch.no_grad():
            output = self.forward(state)
        if self.discrete:
            action = torch.argmax(output, dim=-1)
        else:
            action = output
        return action.cpu().numpy()
class BehaviorCloning:
    """Behavior cloning trainer."""

    def __init__(self, state_dim, action_dim, discrete=False, lr=1e-3):
        self.agent = BCAgent(state_dim, action_dim, discrete=discrete)
        self.discrete = discrete
        self.optimizer = optim.Adam(self.agent.parameters(), lr=lr)
        # Halve the learning rate every 50 epochs
        self.scheduler = optim.lr_scheduler.StepLR(
            self.optimizer, step_size=50, gamma=0.5
        )

    def train(self, dataset, batch_size=64, epochs=100, val_split=0.2):
        # Split into training and validation sets
        val_size = int(len(dataset) * val_split)
        train_size = len(dataset) - val_size
        train_dataset, val_dataset = torch.utils.data.random_split(
            dataset, [train_size, val_size]
        )
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
        criterion = nn.CrossEntropyLoss() if self.discrete else nn.MSELoss()
        best_val_loss = float('inf')
        for epoch in range(epochs):
            # Training pass
            self.agent.train()
            train_loss = 0
            for states, actions in train_loader:
                self.optimizer.zero_grad()
                outputs = self.agent(states)
                loss = criterion(outputs, actions)
                loss.backward()
                self.optimizer.step()
                train_loss += loss.item()
            train_loss /= len(train_loader)
            # Validation pass
            self.agent.eval()
            val_loss = 0
            with torch.no_grad():
                for states, actions in val_loader:
                    outputs = self.agent(states)
                    loss = criterion(outputs, actions)
                    val_loss += loss.item()
            val_loss /= len(val_loader)
            self.scheduler.step()
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                # Clone the tensors: dict.copy() is shallow and would keep
                # pointing at weights that later epochs overwrite
                self.best_model_state = {
                    k: v.clone() for k, v in self.agent.state_dict().items()
                }
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}/{epochs}, "
                      f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
        # Restore the weights with the lowest validation loss
        self.agent.load_state_dict(self.best_model_state)
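3.3 The Problem with Behavior Cloning
Compounding errors:
Problem description:
During training: the agent only sees states the expert visited
At test time: the agent may reach states it never saw during training
↓
It makes mistakes in those states
↓
It drifts into even more unfamiliar states
↓
Errors accumulate and performance collapses
Error analysis:
Suppose the policy's per-step error on the training distribution is ε and the trajectory length is T.
Then the worst-case bound on the expected total cost is O(εT²).
In other words, the error can grow quadratically with the trajectory length, whereas ordinary supervised learning on i.i.d. data would only give O(εT).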
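The quadratic growth can be made concrete with a deliberately pessimistic toy model. This is an assumption made for illustration, not the formal analysis: at every step the policy leaves the expert's state distribution with probability ε, and once off-distribution it never recovers and pays a cost of 1 per step. The function names are hypothetical.

```python
def cost_without_recovery(eps: float, horizon: int) -> float:
    """Expected total cost when a single mistake derails the rest of the episode."""
    p_on_track = 1.0  # probability of still being on the expert's distribution
    total = 0.0
    for _ in range(horizon):
        p_on_track *= 1.0 - eps    # the policy errs at this step with probability eps
        total += 1.0 - p_on_track  # every off-track step costs 1
    return total


def cost_with_recovery(eps: float, horizon: int) -> float:
    """Expected total cost when each mistake costs 1 and is then corrected."""
    return eps * horizon


if __name__ == "__main__":
    eps = 0.001
    for horizon in (100, 200, 400):
        print(horizon,
              round(cost_without_recovery(eps, horizon), 2),
              round(cost_with_recovery(eps, horizon), 2))
```

In this model, doubling the horizon roughly quadruples the first cost (for εT much smaller than 1 it is close to εT²/2) but only doubles the second. Teaching the policy to recover is what moves it from the first regime toward the second.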
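4. DAgger: Dataset Aggregation
4.1 Core Idea
The key insight behind DAgger (Dataset Aggregation):
The agent has to learn how to recover from the states its own mistakes lead to.
Algorithm:
Initialize: D = expert demonstration dataset
for iteration = 1, 2, ..., N:
1. Train the policy π_θ on D
2. Run π_θ to collect new trajectories
3. Ask the expert to label the states on these trajectories with the correct actions
4. Add the newly labeled data to D
end
Key points:
- The agent collects data under its own state distribution
- The expert only labels actions; it does not have to execute them
- The dataset comes to cover the states the agent actually visits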
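The loop above can be sketched on a toy problem before moving to neural networks. Everything in this example is an assumption chosen for illustration: a noisy 1-D world on the integers from -5 to 5, an expert that always steps toward 0, and a tabular "policy" that falls back to a fixed default action (+1) in states it has never seen labeled.

```python
import random


def expert(x):
    """Expert action: step toward the origin."""
    return -1 if x > 0 else (1 if x < 0 else 0)


def step(x, action, rng):
    """Noisy 1-D dynamics on the integer interval [-5, 5]."""
    return max(-5, min(5, x + action + rng.choice((-1, 0, 1))))


def rollout(policy, horizon, rng):
    """Run `policy` from x = 0 and return the list of visited states."""
    x, visited = 0, []
    for _ in range(horizon):
        visited.append(x)
        x = step(x, policy(x), rng)
    return visited


def fit(dataset):
    """'Train' a tabular policy; unseen states get the default action +1."""
    table = dict(dataset)
    return lambda x: table.get(x, 1)


def dagger(n_iters=5, horizon=50, seed=0):
    rng = random.Random(seed)
    # Initial demonstrations: a very short expert rollout
    data = {(x, expert(x)) for x in rollout(expert, 3, rng)}
    policy = fit(data)
    for _ in range(n_iters):
        visited = rollout(policy, horizon, rng)     # the learner drives
        data |= {(x, expert(x)) for x in visited}   # the expert only labels
        policy = fit(data)                          # retrain on the aggregate
    return policy, data


if __name__ == "__main__":
    policy, data = dagger()
    print("labeled states:", sorted(x for x, _ in data))
```

The point of the sketch is the data flow: states come from the learner's own rollouts, labels come from the expert, and each retraining step uses the whole aggregated set rather than only the newest batch.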
4.2 Complete Implementation
class DAggerTrainer:
    """DAgger (Dataset Aggregation) trainer."""

    def __init__(self, env, expert_policy, state_dim, action_dim,
                 discrete=False, lr=1e-3):
        self.env = env
        self.expert_policy = expert_policy
        self.discrete = discrete
        # Underlying BC learner
        self.bc = BehaviorCloning(state_dim, action_dim, discrete=discrete, lr=lr)
        # Aggregated dataset
        self.all_states = []
        self.all_actions = []

    def collect_expert_data(self, n_trajectories=10, max_steps=500):
        """Collect expert demonstrations by letting the expert act in the environment."""
        states = []
        actions = []
        for ep in range(n_trajectories):
            state = self.env.reset()
            if isinstance(state, tuple):  # newer API: reset() returns (obs, info)
                state = state[0]
            for step in range(max_steps):
                action = self.expert_policy(state)
                states.append(state)
                actions.append(action)
                result = self.env.step(action)
                if len(result) == 5:  # newer API: separate terminated/truncated flags
                    state, _, terminated, truncated, _ = result
                    done = terminated or truncated
                else:  # older gym API
                    state, _, done, _ = result
                if done:
                    break
        return np.array(states), np.array(actions)

    def collect_policy_data(self, max_steps=500):
        """Run the current policy for one episode and return the visited states."""
        states = []
        state = self.env.reset()
        if isinstance(state, tuple):
            state = state[0]
        for step in range(max_steps):
            states.append(state)
            action = self.bc.agent.predict(state)[0]
            result = self.env.step(action)
            if len(result) == 5:
                state, _, terminated, truncated, _ = result
                done = terminated or truncated
            else:
                state, _, done, _ = result
            if done:
                break
        return np.array(states)

    def label_with_expert(self, states):
        """Ask the expert which action it would take in each state."""
        actions = []
        for state in states:
            action = self.expert_policy(state)
            actions.append(action)
        return np.array(actions)

    def train(self, n_iterations=10, n_trajectories_per_iter=5,
              max_steps=500, batch_size=64, epochs=20):
        print("🚀 Starting DAgger Training...")
        # 1. Collect the initial expert demonstrations
        print("\n📊 Collecting initial expert demonstrations...")
        init_states, init_actions = self.collect_expert_data(
            n_trajectories=5, max_steps=max_steps
        )
        self.all_states.extend(init_states)
        self.all_actions.extend(init_actions)
        print(f"Initial dataset size: {len(self.all_states)}")
        for iteration in range(n_iterations):
            print(f"\n{'='*50}")
            print(f"DAgger Iteration {iteration + 1}/{n_iterations}")
            print(f"{'='*50}")
            # 2. Train the policy on the aggregated dataset
            print("\n🎓 Training policy...")
            dataset = ExpertDataset(
                np.array(self.all_states),
                np.array(self.all_actions)
            )
            self.bc.train(dataset, batch_size=batch_size, epochs=epochs)
            # 3. Run the policy to collect the states it actually visits
            print("\n🎮 Collecting new states with current policy...")
            new_states_list = []
            for _ in range(n_trajectories_per_iter):
                states = self.collect_policy_data(max_steps=max_steps)
                new_states_list.append(states)
            new_states = np.vstack(new_states_list)
            print(f"Collected {len(new_states)} new states")
            # 4. Have the expert label the new states
            print("\n📝 Labeling new states with expert...")
            new_actions = self.label_with_expert(new_states)
            # 5. Add the labeled states to the dataset
            self.all_states.extend(new_states)
            self.all_actions.extend(new_actions)
            print(f"Dataset size: {len(self.all_states)}")
            # 6. Evaluate the current policy
            avg_reward = self.evaluate(n_episodes=5, max_steps=max_steps)
            print(f"\n📈 Average Reward: {avg_reward:.2f}")
        print("\n✅ DAgger training completed!")

    def evaluate(self, n_episodes=10, max_steps=500):
        """Evaluate the current policy and return the average episode reward."""
        total_reward = 0
        for _ in range(n_episodes):
            state = self.env.reset()
            if isinstance(state, tuple):
                state = state[0]
            episode_reward = 0
            for step in range(max_steps):
                action = self.bc.agent.predict(state)[0]
                result = self.env.step(action)
                if len(result) == 5:
                    state, reward, terminated, truncated, _ = result
                    done = terminated or truncated
                else:
                    state, reward, done, _ = result
                episode_reward += reward
                if done:
                    break
            total_reward += episode_reward
        return total_reward / n_episodes
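5. A Brief Introduction to Inverse Reinforcement Learning
5.1 Core Idea
Forward RL vs. inverse RL:
Forward RL:
Reward function R → optimal policy π*
Inverse RL:
Expert policy π_E → reward function R → optimal policy π* ≈ π_E
Why IRL?
- A reward function is often a more compact description of a task than a policy
- A learned reward can transfer to new environments
- It gives a better understanding of the expert's intent
5.2 Maximum Entropy IRL
Goal: find a reward function that explains the expert's behavior:
max_R Σ_{(s,a)∈D} log π_R(a|s) - λ||R||²
where π_R is the maximum-entropy (soft-optimal) policy under the reward function R, the sum runs over the expert's state-action pairs, and λ weights the regularizer.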
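For reference, the same idea is often written at the trajectory level. The block below is a sketch of that common formulation under stated assumptions (deterministic dynamics, a reward that is linear in trajectory features); the feature vector f_τ, the weights θ, and the number of expert trajectories M are symbols introduced here, not taken from the text above.

```latex
P(\tau \mid \theta) = \frac{\exp\bigl(R_\theta(\tau)\bigr)}{Z(\theta)},
\qquad R_\theta(\tau) = \theta^\top f_\tau,
\qquad Z(\theta) = \sum_{\tau'} \exp\bigl(R_\theta(\tau')\bigr)

\max_\theta \; \sum_{i=1}^{M} \log P(\tau_i \mid \theta) \;-\; \lambda \lVert \theta \rVert^2

\nabla_\theta \sum_{i=1}^{M} \log P(\tau_i \mid \theta)
  = \sum_{i=1}^{M} f_{\tau_i} \;-\; M \sum_{\tau} P(\tau \mid \theta)\, f_\tau
```

The gradient of the likelihood term has an intuitive reading: it vanishes when the expected features under the current reward match the average features of the expert trajectories.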
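6. Hands-On Project: Imitation Learning on CartPole
6.1 Environment Setup
try:
    import gymnasium as gym  # maintained successor of gym
except ImportError:
    import gym
import numpy as np

class CartPoleExpert:
    """Heuristic expert policy for CartPole."""

    def __call__(self, state):
        pos, vel, angle, angle_vel = state
        # Simple heuristic: push the cart toward the side the pole is leaning
        # (angle < 0 means leaning left; action 0 pushes left, action 1 pushes right)
        if angle < 0:
            return 0
        else:
            return 1

    def get_action(self, state):
        return self(state)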
6.2 Complete Training Pipeline
class CartPoleImitationLearning:
    """End-to-end imitation learning experiment on CartPole."""

    def __init__(self):
        self.env = gym.make('CartPole-v1')
        self.state_dim = self.env.observation_space.shape[0]
        self.action_dim = self.env.action_space.n
        self.expert = CartPoleExpert()
        self.bc_history = []
        self.dagger_history = []

    def train_behavior_cloning(self, n_trajectories=50, epochs=100):
        print("="*60)
        print("Training Behavior Cloning")
        print("="*60)
        # Collect expert data
        print("\n📊 Collecting expert data...")
        states = []
        actions = []
        for ep in range(n_trajectories):
            state = self.env.reset()
            if isinstance(state, tuple):  # newer API: reset() returns (obs, info)
                state = state[0]
            done = False
            while not done:
                action = self.expert(state)
                states.append(state)
                actions.append(action)
                result = self.env.step(action)
                if len(result) == 5:
                    state, _, terminated, truncated, _ = result
                    done = terminated or truncated
                else:
                    state, _, done, _ = result
        states = np.array(states)
        actions = np.array(actions)
        print(f"Collected {len(states)} expert state-action pairs")
        # Train BC
        print("\n🎓 Training BC model...")
        bc = BehaviorCloning(self.state_dim, self.action_dim, discrete=True)
        dataset = ExpertDataset(states, actions)
        bc.train(dataset, batch_size=64, epochs=epochs)
        # Evaluate over 20 single-episode runs
        print("\n🧪 Evaluating BC...")
        rewards = []
        for _ in range(20):
            reward = self.evaluate_policy(bc.agent)
            rewards.append(reward)
        avg_reward = np.mean(rewards)
        print(f"Average Reward: {avg_reward:.2f}")
        self.bc_model = bc
        self.bc_history = rewards
        return avg_reward

    def train_dagger(self, n_iterations=10):
        print("\n" + "="*60)
        print("Training DAgger")
        print("="*60)
        dagger = DAggerTrainer(
            self.env, self.expert,
            self.state_dim, self.action_dim,
            discrete=True
        )
        dagger.train(n_iterations=n_iterations)
        # Evaluate over 20 single-episode runs
        print("\n🧪 Evaluating DAgger...")
        rewards = []
        for _ in range(20):
            reward = self.evaluate_policy(dagger.bc.agent)
            rewards.append(reward)
        avg_reward = np.mean(rewards)
        print(f"Average Reward: {avg_reward:.2f}")
        self.dagger_model = dagger
        self.dagger_history = rewards
        return avg_reward

    def evaluate_policy(self, policy, n_episodes=1, max_steps=500):
        """Return the average episode reward of `policy` over `n_episodes`."""
        total_reward = 0
        for _ in range(n_episodes):
            state = self.env.reset()
            if isinstance(state, tuple):
                state = state[0]
            episode_reward = 0
            for step in range(max_steps):
                action = policy.predict(state)[0]
                result = self.env.step(action)
                if len(result) == 5:
                    state, reward, terminated, truncated, _ = result
                    done = terminated or truncated
                else:
                    state, reward, done, _ = result
                episode_reward += reward
                if done:
                    break
            total_reward += episode_reward
        return total_reward / n_episodes

    def compare_results(self):
        print("\n" + "="*60)
        print("Comparison: BC vs DAgger")
        print("="*60)
        bc_avg = np.mean(self.bc_history)
        dagger_avg = np.mean(self.dagger_history)
        print(f"Behavior Cloning: {bc_avg:.2f} ± {np.std(self.bc_history):.2f}")
        print(f"DAgger: {dagger_avg:.2f} ± {np.std(self.dagger_history):.2f}")
        print(f"Improvement: {dagger_avg - bc_avg:.2f}")

# Run the full experiment
if __name__ == "__main__":
    experiment = CartPoleImitationLearning()
    bc_reward = experiment.train_behavior_cloning(n_trajectories=50, epochs=100)
    dagger_reward = experiment.train_dagger(n_iterations=10)
    experiment.compare_results()
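6.3 Expected Results
The log below is abridged and illustrative; actual numbers depend on the random seed and, above all, on the quality of the expert. An imitator generally cannot do much better than the expert it copies. The 2500 pairs from 50 expert trajectories imply roughly 50 steps per expert episode for the angle-only heuristic from 6.1, so rewards in the hundreds should only be expected with a stronger expert (for example, one that also uses the angular velocity).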
============================================================
Training Behavior Cloning
============================================================
📊 Collecting expert data...
Collected 2500 expert state-action pairs
🎓 Training BC model...
Epoch 100/100, Train Loss: 0.0123, Val Loss: 0.0891
🧪 Evaluating BC...
Average Reward: 150.5 ± 45.2
============================================================
Training DAgger
============================================================
DAgger Iteration 10/10
Dataset size: 15000
Average Reward: 485.2
🧪 Evaluating DAgger...
Average Reward: 478.9 ± 23.4
============================================================
Comparison: BC vs DAgger
============================================================
Behavior Cloning: 150.5 ± 45.2
DAgger: 478.9 ± 23.4
Improvement: 328.4
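7. Combining Imitation Learning with RL
7.1 Pretraining + Fine-Tuning
# Step 1: pretrain with imitation learning
# (bc_agent, rl_agent and their methods are placeholders, not a specific library API)
bc_agent.train(expert_dataset)
# Step 2: fine-tune with RL, starting from the BC weights
rl_agent.load(bc_agent.state_dict())
rl_agent.train_with_ppo(env)  # or any other RL algorithm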
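In PyTorch, the weight hand-off in step 2 can look like the following minimal sketch. It assumes the RL actor uses exactly the same architecture as the BC policy; `make_policy`, the layer sizes, and the learning rate are hypothetical choices, and the PPO update itself is left out.

```python
import torch
import torch.nn as nn


def make_policy(state_dim: int, n_actions: int, hidden: int = 64) -> nn.Sequential:
    """Hypothetical architecture shared by the BC learner and the RL actor."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )


torch.manual_seed(0)
bc_policy = make_policy(4, 2)  # stands in for a policy already trained with BC
rl_actor = make_policy(4, 2)   # fresh actor with different random weights

# Initialise the RL actor from the BC weights.
# load_state_dict copies the values, so the two networks stay independent.
rl_actor.load_state_dict(bc_policy.state_dict())

# Assumption: a smaller learning rate than in pretraining, so the first
# RL updates do not immediately wipe out the imitation prior.
finetune_optimizer = torch.optim.Adam(rl_actor.parameters(), lr=1e-4)
```

After the copy, both networks produce identical outputs for the same input, and any subsequent RL update changes only `rl_actor`.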
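7.2 RLHF (Reinforcement Learning from Human Feedback)
Pipeline:
1. Pretrain a language model
2. Fine-tune it on human-written demonstrations (supervised fine-tuning, which is essentially behavior cloning)
3. Collect human preference data
4. Train a reward model
5. Optimize the policy with PPO
In broad strokes, this is the recipe publicly described for ChatGPT.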
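The reward-model step is commonly trained on pairs of responses with a pairwise (Bradley-Terry style) loss: the model should score the human-preferred response higher than the rejected one. The sketch below shows only that loss on scalar scores; the function name is hypothetical, and a real reward model would produce the scores with a neural network.

```python
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the preferred response wins.

    Equals -log(sigmoid(r_chosen - r_rejected)), computed so that
    math.exp never receives a large positive argument.
    """
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    print(preference_loss(0.0, 0.0))  # equal scores: log(2)
    print(preference_loss(2.0, 0.0))  # correct ranking: smaller loss
    print(preference_loss(0.0, 2.0))  # wrong ranking: larger loss
```

Minimizing this loss over many labeled pairs pushes the score gap in the direction of the human preference, and the resulting scalar score is what the RL stage then optimizes.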
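8. Summary
8.1 Key Takeaways
- Core advantages: high sample efficiency, good safety, no hand-designed reward function
- Main methods: BC (simple but suffers from compounding errors), DAgger (interactive), IRL (elegant but complex)
- When to use it: expert demonstrations are available, exploration is costly or dangerous, or a good initial policy is needed quickly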
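8.2 Learning Path
Parts 33-36: Reinforcement learning (model-free)
├── Q-Learning / DQN
├── Actor-Critic / PPO
└── MARL
Part 37: Model predictive control (model-based)
└── MPC
Part 38: Imitation learning (this article)
├── Behavior cloning (BC)
├── DAgger
└── Inverse reinforcement learning (IRL)
Combined applications
├── Pretraining + fine-tuning
└── RLHF (ChatGPT)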
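Next up: [Part 39] Introduction to Meta-Learning: Teaching AI How to Learn
We will move into meta-learning and explore how to make AI adapt quickly to new tasks, that is, "learning to learn".
This is Part 38 of the series, covering the principles and practice of imitation learning. Questions are welcome in the comments.
Tags: Imitation Learning, Behavior Cloning, DAgger, Inverse Reinforcement Learning, Expert Demonstrations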