Machine Learning, Chapter 6: Reinforcement Learning

🔥 This post comes with complete, runnable code. Every example has been tested and includes visual comparisons, core algorithm implementations, and hands-on projects, so even complete beginners can get started with reinforcement learning!

I. Preface

Reinforcement learning (RL) is one of the three main branches of machine learning, alongside supervised and unsupervised learning. Its core idea is to let an agent learn an optimal decision policy on its own through trial and error guided by rewards. Based on Chapter 6 of Machine Learning, this post covers reinforcement learning from basic concepts to hands-on applications; all code runs as-is and comes with visual comparisons to help you genuinely understand the subject.

II. Main Content

6.1 Overview of Reinforcement Learning

6.1.1 Reinforcement Learning Basics

Core concepts (in plain terms)
  • Agent: the learner and decision maker (e.g., a chess-playing AI, the decision module of a self-driving car)
  • Environment: the world the agent acts in (e.g., the chessboard, the road)
  • State: the agent's current situation (e.g., the positions of the pieces on the board)
  • Action: an operation the agent can perform (e.g., placing a stone, steering the car)
  • Reward: the environment's feedback on an action (e.g., +10 for winning a game, -5 for hitting a wall)
  • Policy: the rule the agent uses to choose actions (the core object being learned)
Core loop: the agent observes the current state, picks an action according to its policy, the environment returns a reward and the next state, and the cycle repeats until the episode ends; a minimal sketch of this loop follows below.
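
The following is a minimal sketch of that interaction loop, assuming a hypothetical Gym-style environment object with reset() and step() methods (the grid-world environments defined later in this post follow exactly this interface); it only shows where state, action, reward, and policy fit together.

import numpy as np

def run_episode(env, policy, max_steps=100):
    """Generic agent-environment loop.
    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done)."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                        # policy maps state -> action
        next_state, reward, done = env.step(action)   # environment feedback
        total_reward += reward                        # accumulate the reward signal
        state = next_state                            # move on to the next state
        if done:
            break
    return total_reward

# Example policy: pick one of 4 discrete actions uniformly at random
random_policy = lambda state: np.random.choice(4)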

6.1.2 The Markov Model

Core idea

The Markov property: the future depends only on the present, not on the past (for example, model the next hour's weather as depending only on the current weather, not on yesterday's). In reinforcement learning, assuming that state transitions satisfy the Markov property is the central assumption and the key to keeping the computation tractable.

Code example: a Markov chain
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = ["SimHei", "WenQuanYi Micro Hei", "Heiti TC"]  # 支持中文显示

# 定义马尔可夫链的状态转移矩阵(3个状态:晴天、阴天、雨天)
transition_matrix = np.array([
    [0.8, 0.15, 0.05],  # 晴天→晴天(80%)、阴天(15%)、雨天(5%)
    [0.2, 0.7, 0.1],    # 阴天→晴天(20%)、阴天(70%)、雨天(10%)
    [0.1, 0.3, 0.6]     # 雨天→晴天(10%)、阴天(30%)、雨天(60%)
])

# 状态名称
states = ["晴天", "阴天", "雨天"]
# Initial state distribution: start from a sunny day
current_state_dist = np.array([1, 0, 0])
# 记录各状态概率变化
state_history = []

# 模拟10步状态转移
for step in range(10):
    state_history.append(current_state_dist.copy())
    # 状态转移:当前分布 × 转移矩阵
    current_state_dist = current_state_dist @ transition_matrix

# 可视化状态概率变化
state_history = np.array(state_history)
plt.figure(figsize=(10, 6))
for i, state in enumerate(states):
    plt.plot(state_history[:, i], label=state, marker='o')
plt.xlabel("步数")
plt.ylabel("状态概率")
plt.title("马尔可夫链状态转移可视化")
plt.legend()
plt.grid(True)
plt.show()

Output:

(Note: the plot shows how the probabilities of the three weather states evolve over the 10 simulated steps; starting from a guaranteed sunny day, the distribution gradually settles toward the chain's stationary distribution.)
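
As a quick check of that claim, the stationary distribution can be computed directly from the transition matrix; a minimal sketch, assuming transition_matrix from the code above. The probabilities plotted at step 10 should already be close to these values.

# Stationary distribution = left eigenvector of the transition matrix for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(transition_matrix.T)
stationary = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
stationary = stationary / stationary.sum()   # normalize into a probability vector
print("Stationary distribution (sunny, cloudy, rainy):", np.round(stationary, 3))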

6.2 Basic Reinforcement Learning

6.2.1 Value Iteration

Core idea

Value iteration is a classic algorithm for solving a Markov decision process (MDP). It repeatedly updates the value of every state until the values converge, and the optimal policy is then read off from the converged value function.

Full code
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = ["SimHei", "WenQuanYi Micro Hei", "Heiti TC"]

# 1. 定义环境(4x4网格世界)
# States: 0-15 (16 states in total); state 15, the bottom-right corner, is the terminal state
# 动作:0=上,1=右,2=下,3=左
n_states = 16
n_actions = 4
gamma = 0.9  # 折扣因子
theta = 1e-6  # 收敛阈值

# 2. 定义状态转移和奖励函数
def step(state, action):
    """执行动作,返回新状态和奖励"""
    # 终止状态
    if state == 15:
        return 15, 0
    # 计算坐标
    x = state % 4
    y = state // 4
    # 执行动作
    if action == 0:  # 上
        y = max(y-1, 0)
    elif action == 1:  # 右
        x = min(x+1, 3)
    elif action == 2:  # 下
        y = min(y+1, 3)
    elif action == 3:  # 左
        x = max(x-1, 0)
    # 新状态
    new_state = y * 4 + x
    # 奖励:到达终止状态+1,其他-0.1(鼓励尽快到达终点)
    reward = 1 if new_state == 15 else -0.1
    return new_state, reward

# 3. 值迭代算法实现
def value_iteration():
    # 初始化状态价值
    V = np.zeros(n_states)
    iteration = 0
    V_history = [V.copy()]  # 记录价值变化
    
    while True:
        delta = 0  # 价值最大变化量
        # 遍历所有状态
        for s in range(n_states):
            if s == 15:  # 终止状态跳过
                continue
            # 计算每个动作的价值
            action_values = []
            for a in range(n_actions):
                s_new, r = step(s, a)
                action_values.append(r + gamma * V[s_new])
            # 更新状态价值(取最大动作价值)
            v_new = max(action_values)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        
        V_history.append(V.copy())
        iteration += 1
        
        # 收敛判断
        if delta < theta:
            break
    
    # 从价值函数提取最优策略
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        if s == 15:
            continue
        action_values = []
        for a in range(n_actions):
            s_new, r = step(s, a)
            action_values.append(r + gamma * V[s_new])
        policy[s] = np.argmax(action_values)
    
    return V, policy, V_history, iteration

# 4. 运行值迭代
V_opt, policy_opt, V_history, n_iter = value_iteration()

# 5. 可视化结果
# 5.1 状态价值变化
plt.figure(figsize=(10, 6))
# 选取几个关键状态
key_states = [0, 3, 12, 14]
state_names = ["左上角", "右上角", "左下角", "右下角(终止前)"]
for i, s in enumerate(key_states):
    values = [v[s] for v in V_history]
    plt.plot(values, label=f"状态{state_names[i]}", marker='.')
plt.xlabel("迭代次数")
plt.ylabel("状态价值")
plt.title(f"值迭代状态价值收敛过程(总迭代次数:{n_iter})")
plt.legend()
plt.grid(True)

# 5.2 最优策略可视化
plt.figure(figsize=(8, 8))
action_symbols = ['↑', '→', '↓', '←']
policy_grid = policy_opt.reshape(4, 4)
V_grid = V_opt.reshape(4, 4)

# 绘制网格和策略
for i in range(4):
    for j in range(4):
        if i == 3 and j == 3:  # 终止状态
            text = "终止"
        else:
            text = f"{action_symbols[policy_grid[i,j]]}\n{V_grid[i,j]:.2f}"
        plt.text(j, 3-i, text, ha='center', va='center', fontsize=12)

plt.xlim(-0.5, 3.5)
plt.ylim(-0.5, 3.5)
plt.xticks(range(4))
plt.yticks(range(4))
plt.grid(True)
plt.title("4x4网格世界最优策略(动作)+ 状态价值")
plt.gca().set_aspect('equal', adjustable='box')
plt.show()

# 输出结果
print("最优状态价值(4x4网格):")
print(V_opt.reshape(4, 4))
print("\n最优策略(0=上,1=右,2=下,3=左):")
print(policy_opt.reshape(4, 4))
Output
  • Figure 1: line plot of how the values of a few key states converge with the number of iterations; the values gradually stabilize
  • Figure 2: the optimal policy on the 4x4 grid world; each cell shows the optimal action and the state value (a small numeric check of one Bellman backup follows below)
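
To connect the code with the Bellman update, here is a minimal numeric check of a single backup, assuming step, gamma, n_states and n_actions from the code above: for state 14 (the cell just left of the terminal state), with all values initialized to zero, the "right" action already gives the maximum backup of 1.0.

V0 = np.zeros(n_states)            # values all start at zero
backup_values = []
for a in range(n_actions):
    s_new, r = step(14, a)         # simulate each of the four actions from state 14
    backup_values.append(r + gamma * V0[s_new])
print("First value-iteration backup for state 14:", max(backup_values))  # expected: 1.0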

6.2.2 Temporal-Difference Learning

Core idea

Temporal-difference (TD) learning updates a value estimate after every single step, using the observed reward plus the discounted value estimate of the next state as the target: V(S_t) ← V(S_t) + α[R_t+1 + γ*V(S_t+1) - V(S_t)]. Unlike Monte Carlo methods it does not have to wait for a complete episode, which makes it well suited to online learning; the price is that each update bootstraps on an estimate and is therefore noisier.

Full code
import numpy as np
import matplotlib.pyplot as plt

# 解决中文和负号显示问题(完整的字体配置)
plt.rcParams["font.family"] = ["SimHei", "Microsoft YaHei", "DejaVu Sans"]
plt.rcParams['axes.unicode_minus'] = False  # 关键:解决负号显示问题
plt.rcParams['font.sans-serif'] = ['SimHei']


# 1. 定义环境(简化版网格世界)
class GridWorld:
    def __init__(self):
        self.rows = 3  # 行数(y轴)
        self.cols = 3  # 列数(x轴)
        self.end_state = (2, 2)  # 终点坐标 (x, y)
        self.current_state = (0, 0)  # 起点坐标 (x, y)

    def reset(self):
        """重置状态到起点"""
        self.current_state = (0, 0)
        return self.current_state

    def step(self, action):
        """执行动作:0=上,1=右,2=下,3=左"""
        x, y = self.current_state  # 当前状态 (x列, y行)

        # 动作映射修正:上/下对应y轴,左/右对应x轴
        if action == 0:  # 上:y轴减小
            y = max(y - 1, 0)
        elif action == 1:  # 右:x轴增大
            x = min(x + 1, 2)
        elif action == 2:  # 下:y轴增大
            y = min(y + 1, 2)
        elif action == 3:  # 左:x轴减小
            x = max(x - 1, 0)

        self.current_state = (x, y)
        # 奖励规则:到达终点+10,其他-1(鼓励尽快到达终点)
        reward = 10 if (x, y) == self.end_state else -1
        done = (x, y) == self.end_state  # 到达终点则结束本轮
        return self.current_state, reward, done


# 2. TD(0)算法实现(时序差分学习)
def td_learning(episodes=100, alpha=0.1, gamma=0.9):
    env = GridWorld()
    # 初始化状态价值函数 V[y, x]:y是行,x是列(对应网格坐标)
    V = np.zeros((env.rows, env.cols))
    total_rewards = []  # 记录每轮的总奖励(评估学习效果)

    for ep in range(episodes):
        state = env.reset()  # 每轮开始重置到起点
        done = False
        ep_reward = 0  # 本轮累计奖励

        while not done:
            x, y = state  # 当前状态 (x列, y行)
            # 纯随机选择动作(简化版探索策略,无ε-贪心)
            action = np.random.choice(4)
            # 执行动作,获取下一个状态、奖励、是否结束
            next_state, reward, done = env.step(action)
            nx, ny = next_state  # 下一个状态的坐标

            ep_reward += reward  # 累计本轮奖励

            # TD(0)核心更新公式
            # V(S_t) ← V(S_t) + α[R_t+1 + γ*V(S_t+1) - V(S_t)]
            # 其中:R_t+1是即时奖励,γ是折扣因子,α是学习率
            V[y, x] += alpha * (reward + gamma * V[ny, nx] - V[y, x])

            state = next_state  # 状态转移

        total_rewards.append(ep_reward)  # 记录本轮总奖励

    return V, total_rewards


# 3. 运行TD学习
V_td, rewards_td = td_learning(episodes=100, alpha=0.1, gamma=0.9)

# 4. 可视化结果
# 4.1 总奖励变化(学习曲线)
plt.figure(figsize=(10, 5))
plt.plot(rewards_td, label="每轮总奖励", color='blue', alpha=0.7)
# 移动平均平滑(消除随机波动,更易看趋势)
window = 5
smoothed = np.convolve(rewards_td, np.ones(window) / window, mode='valid')
plt.plot(range(window - 1, len(rewards_td)), smoothed,
         label=f"移动平均(窗口={window})", color='red', linewidth=2)
plt.xlabel("训练轮数")
plt.ylabel("总奖励")
plt.title("TD学习奖励变化曲线(奖励越高表示学习效果越好)")
plt.legend()
plt.grid(True, alpha=0.3)

# 4.2 状态价值可视化(网格形式,更直观)
plt.figure(figsize=(6, 6))
# 绘制网格线
for i in range(4):
    plt.axhline(y=i - 0.5, color='black', linewidth=0.5)
    plt.axvline(x=i - 0.5, color='black', linewidth=0.5)

# 填充每个格子的价值
for y in range(3):
    for x in range(3):
        if (x, y) == (2, 2):
            text = "终点"  # goal cell: it is terminal, so its value is never updated (stays 0)
            color = 'green'
        else:
            text = f"{V_td[y, x]:.2f}"
            color = 'black'
        # 坐标对应:plt的x是网格x,plt的y是 2-y(翻转y轴,让起点在左上角)
        plt.text(x, 2 - y, text, ha='center', va='center', fontsize=14, color=color)

plt.xlim(-0.5, 2.5)
plt.ylim(-0.5, 2.5)
plt.xticks(range(3))
plt.yticks(range(3))
plt.grid(True, alpha=0.3)
plt.title("TD学习后的状态价值分布(3x3网格世界)")
plt.gca().set_aspect('equal', adjustable='box')
plt.show()

# 输出结果
print("TD学习后的状态价值矩阵 V[y, x](行y,列x):")
print(V_td)
Output
  • Figure 1: total reward per episode; because actions are always chosen at random (TD(0) here only evaluates that random policy), the curve fluctuates rather than steadily improving; what converges is the value estimate, not the reward (a Monte Carlo comparison sketch follows below)
  • Figure 2: the state-value distribution on the 3x3 grid; values are highest in the cells next to the goal and decrease with distance from it (the goal cell itself is terminal and keeps value 0)
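
For comparison, the same random policy can be evaluated with Monte Carlo, which waits for the full episode and averages complete returns instead of bootstrapping. A minimal every-visit MC sketch, assuming the GridWorld class and V_td from the code above:

def mc_evaluation(episodes=100, gamma=0.9):
    """Every-visit Monte Carlo evaluation of the same random policy (V[y, x] layout)."""
    env = GridWorld()
    V = np.zeros((env.rows, env.cols))
    counts = np.zeros((env.rows, env.cols))
    for _ in range(episodes):
        state = env.reset()
        trajectory = []                          # the whole episode as (state, reward) pairs
        done = False
        while not done:
            action = np.random.choice(4)         # same random behaviour policy as the TD code
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state
        G = 0.0
        for (s, r) in reversed(trajectory):      # accumulate the discounted return backwards
            G = r + gamma * G
            x, y = s
            counts[y, x] += 1
            V[y, x] += (G - V[y, x]) / counts[y, x]   # incremental average of returns
    return V

V_mc = mc_evaluation()
print("MC estimate:\n", np.round(V_mc, 2))
print("TD estimate:\n", np.round(V_td, 2))

MC targets are complete returns and carry no bootstrap bias, while TD updates arrive after every step and can learn online; on this small grid the two estimates should come out close to each other.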

6.2.3 Q-Learning

Core idea

Q-learning learns a table of action values Q(s, a) rather than state values. After each step it moves the current entry toward the observed reward plus the discounted maximum Q value of the next state: Q(s, a) ← Q(s, a) + α[r + γ*max_a' Q(s', a') - Q(s, a)]. Because the update target takes the greedy max over next actions while the behavior can follow ε-greedy exploration, Q-learning is an off-policy algorithm.

Full code (the classic cliff-walking problem)
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams["font.family"] = ["SimHei"]
plt.rcParams['axes.unicode_minus'] = False

# 1. 定义悬崖行走环境
class CliffWalkingEnv:
    def __init__(self):
        self.rows = 4
        self.cols = 12
        self.start = (0, 0)
        self.goal = (0, 11)
        self.cliff = [(0, i) for i in range(1, 11)]  # 悬崖区域
        self.current_state = self.start

    def reset(self):
        """重置状态到起点"""
        self.current_state = self.start
        return self.current_state

    def step(self, action):
        """执行动作:0=上,1=右,2=下,3=左"""
        x, y = self.current_state

        # 执行动作
        if action == 0:  # 上
            x = max(x - 1, 0)
        elif action == 1:  # 右
            y = min(y + 1, self.cols - 1)
        elif action == 2:  # 下
            x = min(x + 1, self.rows - 1)
        elif action == 3:  # 左
            y = max(y - 1, 0)

        new_state = (x, y)
        self.current_state = new_state

        # 奖励设计:掉悬崖-100,到达终点+10,其他-1
        if new_state in self.cliff:
            reward = -100
            done = True  # 掉悬崖结束本轮
            self.current_state = self.start  # 重置到起点
        elif new_state == self.goal:
            reward = 10
            done = True
        else:
            reward = -1
            done = False

        return new_state, reward, done


# 2. Q学习算法实现
def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    env = CliffWalkingEnv()
    # 初始化Q表:[行, 列, 动作数]
    Q = np.zeros((env.rows, env.cols, 4))
    # 记录每轮的步数和奖励(用于可视化)
    steps_per_episode = []
    rewards_per_episode = []

    for ep in range(episodes):
        state = env.reset()
        done = False
        steps = 0
        total_reward = 0

        while not done:
            x, y = state
            steps += 1

            # ε-贪心选择动作
            if np.random.uniform(0, 1) < epsilon:
                action = np.random.choice(4)  # 探索:随机动作
            else:
                action = np.argmax(Q[x, y, :])  # 利用:最优动作

            # 执行动作
            next_state, reward, done = env.step(action)
            nx, ny = next_state
            total_reward += reward

            # Q学习更新公式
            old_q = Q[x, y, action]
            max_next_q = np.max(Q[nx, ny, :])
            Q[x, y, action] = old_q + alpha * (reward + gamma * max_next_q - old_q)

            state = next_state

        steps_per_episode.append(steps)
        rewards_per_episode.append(total_reward)

    # 提取最优策略
    policy = np.argmax(Q, axis=2)
    return Q, policy, steps_per_episode, rewards_per_episode


# 3. 运行Q学习
Q_table, optimal_policy, steps, rewards = q_learning()

# 4. 可视化结果
# 4.1 每轮步数变化(学习曲线)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(steps, label="每轮步数", color='blue')
# 移动平均平滑
window = 10
smoothed_steps = np.convolve(steps, np.ones(window) / window, mode='valid')
plt.plot(range(window - 1, len(steps)), smoothed_steps, label=f"移动平均({window})", color='red')
plt.xlabel("训练轮数")
plt.ylabel("完成任务步数")
plt.title("Q学习步数变化(步数越少越好)")
plt.legend()
plt.grid(True)

# 4.2 每轮奖励变化
plt.subplot(1, 2, 2)
plt.plot(rewards, label="每轮奖励", color='green')
smoothed_rewards = np.convolve(rewards, np.ones(window) / window, mode='valid')
plt.plot(range(window - 1, len(rewards)), smoothed_rewards, label=f"移动平均({window})", color='orange')
plt.xlabel("训练轮数")
plt.ylabel("总奖励")
plt.title("Q学习奖励变化(奖励越高越好)")
plt.legend()
plt.grid(True)

# 4.3 最优策略可视化
plt.figure(figsize=(15, 4))
action_symbols = ['↑', '→', '↓', '←']
for i in range(4):
    for j in range(12):
        if (i, j) in CliffWalkingEnv().cliff:
            text = "悬崖"
        elif (i, j) == CliffWalkingEnv().goal:
            text = "终点"
        else:
            text = action_symbols[optimal_policy[i, j]]
        plt.text(j, 3 - i, text, ha='center', va='center', fontsize=10)

plt.xlim(-0.5, 11.5)
plt.ylim(-0.5, 3.5)
plt.xticks(range(12))
plt.yticks(range(4))
plt.grid(True)
plt.title("悬崖行走最优策略")
plt.gca().set_aspect('equal', adjustable='box')
plt.show()

# 输出关键结果
print("Q学习最优策略(动作符号:↑→↓←):")
action_symbols = ['↑', '→', '↓', '←']
policy_symbols = np.vectorize(lambda x: action_symbols[x])(optimal_policy)
print(policy_symbols)
Output
  • Figure 1: two subplots showing steps and reward per episode; steps gradually fall and reward rises as training proceeds
  • Figure 2: the learned cliff-walking policy; the agent skirts the cliff and finds a path from start to goal (a sketch for rolling out this greedy policy follows below)
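
To check the learned behavior programmatically rather than only in the plot, the greedy policy can be rolled out from the start state; a minimal sketch, assuming CliffWalkingEnv and optimal_policy from the code above:

def rollout_greedy(policy, max_steps=50):
    """Follow the greedy policy from the start and return the list of visited cells."""
    env = CliffWalkingEnv()
    state = env.reset()
    path = [state]
    for _ in range(max_steps):
        action = policy[state[0], state[1]]   # greedy action for the current cell
        state, reward, done = env.step(action)
        path.append(state)
        if done:
            break
    return path

greedy_path = rollout_greedy(optimal_policy)
print("Greedy path (row, col):", greedy_path)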

6.3 Learning from Demonstrations

6.3.1 Imitation Learning

Core idea

Imitation learning: the agent learns a policy by mimicking an expert's behavior. The core idea is "learning from demonstrations", which suits settings where a reward function is hard to define.

Full code (imitating an expert through a maze)
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = ["SimHei", "WenQuanYi Micro Hei", "Heiti TC"]

# 1. 定义迷宫环境
class MazeEnv:
    def __init__(self):
        # 迷宫地图:0=通路,1=墙壁,2=起点,3=终点
        self.maze = np.array([
            [2, 1, 0, 0, 0],
            [0, 1, 0, 1, 0],
            [0, 0, 0, 1, 0],
            [1, 1, 0, 1, 0],
            [0, 0, 0, 0, 3]
        ])
        self.rows, self.cols = self.maze.shape
        # 起点和终点坐标
        self.start = (np.where(self.maze == 2)[0][0], np.where(self.maze == 2)[1][0])
        self.goal = (np.where(self.maze == 3)[0][0], np.where(self.maze == 3)[1][0])
        self.current_state = self.start
    
    def reset(self):
        self.current_state = self.start
        return self.current_state
    
    def step(self, action):
        """动作:0=上,1=右,2=下,3=左"""
        x, y = self.current_state
        
        # 执行动作(不能穿墙)
        if action == 0 and x > 0 and self.maze[x-1, y] != 1:
            x -= 1
        elif action == 1 and y < self.cols-1 and self.maze[x, y+1] != 1:
            y += 1
        elif action == 2 and x < self.rows-1 and self.maze[x+1, y] != 1:
            x += 1
        elif action == 3 and y > 0 and self.maze[x, y-1] != 1:
            y -= 1
        
        new_state = (x, y)
        self.current_state = new_state
        
        reward = 10 if new_state == self.goal else -0.1
        done = new_state == self.goal
        return new_state, reward, done

# 2. 生成专家示范(手动定义最优路径)
def generate_expert_demos():
    env = MazeEnv()
    # Expert path: start → down → down → right → right → down → down → right → right (reaches the goal)
    expert_actions = [2, 2, 1, 1, 2, 2, 1, 1]
    demos = []
    
    state = env.reset()
    demos.append((state, expert_actions[0]))
    for a in expert_actions:
        next_state, _, _ = env.step(a)
        if len(demos) < len(expert_actions):
            demos.append((next_state, expert_actions[len(demos)]))
    
    return demos

# 3. 模仿学习(行为克隆)
def behavior_cloning(demos, epochs=100, lr=0.01):
    # 简单的线性模型:状态→动作概率
    # 状态编码:(x,y) → x*cols + y
    env = MazeEnv()
    n_states = env.rows * env.cols
    n_actions = 4
    # 模型参数
    weights = np.zeros((n_states, n_actions))
    
    # 训练行为克隆模型
    loss_history = []
    for epoch in range(epochs):
        total_loss = 0
        for (state, action) in demos:
            x, y = state
            s_idx = x * env.cols + y
            
            # 计算动作概率(softmax)
            logits = weights[s_idx]
            probs = np.exp(logits) / np.sum(np.exp(logits))
            
            # 交叉熵损失
            loss = -np.log(probs[action] + 1e-8)
            total_loss += loss
            
            # 梯度下降更新
            probs[action] -= 1
            weights[s_idx] -= lr * probs
        
        loss_history.append(total_loss / len(demos))
    
    # 提取模仿策略
    policy = np.argmax(weights, axis=1).reshape(env.rows, env.cols)
    return policy, loss_history

# 4. 测试模仿学习效果
def test_imitation_policy(policy):
    env = MazeEnv()
    state = env.reset()
    done = False
    steps = 0
    path = [state]
    
    while not done and steps < 50:
        x, y = state
        action = policy[x, y]
        next_state, _, done = env.step(action)
        path.append(next_state)
        state = next_state
        steps += 1
    
    return path, steps

# 5. 主流程
# 生成专家示范
expert_demos = generate_expert_demos()
# 行为克隆训练
imitation_policy, loss_history = behavior_cloning(expert_demos)
# 测试模仿策略
path, steps = test_imitation_policy(imitation_policy)

# 6. 可视化
# 6.1 损失变化
plt.figure(figsize=(10, 5))
plt.plot(loss_history, label="行为克隆损失", color='red')
plt.xlabel("训练轮数")
plt.ylabel("平均交叉熵损失")
plt.title("模仿学习损失收敛过程")
plt.legend()
plt.grid(True)

# 6.2 迷宫+模仿路径可视化
plt.figure(figsize=(8, 8))
env = MazeEnv()
# 绘制迷宫
for i in range(env.rows):
    for j in range(env.cols):
        if env.maze[i, j] == 1:
            plt.fill_between([j, j+1], env.rows-i-1, env.rows-i, color='black')
        elif (i, j) == env.start:
            plt.fill_between([j, j+1], env.rows-i-1, env.rows-i, color='green')
        elif (i, j) == env.goal:
            plt.fill_between([j, j+1], env.rows-i-1, env.rows-i, color='red')

# 绘制模仿路径
path_x = [p[1]+0.5 for p in path]
path_y = [env.rows - p[0] - 0.5 for p in path]
plt.plot(path_x, path_y, 'b-', linewidth=2, label="模仿路径")

plt.xlim(0, env.cols)
plt.ylim(0, env.rows)
plt.xticks([])
plt.yticks([])
plt.title(f"模仿学习路径(步数:{steps})")
plt.legend()
plt.show()

print("模仿学习最优策略:")
print(imitation_policy)
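
A quick way to quantify how well behavior cloning fit the demonstrations is to count how many demonstrated state-action pairs the learned policy reproduces; a minimal check, assuming expert_demos and imitation_policy from the code above:

matches = sum(1 for (s, a) in expert_demos if imitation_policy[s[0], s[1]] == a)
print(f"Policy reproduces {matches}/{len(expert_demos)} expert state-action pairs")

Note that behavior cloning only ever sees the states the expert visited; in every other cell the weights stay at zero, so the policy there defaults to action 0 (up), which is the usual weakness of pure behavior cloning.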

6.3.2 Inverse Reinforcement Learning

Core idea

Inverse reinforcement learning (IRL): infer a reward function from expert behavior, then run ordinary reinforcement learning with that recovered reward to obtain a policy. The core idea is "recover the reward first, then learn the policy", which suits settings where the reward function is unknown.

Full code (a simplified IRL)
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = ["SimHei", "WenQuanYi Micro Hei", "Heiti TC"]

# 1. Reuse the maze environment from 6.3.1 (assumes that code is saved as imitation_learning.py next to this script)
from imitation_learning import MazeEnv, generate_expert_demos

# 2. 逆向强化学习(最大边际IRL)
def irl_max_margin(demos, epochs=200, lr=0.001):
    env = MazeEnv()
    n_states = env.rows * env.cols
    # 初始化奖励函数参数
    reward_weights = np.random.randn(n_states) * 0.1
    
    # 专家轨迹的状态集合
    expert_states = [s for s, a in demos]
    expert_state_ids = [s[0]*env.cols + s[1] for s in expert_states]
    
    loss_history = []
    for epoch in range(epochs):
        # 1. 用当前奖励函数计算所有状态的价值(值迭代)
        V = np.zeros(n_states)
        theta = 1e-4
        gamma = 0.9
        for _ in range(10):  # 值迭代步数
            V_new = np.zeros_like(V)
            for s_idx in range(n_states):
                x = s_idx // env.cols
                y = s_idx % env.cols
                if (x, y) == env.goal:
                    V_new[s_idx] = 0
                    continue
                # 计算所有动作的价值
                action_values = []
                for a in range(4):
                    # 模拟动作
                    nx, ny = x, y
                    if a == 0 and x > 0 and env.maze[x-1, y] != 1:
                        nx -= 1
                    elif a == 1 and y < env.cols-1 and env.maze[x, y+1] != 1:
                        ny += 1
                    elif a == 2 and x < env.rows-1 and env.maze[x+1, y] != 1:
                        nx += 1
                    elif a == 3 and y > 0 and env.maze[x, y-1] != 1:
                        ny -= 1
                    ns_idx = nx * env.cols + ny
                    action_values.append(reward_weights[s_idx] + gamma * V[ns_idx])
                V_new[s_idx] = max(action_values)
            if np.max(np.abs(V_new - V)) < theta:
                break
            V = V_new
        
        # 2. 计算损失(专家轨迹价值 - 随机轨迹价值 最大化)
        expert_value = np.mean([V[s_idx] for s_idx in expert_state_ids])
        # 随机轨迹(非专家状态)
        random_state_ids = [i for i in range(n_states) if i not in expert_state_ids]
        random_value = np.mean([V[s_idx] for s_idx in random_state_ids])
        # 最大边际损失
        loss = max(0, 1 - (expert_value - random_value))
        loss_history.append(loss)
        
        # 3. 梯度更新奖励函数
        grad = np.zeros_like(reward_weights)
        # 专家状态梯度
        for s_idx in expert_state_ids:
            grad[s_idx] += 1 / len(expert_state_ids)
        # 随机状态梯度
        for s_idx in random_state_ids:
            grad[s_idx] -= 1 / len(random_state_ids)
        
        reward_weights += lr * grad
        reward_weights = np.clip(reward_weights, 0, 1)  # 限制奖励范围
    
    # 反推的奖励函数
    reward_func = reward_weights.reshape(env.rows, env.cols)
    return reward_func, loss_history

# 3. 用反推的奖励函数做Q学习
def q_learning_with_irl_reward(reward_func):
    env = MazeEnv()
    Q = np.zeros((env.rows, env.cols, 4))
    alpha = 0.1
    gamma = 0.9
    epsilon = 0.1
    steps_history = []
    
    for ep in range(200):
        state = env.reset()
        done = False
        steps = 0
        
        while not done and steps < 50:
            x, y = state
            # ε-贪心选动作
            if np.random.uniform() < epsilon:
                action = np.random.choice(4)
            else:
                action = np.argmax(Q[x, y])
            
            # 执行动作
            next_state, _, done = env.step(action)
            nx, ny = next_state
            steps += 1
            
            # 用IRL反推的奖励
            reward = reward_func[x, y]
            if next_state == env.goal:
                reward += 10
                done = True
            
            # Q学习更新
            Q[x, y, action] += alpha * (reward + gamma * np.max(Q[nx, ny]) - Q[x, y, action])
            state = next_state
        
        steps_history.append(steps)
    
    policy = np.argmax(Q, axis=2)
    return policy, steps_history

# 4. 主流程
# 生成专家示范
expert_demos = generate_expert_demos()
# IRL反推奖励函数
reward_func, loss_history = irl_max_margin(expert_demos)
# 用IRL奖励做Q学习
irl_policy, steps_history = q_learning_with_irl_reward(reward_func)

# 5. 可视化
# 5.1 IRL损失变化
plt.figure(figsize=(10, 5))
plt.plot(loss_history, label="IRL损失", color='purple')
plt.xlabel("训练轮数")
plt.ylabel("最大边际损失")
plt.title("逆向强化学习损失收敛")
plt.legend()
plt.grid(True)

# 5.2 反推的奖励函数可视化
plt.figure(figsize=(8, 8))
env = MazeEnv()
for i in range(env.rows):
    for j in range(env.cols):
        if env.maze[i, j] == 1:
            color = 'black'
            text = '墙'
        else:
            color = plt.cm.Reds(reward_func[i, j])
            text = f"{reward_func[i, j]:.2f}"
        plt.fill_between([j, j+1], env.rows-i-1, env.rows-i, color=color)
        plt.text(j+0.5, env.rows-i-0.5, text, ha='center', va='center', color='white' if reward_func[i, j]>0.5 else 'black')

plt.xlim(0, env.cols)
plt.ylim(0, env.rows)
plt.xticks([])
plt.yticks([])
plt.title("IRL反推的奖励函数分布(颜色越深奖励越高)")
plt.show()

# 5.3 IRL+Q学习步数变化
plt.figure(figsize=(10, 5))
plt.plot(steps_history, label="每轮步数", color='blue')
smoothed = np.convolve(steps_history, np.ones(5)/5, mode='valid')
plt.plot(range(4, len(steps_history)), smoothed, label="移动平均", color='red')
plt.xlabel("训练轮数")
plt.ylabel("完成任务步数")
plt.title("IRL+Q学习步数变化")
plt.legend()
plt.grid(True)
plt.show()

print("IRL反推的奖励函数:")
print(reward_func)
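
As a final sanity check, the policy trained with the recovered reward can be rolled out from the start state to confirm that it still reaches the goal; a minimal sketch, assuming MazeEnv and irl_policy from the code above:

env = MazeEnv()
state = env.reset()
irl_path = [state]
for _ in range(50):                                  # cap the rollout length
    action = irl_policy[state[0], state[1]]
    state, _, done = env.step(action)
    irl_path.append(state)
    if done:
        break
print("IRL policy path:", irl_path)
print("Reached the goal:", state == env.goal)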

6.4 Applications of Reinforcement Learning

6.4.1 Mountain Car

Core idea

Use Q-learning to teach a car to drive up a hill (the classic MountainCar environment from OpenAI Gym / Gymnasium). The key ingredients are discretizing the continuous state and learning a tabular Q table.

Full code
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

# 解决中文和负号显示问题
plt.rcParams["font.family"] = ["SimHei", "Microsoft YaHei"]
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['font.sans-serif'] = ['SimHei']


# 1. 环境初始化(适配Gymnasium)
def init_env():
    """初始化爬山小车环境"""
    env = gym.make('MountainCar-v0')
    # 打印环境信息,方便理解状态/动作空间
    print(f"状态空间范围:位置={env.observation_space.low[0]:.2f}~{env.observation_space.high[0]:.2f}, "
          f"速度={env.observation_space.low[1]:.4f}~{env.observation_space.high[1]:.4f}")
    print(f"动作空间:{env.action_space.n}个动作(0=左, 1=不动, 2=右)")
    return env


# 2. 状态离散化(兼容新旧版本返回格式)
def discretize_state(state, bins):
    """
    将连续状态离散化
    :param state: 环境返回的状态(处理tuple/array两种格式)
    :param bins: 离散化的区间
    :return: 离散后的状态索引 (pos_bin, vel_bin)
    """
    # 处理Gymnasium返回的(state, info)元组
    if isinstance(state, tuple):
        state = state[0]

    # 确保state是数组格式
    state = np.array(state, dtype=np.float32)
    pos, vel = state[0], state[1]

    # 离散化(限制索引范围,避免越界)
    pos_bin = np.clip(np.digitize(pos, bins[0]) - 1, 0, len(bins[0]) - 1)
    vel_bin = np.clip(np.digitize(vel, bins[1]) - 1, 0, len(bins[1]) - 1)

    return (pos_bin, vel_bin)


# 3. 创建离散化的bins(优化区间划分)
def create_bins(n_bins=20):
    """
    创建状态离散化的区间
    :param n_bins: 每个维度的分箱数
    :return: [位置bins, 速度bins]
    """
    # 严格匹配MountainCar-v0的状态范围
    pos_min, pos_max = -1.2, 0.6
    vel_min, vel_max = -0.07, 0.07

    # 创建等距区间
    pos_bins = np.linspace(pos_min, pos_max, n_bins)
    vel_bins = np.linspace(vel_min, vel_max, n_bins)

    return [pos_bins, vel_bins]


# 4. Q学习算法(完整鲁棒版)
def mountain_car_q_learning(episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    env = init_env()
    n_bins = 20
    bins = create_bins(n_bins)

    # 初始化Q表:[位置bins, 速度bins, 动作数]
    n_actions = env.action_space.n
    Q = np.zeros((n_bins, n_bins, n_actions), dtype=np.float32)

    # 记录训练过程
    steps_per_episode = []
    rewards_per_episode = []

    for ep in range(episodes):
        # 重置环境(适配Gymnasium的返回格式)
        state, _ = env.reset()
        done = False
        truncated = False  # 新增:Gymnasium的截断标志
        steps = 0
        total_reward = 0

        while not (done or truncated):
            steps += 1

            # 状态离散化
            disc_state = discretize_state(state, bins)
            x, y = disc_state

            # ε-贪心选动作(添加随机种子保证可复现)
            if np.random.uniform(0, 1) < epsilon:
                action = env.action_space.sample()  # 用环境自带的随机采样
            else:
                # 解决Q值相同的情况:随机选最优动作
                q_values = Q[x, y]
                max_q = np.max(q_values)
                best_actions = np.where(q_values == max_q)[0]
                action = np.random.choice(best_actions)

            # 执行动作(适配Gymnasium的返回格式)
            next_state, reward, done, truncated, _ = env.step(action)
            total_reward += reward

            # 离散化下一个状态
            disc_next_state = discretize_state(next_state, bins)
            nx, ny = disc_next_state

            # Q学习核心更新公式
            old_q = Q[x, y, action]
            max_next_q = np.max(Q[nx, ny]) if not (done or truncated) else 0
            Q[x, y, action] = old_q + alpha * (reward + gamma * max_next_q - old_q)

            state = next_state

            # 限制最大步数(防止无限循环)
            if steps > 200:
                truncated = True

        # 记录本轮数据
        steps_per_episode.append(steps)
        rewards_per_episode.append(total_reward)

        # 每100轮打印进度
        if (ep + 1) % 100 == 0:
            avg_steps = np.mean(steps_per_episode[-100:])
            avg_reward = np.mean(rewards_per_episode[-100:])
            print(f"第{ep + 1}轮 | 最近100轮平均步数:{avg_steps:.2f} | 平均奖励:{avg_reward:.2f}")

    env.close()
    return Q, steps_per_episode, rewards_per_episode


# 5. 测试训练好的模型(可视化验证)
def test_mountain_car(Q):
    """测试训练好的Q表,可视化运行过程"""
    # 创建带渲染的环境
    env = gym.make('MountainCar-v0', render_mode='human')
    n_bins = 20
    bins = create_bins(n_bins)

    # 重置环境
    state, _ = env.reset()
    done = False
    truncated = False
    steps = 0

    while not (done or truncated) and steps < 200:
        # 渲染画面
        env.render()

        # 选择最优动作
        disc_state = discretize_state(state, bins)
        x, y = disc_state
        action = np.argmax(Q[x, y])

        # 执行动作
        next_state, _, done, truncated, _ = env.step(action)
        state = next_state
        steps += 1

    env.close()
    print(f"\n测试完成 | 到达山顶步数:{steps}" if done else f"\n测试完成 | 未到达山顶(步数上限):{steps}")
    return steps


# 6. 可视化训练结果
def plot_training_results(steps, rewards):
    """可视化步数和奖励变化"""
    plt.figure(figsize=(12, 5))

    # 子图1:步数变化
    plt.subplot(1, 2, 1)
    plt.plot(steps, label="每轮步数", color='blue', alpha=0.5)
    # 移动平均平滑
    window = 20
    smoothed_steps = np.convolve(steps, np.ones(window) / window, mode='valid')
    plt.plot(range(window - 1, len(steps)), smoothed_steps,
             label=f"移动平均({window})", color='red', linewidth=2)
    plt.xlabel("训练轮数")
    plt.ylabel("完成任务步数")
    plt.title("爬山小车Q学习步数变化(越少越好)")
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 子图2:奖励变化
    plt.subplot(1, 2, 2)
    plt.plot(rewards, label="每轮奖励", color='green', alpha=0.5)
    smoothed_rewards = np.convolve(rewards, np.ones(window) / window, mode='valid')
    plt.plot(range(window - 1, len(rewards)), smoothed_rewards,
             label=f"移动平均({window})", color='orange', linewidth=2)
    plt.xlabel("训练轮数")
    plt.ylabel("总奖励")
    plt.title("爬山小车Q学习奖励变化(越高越好)")
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()


# 7. 主程序入口
if __name__ == "__main__":
    # 安装依赖(如果未安装)
    # !pip install gymnasium numpy matplotlib

    # 训练Q学习模型
    print("开始训练爬山小车Q学习模型...")
    Q_table, steps, rewards = mountain_car_q_learning(episodes=1000)

    # 测试模型(弹出可视化窗口)
    print("\n开始测试训练好的模型...")
    test_steps = test_mountain_car(Q_table)

    # 可视化训练结果
    plot_training_results(steps, rewards)
Prerequisites

# install the dependencies (the code uses the Gymnasium API)
pip install gymnasium numpy matplotlib
Output
  • During training, the steps per episode gradually decrease and the total reward rises
  • Testing opens a render window in which the car can be seen driving up to the flag (a sketch for saving and inspecting the learned Q table follows below)
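
The learned behavior can also be inspected without opening a render window. A minimal sketch, assuming Q_table returned by mountain_car_q_learning above: it saves the table for later reuse and plots the greedy action for every discretized (position, velocity) cell.

# Persist the Q table so training does not have to be repeated
np.save("mountain_car_q_table.npy", Q_table)
# Q_table = np.load("mountain_car_q_table.npy")   # reload in a later session

# Greedy action per discretized state (0=left, 1=idle, 2=right)
greedy_actions = np.argmax(Q_table, axis=2)
plt.figure(figsize=(6, 5))
plt.imshow(greedy_actions.T, origin='lower', cmap='coolwarm', aspect='auto')
plt.colorbar(label="greedy action (0=left, 1=idle, 2=right)")
plt.xlabel("position bin")
plt.ylabel("velocity bin")
plt.title("Greedy policy over the discretized state space")
plt.show()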

6.4.2 Self-Play Gomoku

Core idea

Use Q-learning to train a Gomoku (five-in-a-row) AI. The key ingredients are the state representation (encoding the board), Q-value updates, and the move-selection policy.

Full code
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
plt.rcParams["font.family"] = ["SimHei", "WenQuanYi Micro Hei", "Heiti TC"]

# 1. 五子棋环境实现
class GomokuEnv:
    def __init__(self, board_size=10):
        self.board_size = board_size
        self.board = np.zeros((board_size, board_size), dtype=int)  # 0=空,1=玩家1,2=玩家2
        self.current_player = 1  # 当前下棋玩家
        self.done = False
        self.winner = None
    
    def reset(self):
        """重置棋盘"""
        self.board = np.zeros((self.board_size, self.board_size), dtype=int)
        self.current_player = 1
        self.done = False
        self.winner = None
        return self.board.copy()
    
    def is_win(self, player):
        """判断玩家是否获胜"""
        # 检查横向
        for i in range(self.board_size):
            for j in range(self.board_size - 4):
                if all(self.board[i, j:j+5] == player):
                    return True
        # 检查纵向
        for i in range(self.board_size - 4):
            for j in range(self.board_size):
                if all(self.board[i:i+5, j] == player):
                    return True
        # 检查正斜线
        for i in range(self.board_size - 4):
            for j in range(self.board_size - 4):
                if all(self.board[i+k, j+k] == player for k in range(5)):
                    return True
        # 检查反斜线
        for i in range(4, self.board_size):
            for j in range(self.board_size - 4):
                if all(self.board[i-k, j+k] == player for k in range(5)):
                    return True
        return False
    
    def step(self, action):
        """执行落子动作:action=(x,y)"""
        x, y = action
        
        # 检查动作是否合法
        if self.board[x, y] != 0 or self.done:
            reward = -10  # 非法落子惩罚
            return self.board.copy(), reward, self.done, self.winner
        
        # 落子
        self.board[x, y] = self.current_player
        
        # 检查是否获胜
        if self.is_win(self.current_player):
            self.done = True
            self.winner = self.current_player
            reward = 100  # 获胜奖励
        # 检查是否平局
        elif np.all(self.board != 0):
            self.done = True
            self.winner = 0
            reward = 0  # 平局奖励
        else:
            reward = -1  # 每步惩罚(鼓励尽快获胜)
            # 切换玩家
            self.current_player = 2 if self.current_player == 1 else 1
        
        return self.board.copy(), reward, self.done, self.winner
    
    def render(self):
        """可视化棋盘"""
        fig, ax = plt.subplots(figsize=(8, 8))
        ax.set_xlim(0, self.board_size)
        ax.set_ylim(0, self.board_size)
        
        # 绘制棋盘网格
        for i in range(1, self.board_size):
            ax.axhline(y=i, color='black', linewidth=0.5)
            ax.axvline(x=i, color='black', linewidth=0.5)
        
        # 绘制棋子
        for i in range(self.board_size):
            for j in range(self.board_size):
                if self.board[i, j] == 1:
                    # 黑棋
                    circle = patches.Circle((j+0.5, self.board_size - i - 0.5), 0.4, color='black')
                    ax.add_patch(circle)
                elif self.board[i, j] == 2:
                    # 白棋
                    circle = patches.Circle((j+0.5, self.board_size - i - 0.5), 0.4, color='white', edgecolor='black')
                    ax.add_patch(circle)
        
        ax.set_aspect('equal')
        plt.xticks([])
        plt.yticks([])
        plt.title(f"五子棋(当前玩家:{self.current_player},胜者:{self.winner if self.winner is not None else '无'})")
        plt.show()

# 2. 五子棋Q学习AI
class GomokuQAgent:
    def __init__(self, board_size=10, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.board_size = board_size
        self.alpha = alpha  # 学习率
        self.gamma = gamma  # 折扣因子
        self.epsilon = epsilon  # 探索率
        # Q表:状态哈希值 → 动作Q值
        self.Q = {}
        # 状态编码(简化:只记录最近几步)
        self.state_history = []
    
    def encode_state(self, board):
        """状态编码:将棋盘转换为哈希值"""
        return hash(board.tobytes())
    
    def get_q_value(self, state_hash, action):
        """获取Q值,不存在则返回0"""
        action_key = (action[0], action[1])
        if state_hash not in self.Q:
            self.Q[state_hash] = {}
        return self.Q[state_hash].get(action_key, 0.0)
    
    def update_q_value(self, state_hash, action, value):
        """更新Q值"""
        action_key = (action[0], action[1])
        if state_hash not in self.Q:
            self.Q[state_hash] = {}
        self.Q[state_hash][action_key] = value
    
    def choose_action(self, board, training=True):
        """选择动作(ε-贪心)"""
        # 获取所有合法动作
        legal_actions = [(i, j) for i in range(self.board_size) for j in range(self.board_size) if board[i, j] == 0]
        
        if not legal_actions:
            return None
        
        # 训练时探索,测试时利用
        if training and np.random.uniform(0, 1) < self.epsilon:
            return legal_actions[np.random.randint(len(legal_actions))]  # np.random.choice cannot sample a list of tuples
        
        # 选择最优动作
        state_hash = self.encode_state(board)
        q_values = [self.get_q_value(state_hash, a) for a in legal_actions]
        max_q = max(q_values)
        # 有多个最优动作时随机选
        best_actions = [a for a, q in zip(legal_actions, q_values) if q == max_q]
        return best_actions[np.random.randint(len(best_actions))]  # again avoid np.random.choice on a list of tuples
    
    def learn(self, state, action, reward, next_state, done):
        """Q学习更新"""
        state_hash = self.encode_state(state)
        next_state_hash = self.encode_state(next_state)
        
        # 当前Q值
        old_q = self.get_q_value(state_hash, action)
        
        # 计算目标Q值
        if done:
            target_q = reward
        else:
            # 获取下一个状态的最优Q值
            next_legal_actions = [(i, j) for i in range(self.board_size) for j in range(self.board_size) if next_state[i, j] == 0]
            if next_legal_actions:
                next_q_values = [self.get_q_value(next_state_hash, a) for a in next_legal_actions]
                max_next_q = max(next_q_values)
            else:
                max_next_q = 0
            target_q = reward + self.gamma * max_next_q
        
        # 更新Q值
        new_q = old_q + self.alpha * (target_q - old_q)
        self.update_q_value(state_hash, action, new_q)

# 3. 训练五子棋AI
def train_gomoku_agent(episodes=1000):
    board_size = 8  # 缩小棋盘加快训练
    env = GomokuEnv(board_size)
    agent1 = GomokuQAgent(board_size)  # 玩家1(黑棋)
    agent2 = GomokuQAgent(board_size)  # 玩家2(白棋)
    
    # 记录训练过程
    win_counts = {1: 0, 2: 0, 0: 0}  # 1=玩家1胜,2=玩家2胜,0=平局
    episode_rewards = []
    
    for ep in range(episodes):
        state = env.reset()
        done = False
        total_reward = 0
        
        while not done:
            current_agent = agent1 if env.current_player == 1 else agent2
            # 选择动作
            action = current_agent.choose_action(state)
            if action is None:
                break
            # 执行动作
            next_state, reward, done, winner = env.step(action)
            total_reward += reward
            # 学习
            current_agent.learn(state, action, reward, next_state, done)
            # 更新状态
            state = next_state
        
        # 记录结果
        if winner in win_counts:
            win_counts[winner] += 1
        episode_rewards.append(total_reward)
        
        # 每100轮打印进度
        if (ep+1) % 100 == 0:
            print(f"第{ep+1}轮,胜负统计:玩家1胜{win_counts[1]},玩家2胜{win_counts[2]},平局{win_counts[0]}")
            # 降低探索率
            agent1.epsilon = max(0.01, agent1.epsilon * 0.95)
            agent2.epsilon = max(0.01, agent2.epsilon * 0.95)
    
    return agent1, agent2, win_counts, episode_rewards

# 4. 人机对弈
def play_vs_agent(agent):
    board_size = 8
    env = GomokuEnv(board_size)
    state = env.reset()
    env.render()
    
    while not env.done:
        if env.current_player == 1:
            # 人类玩家落子
            print("\n你的回合(黑棋),请输入落子坐标(x,y),范围0-7:")
            while True:
                try:
                    x, y = map(int, input().split(','))
                    if 0 <= x < board_size and 0 <= y < board_size and env.board[x, y] == 0:
                        break
                    else:
                        print("坐标不合法,请重新输入!")
                except:
                    print("输入格式错误,请输入 x,y 格式!")
            action = (x, y)
        else:
            # AI落子
            print("\nAI回合(白棋),思考中...")
            action = agent.choose_action(state, training=False)
            print(f"AI落子:{action}")
        
        # 执行动作
        next_state, reward, done, winner = env.step(action)
        env.render()
        state = next_state
    
    # 显示结果
    if winner == 1:
        print("恭喜你获胜!")
    elif winner == 2:
        print("AI获胜!")
    else:
        print("平局!")

# 5. 主流程
# 训练AI(注:1000轮训练约需要几分钟)
print("开始训练五子棋AI...")
agent1, agent2, win_counts, episode_rewards = train_gomoku_agent(episodes=1000)

# 可视化训练结果
plt.figure(figsize=(12, 5))
# 奖励变化
plt.subplot(1, 2, 1)
plt.plot(episode_rewards, label="每轮奖励", color='blue', alpha=0.5)
window = 20
smoothed = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
plt.plot(range(window-1, len(episode_rewards)), smoothed, label=f"移动平均({window})", color='red')
plt.xlabel("训练轮数")
plt.ylabel("总奖励")
plt.title("五子棋Q学习奖励变化")
plt.legend()
plt.grid(True)

# 胜负统计
plt.subplot(1, 2, 2)
labels = ['玩家1胜', '玩家2胜', '平局']
counts = [win_counts[1], win_counts[2], win_counts[0]]
plt.bar(labels, counts, color=['black', 'white', 'gray'], edgecolor='black')
plt.xlabel("比赛结果")
plt.ylabel("次数")
plt.title("五子棋AI训练胜负统计")
plt.grid(True, axis='y')
plt.show()

# 人机对弈
print("\n开始人机对弈!")
play_vs_agent(agent2)
Output
  • The quality of play gradually improves as the two agents train against each other
  • After training you can play against the AI: it places its stones automatically and the board is rendered after every move (a sketch for saving the trained agent follows below)
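
Training for 1000 episodes takes a few minutes, so it is worth persisting the trained agent instead of retraining every time. A minimal sketch with the standard pickle module, assuming agent2 from the training run above:

import pickle

# Save the trained white-stone agent (its Q dictionary and hyperparameters)
with open("gomoku_agent2.pkl", "wb") as f:
    pickle.dump(agent2, f)

# In a later session: reload it and play without retraining
with open("gomoku_agent2.pkl", "rb") as f:
    trained_agent = pickle.load(f)
# play_vs_agent(trained_agent)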

6.5 Exercises

Basic exercises

  1. Explain the trade-off between "exploration" and "exploitation" in reinforcement learning, and give an example of how the ε-greedy strategy balances the two.
  2. Compare the core differences between value iteration and policy iteration, and work out the first three value-iteration sweeps for the 4x4 grid world by hand.
  3. Derive the Q-learning update rule and explain why Q-learning is an off-policy algorithm.

Programming exercises

  1. Modify the Q-learning code in 6.2.3 to implement Double Q-learning, and compare its convergence speed with that of plain Q-learning.
  2. In the mountain-car code of 6.4.1, try different state-discretization granularities (10/20/30 bins) and compare the training results.
  3. Improve the Gomoku AI in 6.4.2 by adding an opening book and win-threat detection to raise its playing strength.

Discussion questions

  1. What challenges does reinforcement learning face in practice (e.g., sample efficiency, the combinatorial explosion of the space to explore)? What remedies exist?
  2. Compare the scenarios suited to supervised learning and to reinforcement learning, with examples of when reinforcement learning is the better choice.
  3. Considering your own research or work, which concrete problems could reinforcement learning address?

III. Summary

Key takeaways

  1. Core concepts: the heart of reinforcement learning is an agent that interacts with its environment to maximize cumulative reward; the key elements are state, action, reward, and policy, and the Markov property is the central assumption.
  2. Basic algorithms: value iteration suits small MDPs; temporal-difference (TD) learning needs no complete trajectories; Q-learning is the classic model-free algorithm and supports off-policy learning.
  3. Applications: reinforcement learning handles continuous-control tasks (mountain car) and game-playing decisions (Gomoku); the key ingredients are state representation, reward design, and policy optimization.

📌 All code in this post runs as-is. Run the basic algorithms first to understand the core logic, then modify parameters and extend the functionality to deepen your understanding of reinforcement learning. Questions are welcome in the comments!


About the author: I focus on hands-on machine learning and reinforcement learning, and share accessible algorithm tutorials with complete code. Original content takes effort; please credit the source when reposting. If this post helped you, likes, bookmarks, and follows are appreciated!
