CartPole: from gym to gymnasium

Newer NumPy releases removed the np.bool8-style shorthand aliases (along with the long-deprecated builtin-shadowing aliases such as np.bool and np.int, which were easy to confuse with Python's native types); np.bool_, np.int_, and friends are the recommended replacements. But old gym code hard-codes np.bool8, so it raises an error at runtime. Manually patching these aliases back in lets gym work again.
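A minimal sketch of that kind of compatibility patch (a stopgap, not an official fix; patch only the aliases your old code actually touches, and run it before importing gym):

```python
import numpy as np

# Re-add an alias that newer NumPy removed, so old gym code can still find it.
# Only patch when the attribute is actually missing.
if not hasattr(np, "bool8"):
    np.bool8 = np.bool_
```

The cleaner long-term fix is migrating to gymnasium, which is what the rest of this post does.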

The gym version of CartPole: back in 2023, when I first tried to get it running, I happened to hit the library's big reorganization and even had to downgrade NumPy...

python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gym
from collections import deque
import random

# -------------------------
# 1. Hyperparameters
# -------------------------
BATCH_SIZE = 32      # number of samples drawn per training step
GAMMA = 0.99         # discount factor (how much future rewards are discounted)
EPSILON_START = 1.0  # initial exploration rate (100% random actions)
EPSILON_END = 0.01   # final exploration rate
EPSILON_DECAY = 500  # exploration-rate decay speed
MEMORY_SIZE = 10000  # replay buffer capacity
TARGET_UPDATE = 10   # update the target network every this many episodes
HIDDEN_SIZE = 64     # hidden-layer size of the network
LR = 1e-3            # learning rate
EPISODES = 500       # number of training episodes

# -------------------------
# 2. A simple neural network (the Q-network)
# -------------------------
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# -------------------------
# 3. Experience replay
# -------------------------
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)

# -------------------------
# 4. Main training loop
# -------------------------
def train():
    # Create the environment
    env = gym.make('CartPole-v1')

    # Get the state and action dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Initialize the networks
    policy_net = DQN(state_dim, action_dim) # policy network (selects actions)
    target_net = DQN(state_dim, action_dim) # target network (computes target Q-values)
    target_net.load_state_dict(policy_net.state_dict()) # start with identical parameters
    target_net.eval() # the target network is never trained directly

    optimizer = optim.Adam(policy_net.parameters(), lr=LR)
    memory = ReplayBuffer(MEMORY_SIZE)
    
    steps_done = 0

    for episode in range(EPISODES):
        state = env.reset()
        total_reward = 0
        
        while True:
            # --- Select an action ---
            # Epsilon-greedy: gradually reduce random exploration as training progresses
            epsilon = EPSILON_END + (EPSILON_START - EPSILON_END) * \
                      np.exp(-1. * steps_done / EPSILON_DECAY)
            
            if random.random() < epsilon:
                action = env.action_space.sample() # random action
            else:
                # Convert the state to a tensor, run it through the network, pick the action with the highest Q-value
                with torch.no_grad():
                    state_tensor = torch.FloatTensor(state).unsqueeze(0)
                    q_values = policy_net(state_tensor)
                    action = q_values.argmax().item()

            # --- Take the action ---
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            
            # --- Store the experience ---
            memory.push(state, action, reward, next_state, done)

            # --- Train the network ---
            if len(memory) >= BATCH_SIZE:
                # Sample from the replay buffer
                states, actions, rewards, next_states, dones = memory.sample(BATCH_SIZE)
                
                # Convert to tensors
                states = torch.FloatTensor(states)
                actions = torch.LongTensor(actions).unsqueeze(1)
                rewards = torch.FloatTensor(rewards).unsqueeze(1)
                next_states = torch.FloatTensor(next_states)
                dones = torch.FloatTensor(dones).unsqueeze(1)

                # Current Q-values
                current_q_values = policy_net(states).gather(1, actions)
                
                # Target Q-values
                # Q_target = r + gamma * max(Q_target_net(s'))
                with torch.no_grad():
                    max_next_q_values = target_net(next_states).max(1)[0].unsqueeze(1)
                    target_q_values = rewards + GAMMA * max_next_q_values * (1 - dones)

                # Compute the loss
                loss = nn.MSELoss()(current_q_values, target_q_values)

                # Backpropagate and update parameters
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            state = next_state
            steps_done += 1

            if done:
                break

        # --- Update the target network ---
        if episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())

        # Print progress
        if episode % 10 == 0:
            print(f"Episode: {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.2f}")

    print("Training complete!")
    env.close()

if __name__ == "__main__":
    train()

If you run this today, it fails with module 'numpy' has no attribute 'bool8'.

The error AttributeError: module 'numpy' has no attribute 'bool8' is a compatibility problem caused by a NumPy version upgrade.

Root cause

numpy.bool8 was deprecated in NumPy 1.24 and removed entirely in NumPy 2.0.

If your code (or some third-party library, such as old gym) still references numpy.bool8 while your environment has a newer NumPy installed, you will hit this error.
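A quick diagnostic sketch to check whether your environment is affected:

```python
import numpy as np

print(np.__version__)
# False here means the alias is gone and old gym code that uses it will crash.
print(hasattr(np, "bool8"))
```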

Today, wanting to test DQN again, it occurred to me to ask Zhipu's GLM, and it actually fixed everything for me!

From barely generating working snippets in 2022 to searching and stitching things together on its own today, the pace of progress in knowledge and information technology is astonishing. I no longer have to hunt for a matching question, understand its answer, and map it onto my own problem; today's large models solve it in one shot.

It suggests that in the future you may not even need to learn different languages yourself; we are getting ever closer to a pure "flow of meaning"!

(The code comments mark the change points for the new version.)

python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from collections import deque
import random

# -------------------------
# 1. Hyperparameters
# -------------------------
BATCH_SIZE = 32      # number of samples drawn per training step
GAMMA = 0.99         # discount factor (how much future rewards are discounted)
EPSILON_START = 1.0  # initial exploration rate (100% random actions)
EPSILON_END = 0.01   # final exploration rate
EPSILON_DECAY = 500  # exploration-rate decay speed
MEMORY_SIZE = 10000  # replay buffer capacity
TARGET_UPDATE = 10   # update the target network every this many episodes
HIDDEN_SIZE = 64     # hidden-layer size of the network
LR = 1e-3            # learning rate
EPISODES = 100       # number of training episodes

# -------------------------
# 2. A simple neural network (the Q-network)
# -------------------------
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# -------------------------
# 3. Experience replay
# -------------------------
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)

# -------------------------
# 4. Main training loop
# -------------------------
def train():
    # Create the environment
    # gymnasium recommends specifying render_mode explicitly; omit it or set it to None if you don't need visualization
    env = gym.make('CartPole-v1', render_mode=None)

    # Get the state and action dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Initialize the networks
    policy_net = DQN(state_dim, action_dim) # policy network (selects actions)
    target_net = DQN(state_dim, action_dim) # target network (computes target Q-values)
    target_net.load_state_dict(policy_net.state_dict()) # start with identical parameters
    target_net.eval() # the target network is never trained directly

    optimizer = optim.Adam(policy_net.parameters(), lr=LR)
    memory = ReplayBuffer(MEMORY_SIZE)
    
    steps_done = 0

    for episode in range(EPISODES):
        # <--- Change 1: gymnasium's reset() return value
        state, _ = env.reset() # unpack two return values; ignore the second (info)
        
        total_reward = 0
        
        while True:
            # --- Select an action ---
            # Epsilon-greedy: gradually reduce random exploration as training progresses
            epsilon = EPSILON_END + (EPSILON_START - EPSILON_END) * \
                      np.exp(-1. * steps_done / EPSILON_DECAY)
            
            if random.random() < epsilon:
                action = env.action_space.sample() # random action
            else:
                # Convert the state to a tensor, run it through the network, pick the action with the highest Q-value
                with torch.no_grad():
                    state_tensor = torch.FloatTensor(state).unsqueeze(0)
                    q_values = policy_net(state_tensor)
                    action = q_values.argmax().item()

            # --- Take the action ---
            # <--- Change 2: gymnasium's step() returns 5 values
            next_state, reward, terminated, truncated, _ = env.step(action)

            # <--- Change 3: combine terminated and truncated into done
            done = terminated or truncated
            
            total_reward += reward
            
            # --- Store the experience ---
            memory.push(state, action, reward, next_state, done)

            # --- Train the network ---
            if len(memory) >= BATCH_SIZE:
                # Sample from the replay buffer
                states, actions, rewards, next_states, dones = memory.sample(BATCH_SIZE)
                
                # Convert to tensors
                states = torch.FloatTensor(states)
                actions = torch.LongTensor(actions).unsqueeze(1)
                rewards = torch.FloatTensor(rewards).unsqueeze(1)
                next_states = torch.FloatTensor(next_states)
                dones = torch.FloatTensor(dones).unsqueeze(1)

                # Current Q-values
                current_q_values = policy_net(states).gather(1, actions)
                
                # Target Q-values
                # Q_target = r + gamma * max(Q_target_net(s'))
                with torch.no_grad():
                    max_next_q_values = target_net(next_states).max(1)[0].unsqueeze(1)
                    target_q_values = rewards + GAMMA * max_next_q_values * (1 - dones)

                # Compute the loss
                loss = nn.MSELoss()(current_q_values, target_q_values)

                # Backpropagate and update parameters
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            state = next_state
            steps_done += 1

            if done:
                break

        # --- Update the target network ---
        if episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())

        # Print progress
        if episode % 10 == 0:
            print(f"Episode: {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.2f}")

    print("Training complete!")
    env.close()

if __name__ == "__main__":
    train()

Here is a version with visualization:

python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from collections import deque
import random
import time  # <--- New: used to control the speed of the demo playback

# -------------------------
# 1. Hyperparameters
# -------------------------
BATCH_SIZE = 32      # number of samples drawn per training step
GAMMA = 0.99         # discount factor (how much future rewards are discounted)
EPSILON_START = 1.0  # initial exploration rate (100% random actions)
EPSILON_END = 0.01   # final exploration rate
EPSILON_DECAY = 500  # exploration-rate decay speed
MEMORY_SIZE = 10000  # replay buffer capacity
TARGET_UPDATE = 10   # update the target network every this many episodes
HIDDEN_SIZE = 64     # hidden-layer size of the network
LR = 1e-3            # learning rate
EPISODES = 100       # number of training episodes

# -------------------------
# 2. A simple neural network (the Q-network)
# -------------------------
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# -------------------------
# 3. Experience replay
# -------------------------
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)

# -------------------------
# 4. Main training loop
# -------------------------
def train():
    # Create the environment (no rendering during training; None keeps it fast)
    env = gym.make('CartPole-v1', render_mode=None)

    # Get the state and action dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Initialize the networks
    policy_net = DQN(state_dim, action_dim) # policy network (selects actions)
    target_net = DQN(state_dim, action_dim) # target network (computes target Q-values)
    target_net.load_state_dict(policy_net.state_dict()) # start with identical parameters
    target_net.eval() # the target network is never trained directly

    optimizer = optim.Adam(policy_net.parameters(), lr=LR)
    memory = ReplayBuffer(MEMORY_SIZE)
    
    steps_done = 0

    print("Starting training...")
    for episode in range(EPISODES):
        state, _ = env.reset()
        total_reward = 0
        
        while True:
            # --- Select an action ---
            epsilon = EPSILON_END + (EPSILON_START - EPSILON_END) * \
                      np.exp(-1. * steps_done / EPSILON_DECAY)
            
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    state_tensor = torch.FloatTensor(state).unsqueeze(0)
                    q_values = policy_net(state_tensor)
                    action = q_values.argmax().item()

            # --- Take the action ---
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
            
            # --- Store the experience ---
            memory.push(state, action, reward, next_state, done)

            # --- Train the network ---
            if len(memory) >= BATCH_SIZE:
                states, actions, rewards, next_states, dones = memory.sample(BATCH_SIZE)
                
                states = torch.FloatTensor(states)
                actions = torch.LongTensor(actions).unsqueeze(1)
                rewards = torch.FloatTensor(rewards).unsqueeze(1)
                next_states = torch.FloatTensor(next_states)
                dones = torch.FloatTensor(dones).unsqueeze(1)

                current_q_values = policy_net(states).gather(1, actions)
                
                with torch.no_grad():
                    max_next_q_values = target_net(next_states).max(1)[0].unsqueeze(1)
                    target_q_values = rewards + GAMMA * max_next_q_values * (1 - dones)

                loss = nn.MSELoss()(current_q_values, target_q_values)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            state = next_state
            steps_done += 1

            if done:
                break

        # --- Update the target network ---
        if episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())

        if episode % 10 == 0:
            print(f"Episode: {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.2f}")

    print("Training complete!")
    env.close()
    
    # Return the trained network so it can be tested afterwards
    return policy_net, state_dim, action_dim

# -------------------------
# 5. Test/demo function
# -------------------------
def test(policy_net, state_dim, action_dim):
    print("\nStarting demo...")
    # Create the environment with rendering enabled
    env = gym.make('CartPole-v1', render_mode="human")

    # Nothing to load here: the policy_net passed in already holds the trained weights

    # Run 5 demo episodes
    for i in range(5):
        state, _ = env.reset()
        total_reward = 0
        print(f"=== Demo episode {i+1} ===")
        
        while True:
            env.render() # with render_mode="human", gymnasium renders each step automatically; this explicit call is optional
            
            # Let the network pick the greedy action (no random exploration)
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                q_values = policy_net(state_tensor)
                action = q_values.argmax().item()

            # Take the action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
            
            state = next_state
            
            # A small delay so the motion is visible to the eye
            time.sleep(0.02) 

            if done:
                print(f"Episode score: {total_reward}")
                break
                
    env.close()

if __name__ == "__main__":
    # 1. Train
    trained_net, s_dim, a_dim = train()
    
    # 2. Demo
    test(trained_net, s_dim, a_dim)

If you hit the error "pygame is not installed, run `pip install "gymnasium[classic-control]"`":

The error message is quite explicit. To keep the core library lightweight (which I like), gymnasium splits the rendering functionality into optional extra dependencies.

For classic-control environments like CartPole-v1, you need the classic-control extra.

Fix

Run the following in a terminal (Terminal or CMD):

bash
pip install "gymnasium[classic-control]"

Run the code again after installing

Once the install finishes, rerun your Python code; render_mode="human" will now work, and you will see an animated window with the cart balancing the pole.

Note: if you are running on a remote server (Google Colab, Kaggle, a lab server), a graphical window usually cannot be opened. In that case, set render_mode to "rgb_array" and draw the frames with Matplotlib or some other tool. On a local machine, the installation above is all you need.


PS: In 2024, some developers did derive directly runnable explicit-solution code for the classic CartPole (cart-pole) model in OpenAI Gym (or similar platforms). The key caveat: that explicit solution still targets the linearized CartPole model, not the full nonlinear one; the 2024 code simply packaged the derivation and simulation verification into a more complete engineering implementation.
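For context on what "linearized" means here: the standard linearization of the cart-pole about the upright equilibrium gives an unstable linear system, which is what such explicit solutions work with. A small illustrative sketch (the symbols and parameter values below are textbook conventions, not taken from any particular 2024 code):

```python
import numpy as np

# Cart mass M, pole mass m, pole length l, gravity g (illustrative values).
M, m, l, g = 1.0, 0.1, 0.5, 9.8

# State [x, x_dot, theta, theta_dot]; input = horizontal force on the cart.
A = np.array([
    [0, 1, 0, 0],
    [0, 0, -m * g / M, 0],
    [0, 0, 0, 1],
    [0, 0, (M + m) * g / (M * l), 0],
])
B = np.array([[0.0], [1 / M], [0.0], [-1 / (M * l)]])

# The upright equilibrium is unstable: one eigenvalue has a positive real part,
# which is why a stabilizing controller (explicit or learned) is needed at all.
eigs = np.linalg.eigvals(A)
print(max(e.real for e in eigs) > 0)  # → True
```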
