CartPole: from gym to gymnasium

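Recent NumPy releases dropped a number of alias names, np.bool8 among them, and recommend the canonical names such as np.bool_ and np.int_ instead. Older gym releases hard-code np.bool8, so they fail at runtime on a new NumPy. Adding the alias back by hand before the environment is stepped is enough to make gym work again.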
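As a rough sketch of that workaround (the helper name `restore_numpy_aliases` and the alias table are my own illustration, not something gym or NumPy ships), the missing name can be patched back onto the `numpy` module:

```python
import numpy as np

# Hypothetical alias table: names that old gym code may reference,
# mapped to the canonical NumPy scalar types they used to point to.
_LEGACY_ALIASES = {"bool8": np.bool_}


def restore_numpy_aliases():
    """Re-add legacy alias names that the installed NumPy no longer defines."""
    restored = []
    for name, target in _LEGACY_ALIASES.items():
        # Looking in vars(np) avoids touching NumPy's deprecation hook for old names.
        if name not in vars(np):
            setattr(np, name, target)
            restored.append(name)
    return restored


# Call this once, before gym steps any environment.
restore_numpy_aliases()
```

This is only a stopgap for keeping old gym code alive; the gymnasium version further down does not need it.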

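Here is the gym version of CartPole. When I got it running back in 2023, gym had just gone through an API change, and I also had to downgrade NumPy...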

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gym
from collections import deque
import random

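# -------------------------
# 1. Hyperparameters
# -------------------------
BATCH_SIZE = 32      # number of transitions sampled per training step
GAMMA = 0.99         # discount factor (how much future rewards are discounted)
EPSILON_START = 1.0  # initial exploration rate (100% random actions)
EPSILON_END = 0.01   # final exploration rate
EPSILON_DECAY = 500  # decay speed of the exploration rate (in steps)
MEMORY_SIZE = 10000  # replay buffer capacity
TARGET_UPDATE = 10   # update the target network every this many episodes
HIDDEN_SIZE = 64     # hidden layer size of the network
LR = 1e-3            # learning rate
EPISODES = 500       # number of training episodes

# -------------------------
# 2. A simple neural network (Q-network)
# -------------------------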
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# -------------------------
# 3. Experience replay
# -------------------------
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)

# -------------------------
# 4. Main training loop
# -------------------------
def train():
    # Create the environment
    env = gym.make('CartPole-v1')

    # Dimensions of the state and action spaces
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Initialize the networks
    policy_net = DQN(state_dim, action_dim) # policy network (selects actions)
    target_net = DQN(state_dim, action_dim) # target network (computes target Q-values)
    target_net.load_state_dict(policy_net.state_dict()) # both start with identical parameters
    target_net.eval() # the target network is never trained directly

    optimizer = optim.Adam(policy_net.parameters(), lr=LR)
    memory = ReplayBuffer(MEMORY_SIZE)
    
    steps_done = 0

    for episode in range(EPISODES):
        state = env.reset()  # old gym API (before 0.26): returns only the observation
        total_reward = 0

        while True:
            # --- Select an action ---
            # Epsilon-greedy policy: random exploration shrinks as training progresses
            epsilon = EPSILON_END + (EPSILON_START - EPSILON_END) * \
                      np.exp(-1. * steps_done / EPSILON_DECAY)

            if random.random() < epsilon:
                action = env.action_space.sample() # random action
            else:
                # Convert the state to a tensor, feed it to the network,
                # and pick the action with the largest Q-value
                with torch.no_grad():
                    state_tensor = torch.FloatTensor(state).unsqueeze(0)
                    q_values = policy_net(state_tensor)
                    action = q_values.argmax().item()

            # --- Take the action ---
            next_state, reward, done, _ = env.step(action)  # old gym API: 4 return values
            total_reward += reward

            # --- Store the transition ---
            memory.push(state, action, reward, next_state, done)

            # --- Train the network ---
            if len(memory) >= BATCH_SIZE:
                # Sample from the replay buffer
                states, actions, rewards, next_states, dones = memory.sample(BATCH_SIZE)

                # Convert to tensors
                states = torch.FloatTensor(states)
                actions = torch.LongTensor(actions).unsqueeze(1)
                rewards = torch.FloatTensor(rewards).unsqueeze(1)
                next_states = torch.FloatTensor(next_states)
                dones = torch.FloatTensor(dones).unsqueeze(1)

                # Current Q-values
                current_q_values = policy_net(states).gather(1, actions)

                # Target Q-values
                # Q_target = r + gamma * max(Q_target_net(s'))
                with torch.no_grad():
                    max_next_q_values = target_net(next_states).max(1)[0].unsqueeze(1)
                    target_q_values = rewards + GAMMA * max_next_q_values * (1 - dones)

                # Loss
                loss = nn.MSELoss()(current_q_values, target_q_values)

                # Backpropagate and update the parameters
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            state = next_state
            steps_done += 1

            if done:
                break

        # --- Update the target network ---
        if episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())

        # Print progress
        if episode % 10 == 0:
            print(f"Episode: {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.2f}")

    print("Training finished!")
    env.close()

if __name__ == "__main__":
    train()
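
To get a feel for the exploration schedule in the loop above, here is a small standalone sketch that evaluates the same formula with the same hyperparameters. The function name `epsilon_at` is mine, and it uses `math.exp` instead of `np.exp` only to stay dependency-free.

```python
import math

EPSILON_START = 1.0
EPSILON_END = 0.01
EPSILON_DECAY = 500


def epsilon_at(step):
    """Exploration rate after `step` environment steps (exponential decay)."""
    return EPSILON_END + (EPSILON_START - EPSILON_END) * math.exp(-step / EPSILON_DECAY)


# Print the exploration rate at a few points of training.
for step in (0, 500, 2000):
    print(step, round(epsilon_at(step), 3))
```

With these settings epsilon falls to roughly 0.37 after 500 steps and to roughly 0.03 after 2000 steps, so most of the random exploration happens in the first couple of thousand environment steps.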

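Run this today and it fails with module 'numpy' has no attribute 'bool8'.

The error AttributeError: module 'numpy' has no attribute 'bool8' is a compatibility problem caused by a NumPy upgrade.

Cause

numpy.bool8 was marked as deprecated in NumPy 1.24 and removed entirely in NumPy 2.0.

If the code you run (or a third-party library it depends on, here the old gym) still references numpy.bool8 while your environment has a newer NumPy installed, you get this error.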

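Today I wanted to test DQN again, remembered the problem, asked Zhipu GLM, and it fixed the whole thing for me!

In 2022 these models had only just begun to generate working snippets; now they can search and glue things together on their own. Progress in knowledge and information technology is really fast, and it is great! I don't even have to search for a matching question and then map its answer back onto my own problem; today's large models solve it in one go.

This means that in the future we may not even have to learn different languages ourselves, and we are getting ever closer to a complete "flow of meaning"!

(The code comments mark the few places that changed for the new API.)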

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from collections import deque
import random

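# -------------------------
# 1. Hyperparameters
# -------------------------
BATCH_SIZE = 32      # number of transitions sampled per training step
GAMMA = 0.99         # discount factor (how much future rewards are discounted)
EPSILON_START = 1.0  # initial exploration rate (100% random actions)
EPSILON_END = 0.01   # final exploration rate
EPSILON_DECAY = 500  # decay speed of the exploration rate (in steps)
MEMORY_SIZE = 10000  # replay buffer capacity
TARGET_UPDATE = 10   # update the target network every this many episodes
HIDDEN_SIZE = 64     # hidden layer size of the network
LR = 1e-3            # learning rate
EPISODES = 100       # number of training episodes

# -------------------------
# 2. A simple neural network (Q-network)
# -------------------------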
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# -------------------------
# 3. Experience replay
# -------------------------
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)

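# -------------------------
# 4. Main training loop
# -------------------------
def train():
    # Create the environment
    # gymnasium lets you pass render_mode explicitly; leave it out or set it to None if no visualization is needed
    env = gym.make('CartPole-v1', render_mode=None)

    # Dimensions of the state and action spaces
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Initialize the networks
    policy_net = DQN(state_dim, action_dim) # policy network (selects actions)
    target_net = DQN(state_dim, action_dim) # target network (computes target Q-values)
    target_net.load_state_dict(policy_net.state_dict()) # both start with identical parameters
    target_net.eval() # the target network is never trained directly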

    optimizer = optim.Adam(policy_net.parameters(), lr=LR)
    memory = ReplayBuffer(MEMORY_SIZE)
    
    steps_done = 0

    for episode in range(EPISODES):
        # <--- Change 1: gymnasium's reset() returns (observation, info)
        state, _ = env.reset() # take both return values and ignore info

        total_reward = 0

        while True:
            # --- Select an action ---
            # Epsilon-greedy policy: random exploration shrinks as training progresses
            epsilon = EPSILON_END + (EPSILON_START - EPSILON_END) * \
                      np.exp(-1. * steps_done / EPSILON_DECAY)

            if random.random() < epsilon:
                action = env.action_space.sample() # random action
            else:
                # Convert the state to a tensor, feed it to the network,
                # and pick the action with the largest Q-value
                with torch.no_grad():
                    state_tensor = torch.FloatTensor(state).unsqueeze(0)
                    q_values = policy_net(state_tensor)
                    action = q_values.argmax().item()

            # --- Take the action ---
            # <--- Change 2: gymnasium's step() returns 5 values
            next_state, reward, terminated, truncated, _ = env.step(action)

            # <--- Change 3: the episode is over when either terminated or truncated is set
            done = terminated or truncated

            total_reward += reward

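            # --- Store the transition ---
            # Store terminated rather than done, so that an episode cut off by the
            # time limit still bootstraps from next_state in the Q-target
            memory.push(state, action, reward, next_state, terminated)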

            # --- Train the network ---
            if len(memory) >= BATCH_SIZE:
                # Sample from the replay buffer
                states, actions, rewards, next_states, dones = memory.sample(BATCH_SIZE)

                # Convert to tensors
                states = torch.FloatTensor(states)
                actions = torch.LongTensor(actions).unsqueeze(1)
                rewards = torch.FloatTensor(rewards).unsqueeze(1)
                next_states = torch.FloatTensor(next_states)
                dones = torch.FloatTensor(dones).unsqueeze(1)

                # Current Q-values
                current_q_values = policy_net(states).gather(1, actions)

                # Target Q-values
                # Q_target = r + gamma * max(Q_target_net(s'))
                with torch.no_grad():
                    max_next_q_values = target_net(next_states).max(1)[0].unsqueeze(1)
                    target_q_values = rewards + GAMMA * max_next_q_values * (1 - dones)

                # Loss
                loss = nn.MSELoss()(current_q_values, target_q_values)

                # Backpropagate and update the parameters
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            state = next_state
            steps_done += 1

            if done:
                break

        # --- Update the target network ---
        if episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())

        # Print progress
        if episode % 10 == 0:
            print(f"Episode: {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.2f}")

    print("Training finished!")
    env.close()

if __name__ == "__main__":
    train()
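
The three marked changes boil down to two signature differences: `reset()` now returns `(observation, info)`, and `step()` returns five values, with the old `done` split into `terminated` and `truncated`. As an illustration only (the helper names `unpack_reset` and `unpack_step` are hypothetical, not part of gym or gymnasium), a thin adapter that accepts either convention could look like this:

```python
def unpack_reset(result):
    """Return (observation, info) for both old- and new-style reset() results."""
    # New API: a 2-tuple whose second item is the info dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result
    # Old API: the observation alone.
    return result, {}


def unpack_step(result):
    """Return (obs, reward, terminated, truncated, info) for 4- or 5-tuples."""
    if len(result) == 5:
        return result
    obs, reward, done, info = result
    # Assumption: old gym's TimeLimit wrapper flagged a time-limit cutoff
    # under this info key; without it, done is treated as a real termination.
    truncated = bool(info.get("TimeLimit.truncated", False))
    return obs, reward, done and not truncated, truncated, info


# An old-style 4-tuple and a new-style 5-tuple normalize to the same shape.
print(unpack_step(([0.0], 1.0, True, {})))
print(unpack_step(([0.0], 1.0, False, True, {})))
```

For a single script like this one, editing the two call sites directly (as the listing above does) is simpler; an adapter only pays off when the same training code has to run against both libraries.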

And here is a version with visualization:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from collections import deque
import random
import time  # <--- new: used to control the speed of the demo

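# -------------------------
# 1. Hyperparameters
# -------------------------
BATCH_SIZE = 32      # number of transitions sampled per training step
GAMMA = 0.99         # discount factor (how much future rewards are discounted)
EPSILON_START = 1.0  # initial exploration rate (100% random actions)
EPSILON_END = 0.01   # final exploration rate
EPSILON_DECAY = 500  # decay speed of the exploration rate (in steps)
MEMORY_SIZE = 10000  # replay buffer capacity
TARGET_UPDATE = 10   # update the target network every this many episodes
HIDDEN_SIZE = 64     # hidden layer size of the network
LR = 1e-3            # learning rate
EPISODES = 100       # number of training episodes

# -------------------------
# 2. A simple neural network (Q-network)
# -------------------------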
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# -------------------------
# 3. Experience replay
# -------------------------
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)

# -------------------------
# 4. Main training loop
# -------------------------
def train():
    # Create the environment (no rendering during training; None keeps it fast)
    env = gym.make('CartPole-v1', render_mode=None)

    # Dimensions of the state and action spaces
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Initialize the networks
    policy_net = DQN(state_dim, action_dim) # policy network (selects actions)
    target_net = DQN(state_dim, action_dim) # target network (computes target Q-values)
    target_net.load_state_dict(policy_net.state_dict()) # both start with identical parameters
    target_net.eval() # the target network is never trained directly

    optimizer = optim.Adam(policy_net.parameters(), lr=LR)
    memory = ReplayBuffer(MEMORY_SIZE)
    
    steps_done = 0

    print("Starting training...")
    for episode in range(EPISODES):
        state, _ = env.reset()
        total_reward = 0

        while True:
            # --- Select an action ---
            epsilon = EPSILON_END + (EPSILON_START - EPSILON_END) * \
                      np.exp(-1. * steps_done / EPSILON_DECAY)

            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    state_tensor = torch.FloatTensor(state).unsqueeze(0)
                    q_values = policy_net(state_tensor)
                    action = q_values.argmax().item()

            # --- Take the action ---
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
            
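            # --- Store the transition ---
            # Store terminated rather than done, so that an episode cut off by the
            # time limit still bootstraps from next_state in the Q-target
            memory.push(state, action, reward, next_state, terminated)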

            # --- Train the network ---
            if len(memory) >= BATCH_SIZE:
                states, actions, rewards, next_states, dones = memory.sample(BATCH_SIZE)
                
                states = torch.FloatTensor(states)
                actions = torch.LongTensor(actions).unsqueeze(1)
                rewards = torch.FloatTensor(rewards).unsqueeze(1)
                next_states = torch.FloatTensor(next_states)
                dones = torch.FloatTensor(dones).unsqueeze(1)

                current_q_values = policy_net(states).gather(1, actions)
                
                with torch.no_grad():
                    max_next_q_values = target_net(next_states).max(1)[0].unsqueeze(1)
                    target_q_values = rewards + GAMMA * max_next_q_values * (1 - dones)

                loss = nn.MSELoss()(current_q_values, target_q_values)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            state = next_state
            steps_done += 1

            if done:
                break

        # --- Update the target network ---
        if episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())

        if episode % 10 == 0:
            print(f"Episode: {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.2f}")

    print("Training finished!")
    env.close()

    # Return the trained network so it can be tested afterwards
    return policy_net, state_dim, action_dim

# -------------------------
# 5. Test / demo function
# -------------------------
def test(policy_net, state_dim, action_dim):
    print("\nStarting demo...")
    # Create the environment with rendering enabled
    env = gym.make('CartPole-v1', render_mode="human")

    # No weights need to be loaded here:
    # the policy_net passed in already holds the trained weights

    # Run 5 episodes as a demo
    for i in range(5):
        state, _ = env.reset()
        total_reward = 0
        print(f"=== Demo episode {i+1} ===")

        while True:
            env.render() # refresh the window

            # Always pick the greedy action from the network (no random exploration)
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                q_values = policy_net(state_tensor)
                action = q_values.argmax().item()

            # Take the action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward

            state = next_state

            # Add a small delay so the motion is easy to follow
            time.sleep(0.02)

            if done:
                print(f"Episode score: {total_reward}")
                break

    env.close()

if __name__ == "__main__":
    # 1. Train
    trained_net, s_dim, a_dim = train()

    # 2. Demo
    test(trained_net, s_dim, a_dim)
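
A side note on the target formula `r + gamma * max Q(s') * (1 - done)` used in the training loop: the mask decides whether the value of the next state is bootstrapped. A small NumPy sketch (the numbers are made up for illustration, and `td_target` is a name of mine) shows how the target differs when a time-limit truncation is masked as if it were a real failure:

```python
import numpy as np

GAMMA = 0.99


def td_target(rewards, max_next_q, mask):
    """One-step Q-learning target; mask=1 stops bootstrapping from s'."""
    return rewards + GAMMA * max_next_q * (1.0 - mask)


# Two made-up transitions: the first ends because the pole fell (terminated),
# the second only because the step limit was reached (truncated).
rewards = np.array([1.0, 1.0])
max_next_q = np.array([50.0, 50.0])
terminated = np.array([1.0, 0.0])
truncated = np.array([0.0, 1.0])

done = np.maximum(terminated, truncated)
print(td_target(rewards, max_next_q, done))        # both treated as terminal
print(td_target(rewards, max_next_q, terminated))  # the truncated one still bootstraps
```

Masking with `terminated` alone keeps the estimate for a pole that was still balanced at the step limit from being pulled down to the immediate reward.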

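If you run into the error pygame is not installed, run `pip install "gymnasium[classic-control]"`:

The message is quite explicit. To keep the core library lightweight (which I rather like), gymnasium splits its graphical rendering support into optional dependencies.

For a classic control environment such as CartPole-v1, you need to install the classic-control extra.

Fix

Run the following command in a terminal (Terminal or CMD):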

pip install "gymnasium[classic-control]"

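Run the code again after installing

Once the installation has finished, run your Python code again; render_mode="human" will now work, and you will see an animated window of the cart balancing the pole.

Note: if you are running on a remote machine (such as Google Colab, Kaggle, or a lab server), a graphical window usually cannot be opened. In that case set render_mode to rgb_array and draw the returned images with Matplotlib or some other tool. On a local machine, installing as described above is all you need.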
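
For the headless case, one possible sketch is shown below. It assumes `gymnasium[classic-control]` and Matplotlib are installed; the helper name `collect_frames`, the random policy, and the output file name `cartpole_frame.png` are placeholders of mine.

```python
def collect_frames(env, policy, max_steps=50):
    """Run one episode and return the frames produced by env.render()."""
    frames = []
    state, _ = env.reset()
    for _ in range(max_steps):
        frames.append(env.render())  # in rgb_array mode this returns an image array
        state, _, terminated, truncated, _ = env.step(policy(state))
        if terminated or truncated:
            break
    return frames


def main():
    # Imports live here so the helper above can be reused without them.
    import gymnasium as gym
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, no display needed
    import matplotlib.pyplot as plt

    env = gym.make("CartPole-v1", render_mode="rgb_array")
    frames = collect_frames(env, lambda s: env.action_space.sample())
    env.close()

    # Save the last frame of the episode as an image file.
    plt.imshow(frames[-1])
    plt.axis("off")
    plt.savefig("cartpole_frame.png")  # placeholder file name
```

Calling `main()` writes the last frame of one random episode to a PNG; passing a function that wraps the trained `policy_net` instead of the random policy would show the learned behavior.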


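PS: In 2024 some developers did derive an explicit solution, with directly runnable code, for the classic CartPole model from OpenAI Gym (or similar platforms). The key caveat is that this explicit solution still applies to the linearized CartPole model, not the full nonlinear one; the 2024 code simply provides a more complete engineering implementation of "explicit-solution derivation plus simulation verification".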
