50. Q-learning: Algorithm and Implementation

1. Review of the Principles

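Q-learning is a form of model-free reinforcement learning (RL), and it can also be viewed as a method of asynchronous dynamic programming (DP). By experiencing the consequences of its actions, an agent learns to act optimally in a Markovian domain without having to build a model of that domain. Learning proceeds much as in temporal-difference (TD) methods: the agent tries an action in a particular state and evaluates the consequences in terms of the immediate reward or penalty it receives and its estimate of the value of the state it ends up in. By repeatedly trying all actions in all states, it learns which actions are best overall, as judged by long-term discounted reward.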
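To make "long-term discounted reward" concrete, here is a minimal sketch that computes the discounted return of a reward sequence. The function name `discounted_return` and the sample rewards are illustrative assumptions and do not appear in the example later in this article.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t]."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: the only reward (1.0) arrives on the third step,
# so it is discounted twice by gamma.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.95))
```

The later a reward arrives, the less it contributes, which is why a smaller `gamma` makes the agent more short-sighted.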

When Q-learning uses the $\epsilon$-greedy strategy, it selects an action according to the following rule, where $p$ is a random number drawn uniformly from $[0, 1)$:
$$
a = \begin{cases} \text{random action}, & \text{if } p < \epsilon \\ \arg\max_{a} Q(s,a), & \text{if } \epsilon \le p < 1 \end{cases}
$$
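The selection rule can be sketched as a small standalone function. This is only an illustration: the name `epsilon_greedy`, the sample Q-values, and the use of NumPy's `default_rng` are assumptions, not part of the example later in this article.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Choose an action index from one row of a Q-table."""
    p = rng.uniform(0.0, 1.0)  # p is drawn from [0, 1)
    if p < epsilon:
        # Explore: pick any action uniformly at random
        return int(rng.integers(len(q_row)))
    # Exploit: pick the action with the largest Q-value
    return int(np.argmax(q_row))


rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.5, 0.2, 0.0])  # hypothetical Q-values for one state

# With epsilon = 0 the choice is always greedy, here index 1.
print(epsilon_greedy(q_row, epsilon=0.0, rng=rng))
```

With `epsilon=1.0` the same function ignores the Q-values entirely and explores on every call.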

The Q-value matrix is updated with the following rule, where $\alpha$ is the learning rate and $\gamma$ is the discount factor:
$$
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
$$
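A single application of the update rule, scaled by a learning rate, can be sketched as follows. The helper `q_update` and the tiny 3-state, 2-action table are hypothetical and chosen only so the arithmetic is easy to follow; `alpha` plays the role of `learning_rate` in the example later in this article.

```python
import numpy as np


def q_update(q_table, state, action, reward, new_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q-value."""
    # TD target: immediate reward plus discounted best value of the next state
    td_target = reward + gamma * np.max(q_table[new_state])
    # TD error: how far the current estimate is from the target
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return q_table[state, action]


q = np.zeros((3, 2))
q[2, 1] = 1.0  # pretend state 2 already has a known good action

# Moving from state 1 to state 2 with zero reward still raises Q[1, 0],
# because value propagates backwards: 0.8 * (0 + 0.95 * 1.0 - 0).
q_update(q, state=1, action=0, reward=0.0, new_state=2, alpha=0.8, gamma=0.95)
print(q[1, 0])
```

Note that the current estimate is subtracted outside the `max`, so it is not multiplied by `gamma`.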

2. Example

import gym
import numpy as np

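# Create the environment and record the sizes of the action and state spaces.
# Note: render_mode="human" draws every step, which makes 10000 training
# episodes very slow; drop the argument to train without rendering.
env = gym.make("FrozenLake-v1", render_mode="human")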
action_size = env.action_space.n
state_size = env.observation_space.n

# Initialize the Q-table to zeros
q_table = np.zeros([state_size, action_size])

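episodes = 10000  # number of training episodes
learning_rate = 0.8  # learning rate
max_steps = 99  # maximum number of steps per episode
gamma = 0.95  # discount factor

epsilon = 1.0  # exploration threshold for epsilon-greedy
# Parameters for decaying epsilon
max_epsilon = 1
min_epsilon = 0.01
decay_rate = 0.005

reward_list = []  # total reward of each episode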

for episode in range(episodes):
    # Reset the environment to get the initial state
    state = env.reset()
    state = state[0]
    step = 0
    done = False
    total_rewards = 0

    for step in range(max_steps):
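        # epsilon-greedy: if the random draw exceeds epsilon, take the action
        # with the largest Q-value; otherwise take a random action
        exp_number = np.random.uniform(0, 1)
        if exp_number > epsilon:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()
        # env.step returns (observation, reward, terminated, truncated, info)
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # Update the Q-table with the Q-learning rule
        q_table[state, action] = q_table[state, action] + learning_rate * (
                reward + gamma * np.max(q_table[new_state, :]) - q_table[state, action]
        )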
        total_rewards += reward
        state = new_state
        if done:
            print('*****' * 10)
            print('episode:', episode, 'reward:', total_rewards, 'numbers of steps:', step, 'epsilon:', epsilon)
            break
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    reward_list.append(total_rewards)

print(q_table)
print(reward_list)

# Test the learned Q-table
for episode in range(5):
    state = env.reset()
    state = state[0]
    print(q_table[0, :])
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(100):

        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(q_table[state, :])

        # env.step returns (observation, reward, terminated, truncated, info)
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        if done:
            # Only render the last state, to see whether the agent reached the goal or fell into a hole
            env.render()

            # We print the number of step it took.
            print("Number of steps", step)
            break
        state = new_state
env.close()
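
The exploration schedule used in the training loop can be examined on its own. The sketch below wraps the same exponential-decay expression in a hypothetical helper, `epsilon_at`, with the hyperparameter values from the script as defaults, and prints the threshold at a few episode indices.

```python
import math


def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.005):
    """Exploration threshold after `episode` episodes of exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Epsilon starts at max_epsilon and approaches min_epsilon as training goes on.
for ep in (0, 100, 500, 1000, 5000):
    print(ep, round(epsilon_at(ep), 4))
```

With `decay_rate = 0.005`, most of the exploration happens in roughly the first thousand episodes; after that the agent mostly exploits the Q-table it has learned.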

[Figure: the Frozen Lake problem]

[Figure: the Q-value matrix]