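Preface:
This post uses a maze example to study the Q-learning iterative update algorithm.
Rooms are numbered 0 through 4; a green link between two rooms means you can walk from one to the other.
Room 5 is the exit.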
data:image/s3,"s3://crabby-images/bd11e/bd11e875517d6cfa9e900526a2879b47d9cc9882" alt=""
The maze can also be drawn as the graph below:
data:image/s3,"s3://crabby-images/f281b/f281b06b36e0c720cf9f17c68c50940c22fb862f" alt=""
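Contents:
- Policy evaluation
- Policy improvement
- Recursive policy improvement
- Value iteration algorithm
- Maze implementation in Python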
1 Policy evaluation
data:image/s3,"s3://crabby-images/91582/9158231f60628b945bec9d4cd874935f0aa6acb2" alt=""
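The goal of reinforcement learning is to learn a good policy: one that picks the best action in every state.
To evaluate a policy, we measure it with a value function.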
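1.1 State value function V
T-step cumulative reward: $V_T^{\pi}(s) = \mathbb{E}_{\pi}\left[\frac{1}{T}\sum_{t=1}^{T} r_t \mid s_0 = s\right]$
Discounted cumulative reward: $V_{\gamma}^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \mid s_0 = s\right]$
1.2 State-action value function Q
T-step cumulative reward: $Q_T^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\frac{1}{T}\sum_{t=1}^{T} r_t \mid s_0 = s, a_0 = a\right]$
Discounted cumulative reward: $Q_{\gamma}^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \mid s_0 = s, a_0 = a\right]$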
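As a small illustration of the two kinds of cumulative reward, the sketch below computes both for one finite reward sequence. The function names and the sample trajectory are hypothetical, chosen only for this example; the discounted sum leaves the first reward undiscounted.

```python
def t_step_return(rewards):
    """Average of the first T rewards: (1/T) * (r_1 + ... + r_T)."""
    return sum(rewards) / len(rewards)


def discounted_return(rewards, gamma):
    """Discounted sum r_1 + gamma*r_2 + gamma^2*r_3 + ... of a finite trajectory."""
    total = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        total = r + gamma * total
    return total


# Hypothetical trajectory: two moves with reward 0, then a move with reward 100.
rewards = [0, 0, 100]
print(t_step_return(rewards))
print(discounted_return(rewards, 0.8))  # 0 + 0.8*0 + 0.8**2 * 100
```

With a discount of 0.8, a reward that arrives two steps later is worth only 64% of its face value, which is why shorter routes to the exit end up with larger values.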
1.3 Bellman equation expansion
State value function V
data:image/s3,"s3://crabby-images/9457f/9457f190b35f046e9dd466e31fcc4ca9eff2e20f" alt=""
data:image/s3,"s3://crabby-images/0b954/0b95412b32bb9f233ff84d976c9d257b12a06fc8" alt=""
State-action value function Q
data:image/s3,"s3://crabby-images/680a4/680a4388cdc6b739acd1b6aafc339a3abaa81317" alt=""
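The Bellman expansion can be turned directly into an iterative evaluation procedure. The sketch below is one possible illustration, not the author's code: it evaluates a uniformly random policy on the maze, assuming the reward matrix used later in this post, deterministic moves (taking action a lands in room a), and a discount of 0.8. The function name `evaluate_uniform_policy` and the tolerance are assumptions.

```python
import numpy as np

# Reward matrix of the maze: R[s, a] > -1 means there is a door from room s to room a.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])


def evaluate_uniform_policy(R, gamma=0.8, tol=1e-8):
    """Iterate V(s) = mean over valid a of [R(s, a) + gamma * V(a)] until stable."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            valid = np.where(R[s] > -1)[0]  # doors out of room s
            # Deterministic maze: the next state is the room named by the action.
            V_new[s] = np.mean(R[s, valid] + gamma * V[valid])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


V = evaluate_uniform_policy(R)
print(np.round(V, 2))
```

Because the policy is random rather than optimal, these values are lower than the ones an optimal policy would reach.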
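2 Policy improvement
The goal of reinforcement learning: try different policies and find the one whose value function (cumulative reward) is largest.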
data:image/s3,"s3://crabby-images/cc4ff/cc4ff8717981ca656667b0a06be60f42e29ba66e" alt=""
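2.1 Value function of the optimal policy
Because the optimal value function already attains the maximum cumulative reward, the Bellman equation can be modified: the sum over actions is replaced by a maximum over actions. Here $P(s' \mid s, a)$ is the transition probability and $R(s, a, s')$ is the reward.
$V_T^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[\frac{1}{T} R(s, a, s') + \frac{T-1}{T} V_{T-1}^{*}(s')\right]$ ... (1)
$V_{\gamma}^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_{\gamma}^{*}(s')\right]$ ... (2)
Hence
$V^{*}(s) = \max_{a} Q^{*}(s, a)$ ... (3)
The optimal state-action Bellman equations are: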
data:image/s3,"s3://crabby-images/36c46/36c46da1afa0f35622c6fbae911ebac461adc510" alt=""
data:image/s3,"s3://crabby-images/ee9e8/ee9e800a5db3b18ddea235197f2f3baabc2330a4" alt=""
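To make the optimality equations concrete, here is a minimal sketch that repeats the "maximum over actions" backup on the maze until the values stop changing, then checks that the optimal state value equals the largest optimal state-action value. It assumes the reward matrix from the implementation section, deterministic moves, and a discount of 0.8; the helper name `bellman_optimality_backup` and the fixed number of sweeps are assumptions.

```python
import numpy as np

R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma = 0.8
valid = R > -1  # True where a door exists


def bellman_optimality_backup(V):
    """Q[s, a] = R[s, a] + gamma * V[a]; moves without a door are masked with -inf."""
    return np.where(valid, R + gamma * V[np.newaxis, :], -np.inf)


V = np.zeros(R.shape[0])
for _ in range(200):  # 0.8**200 is negligible, so this is effectively converged
    V = bellman_optimality_backup(V).max(axis=1)

Q_star = bellman_optimality_backup(V)
print(np.round(V))
print(np.allclose(Q_star.max(axis=1), V))
```

For this maze the optimal state values work out to 400, 500, 320, 400, 500, 500 for rooms 0 to 5: rooms next to the exit score highest, and each extra step away multiplies the value by 0.8.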
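3 Recursive policy improvement
Let the original policy be $\pi$
and the improved policy be $\pi'$.
The condition for changing the action is: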
data:image/s3,"s3://crabby-images/9bdb3/9bdb390d56cdadb396cf1020e801ed15455004cd" alt=""
data:image/s3,"s3://crabby-images/a3571/a3571dacc4be1306fd7e11d1eb4a5d62184cdda2" alt=""
data:image/s3,"s3://crabby-images/54971/549715c2697f2222c658ef2e55546e984445e2e0" alt=""
...
data:image/s3,"s3://crabby-images/7970f/7970f533d43bff80b688753d09525aa2908eeea9" alt=""
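One round of this improvement step can be sketched on the maze as follows. This is an illustration under assumptions, not the author's implementation: policies are deterministic, moves are deterministic, the discount is 0.8, and the starting policy (always take the lowest-numbered door) is made up for the example.

```python
import numpy as np

R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma = 0.8


def evaluate(policy, sweeps=200):
    """V(s) = R(s, pi(s)) + gamma * V(pi(s)) for a deterministic policy."""
    states = np.arange(len(policy))
    V = np.zeros(len(policy))
    for _ in range(sweeps):
        V = R[states, policy] + gamma * V[policy]
    return V


def improve(V):
    """Greedy step: pi'(s) = argmax over valid a of R(s, a) + gamma * V(a)."""
    Q = np.where(R > -1, R + gamma * V[np.newaxis, :], -np.inf)
    return Q.argmax(axis=1)


# Hypothetical starting policy: in every room take the lowest-numbered door.
policy = (R > -1).argmax(axis=1)
V_old = evaluate(policy)
new_policy = improve(V_old)
V_new = evaluate(new_policy)

print(policy, V_old)
print(new_policy, np.round(V_new))
```

The starting policy only wanders between rooms and never collects a reward, so its values are all zero; after a single greedy step the value of every room is at least as large as before, which is what the inequality chain above guarantees.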
4 Value iteration algorithm
data:image/s3,"s3://crabby-images/1080f/1080f036f95c1816e29b123da0e95556de4589ca" alt=""
4.1 Environment variables
Both the reward R and the Q-table are matrices.
data:image/s3,"s3://crabby-images/fc95d/fc95d19a4188aeba0ac93feabd55d3976dfad06d" alt=""
4.2 Iteration process
How the Q function is updated when the state is 1:
data:image/s3,"s3://crabby-images/57769/57769ee507723eef880990b147e6603004d10560" alt=""
data:image/s3,"s3://crabby-images/9ede7/9ede78182a9c190fe60c15474123719e2cad7465" alt=""
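A single update of this kind can be reproduced in a few lines. The sketch assumes an all-zero Q-table, a discount of 0.8, and that the agent in room 1 happens to choose action 5 (walk to the exit); the choice of action is an assumption made for the example.

```python
import numpy as np

R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma = 0.8
Q = np.zeros((6, 6))  # the Q-table starts at zero

state, action = 1, 5  # from room 1, walk through the door to the exit
next_state = action   # in this maze the action is the room you move to

# Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a')
Q[state, action] = R[state, action] + gamma * Q[next_state].max()
print(Q[state, action])  # 100 + 0.8 * 0 = 100.0
```

Only the entry for (room 1, action 5) changes; every other entry waits until the agent actually visits that state-action pair.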
4.3 Convergence result
data:image/s3,"s3://crabby-images/70b7b/70b7b9454d99fd73f587c3f429645b0ab20439c6" alt=""
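Once the table has converged, the route out of the maze can be read off by always following the largest Q-value. The sketch below is a rough illustration: instead of random episodes it sweeps over every valid state-action pair a fixed number of times (an assumption made to keep the example deterministic), and `greedy_path` is a hypothetical helper name.

```python
import numpy as np

R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma = 0.8
n = R.shape[0]

Q = np.zeros((n, n))
for _ in range(200):  # repeated sweeps over all valid (state, action) pairs
    for s in range(n):
        for a in np.where(R[s] > -1)[0]:
            Q[s, a] = R[s, a] + gamma * Q[a].max()


def greedy_path(Q, start, goal=5):
    """Follow argmax Q from `start` until the exit room is reached."""
    path = [start]
    # The length guard stops the walk if the table has not converged yet.
    while path[-1] != goal and len(path) <= Q.shape[0]:
        path.append(int(Q[path[-1]].argmax()))
    return path


print(np.round(Q))
print(greedy_path(Q, start=2))
```

Starting from room 2, the greedy walk reaches the exit in three moves; rooms 1 and 4 are equally good intermediate stops from room 3, so either route is optimal.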
5 Maze implementation in Python
The reward is stored as a matrix:
Rows: state
Columns: action
Values: reward (-1 marks a move that is not allowed)
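Reading the matrix this way, the doors out of a room are simply the columns of its row that are not -1. A minimal sketch, where the helper name `valid_actions` is an assumption:

```python
import numpy as np

R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])


def valid_actions(R, state):
    """Return the actions (target rooms) reachable from `state`."""
    return np.where(R[state] > -1)[0]


for s in range(R.shape[0]):
    print(s, valid_actions(R, s).tolist())
```

For example, room 3 has doors to rooms 1, 2 and 4, and room 0 can only move to room 4.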
5.1 environment.py: the environment
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 15 11:12:13 2023
@author: chengxf2
"""
import numpy as np
from enum import Enum


class Room(Enum):
    # Room identifiers; room5 (value 5) is the exit checked in step().
    room1 = 1
    room2 = 2
    room3 = 3
    room4 = 4
    room5 = 5


class Environment():

    def action_name(self, action):
        # Map a direction index to a readable name (not used by the maze below).
        if action == 0:
            name = "left"
        elif action == 1:
            name = "up"
        elif action == 2:
            name = "right"
        else:
            name = "down"
        return name

    def __init__(self):
        # R[state, action]: -1 = no door, 0 = door, 100 = door leading to the exit.
        self.R = np.array([[-1, -1, -1, -1,  0,  -1],
                           [-1, -1, -1,  0, -1, 100],
                           [-1, -1, -1,  0, -1,  -1],
                           [-1,  0,  0, -1,  0,  -1],
                           [ 0, -1, -1,  0, -1, 100],
                           [-1,  0, -1, -1,  0, 100]])

    def step(self, state, action):
        # Immediate reward: taking `action` in `state` moves the agent to
        # next_state and yields this reward.
        reward = self.R[state, action]
        next_state = action  # the action is the room to walk into
        done = (action == Room.room5.value)  # reaching room 5 ends the episode
        return next_state, reward, done
5.2 main.py: the Agent
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 15 11:29:14 2023
@author: chengxf2
"""
import numpy as np
from environment import Environment


class Agent():

    def __init__(self, env):
        self.discount_factor = 0.8  # discount rate
        self.theta = 1e-3           # maximum deviation (not used below)
        self.nS = 6                 # number of states
        self.nA = 6                 # number of actions
        self.Q = np.zeros((self.nS, self.nA))
        self.env = env
        self.episode = 500          # number of training episodes

    def one_step_lookahead(self, env, state, action):
        # Take one step and look up the best Q-value available in next_state.
        next_state, reward, done = env.step(state, action)
        maxQ_sa = max(self.Q[next_state, :])
        return next_state, reward, done, maxQ_sa

    def value_iteration(self, env, state, discount_factor=1.0):
        # Walk from `state` until the exit (room 5) is reached,
        # updating the Q-table after every move.
        done = False
        while not done:
            # Pick a random action among the allowed ones (reward is not -1).
            indices = np.where(env.R[state] > -1)[0]
            action = np.random.choice(indices, 1)[0]
            next_state, reward, done, maxQ_sa = self.one_step_lookahead(env, state, action)
            # Update the current Q-value; keep it as a float, because
            # truncating to int would bias the value the table converges to.
            self.Q[state, action] = reward + discount_factor * maxQ_sa
            state = next_state

    def learn(self):
        for n in range(self.episode):  # maximum number of episodes
            # Start every episode from a random state.
            state = np.random.randint(0, self.nS)
            # Each episode runs until the exit, room 5, is reached.
            self.value_iteration(self.env, state, discount_factor=self.discount_factor)
        print(self.Q)


if __name__ == "__main__":
    env = Environment()
    agent = Agent(env)
    agent.learn()