Series Contents
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning
Contents
- Series Contents
- Preface
- Stochastic Algorithms
- TD Learning of State Values
- Sarsa
  - Base Sarsa
  - Expected Sarsa
  - n-step Sarsa
- Summary
Preface
This series records my study notes for Prof. Zhao Shiyu's (赵世钰) course "Mathematical Foundations of Reinforcement Learning" on Bilibili. For the original course content, see:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Stochastic Algorithms
First: Simple mean estimation problem
Calculate $w = \mathbb{E}[X]$ based on some i.i.d. samples $\{x\}$ of $X$.
- By writing $g(w) = w - \mathbb{E}[X]$, we can reformulate the problem as a root-finding problem: $g(w) = 0$.
- Since we can only obtain samples $\{x\}$ of $X$, the noisy observation is
$$\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta.$$
- Then, according to the RM algorithm, solving $g(w)=0$ gives
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k (w_k - x_k).$$
Problem background
We want to compute the expectation of a random variable $X$:
$$w = \mathbb{E}[X],$$
but we cannot obtain $\mathbb{E}[X]$ directly; we only have a set of i.i.d. samples $\{x\}$ drawn from $X$.
Reformulation as root finding
We rewrite the problem as
$$g(w) = w - \mathbb{E}[X] = 0.$$
In other words, if we can find the $w$ satisfying $g(w)=0$, that solution is exactly $\mathbb{E}[X]$.
Noisy observation
Because we can only observe samples $x$, what we actually obtain is not $g(w)$ but a noisy observation:
$$\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta,$$
where $\eta = \mathbb{E}[X] - x$ is the noise, whose expectation is $0$.
RM update rule
The Robbins-Monro (RM) algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[X]$. The update rule is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k),$$
and substituting the noisy observation above gives
$$w_{k+1} = w_k - \alpha_k (w_k - x_k).$$
Intuition
- The current estimate is $w_k$.
- We receive a new sample $x_k$.
- The difference $w_k - x_k$ corrects the estimate: if $w_k$ is larger than the sample $x_k$, the update moves it down; otherwise it moves it up.
- $\alpha_k$ is the step size, usually decreasing with the iteration count (e.g., $\alpha_k = 1/k$) to guarantee convergence.
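As a minimal sketch (assuming nothing beyond the formulas above; the sample distribution and the step size $\alpha_k = 1/k$ are illustrative choices), the RM mean-estimation iteration could look like this:

```python
import numpy as np

def rm_mean_estimate(samples):
    """Estimate E[X] with the RM update w_{k+1} = w_k - alpha_k (w_k - x_k)."""
    w = 0.0  # initial guess
    for k, x in enumerate(samples, start=1):
        alpha = 1.0 / k          # decreasing step size
        w = w - alpha * (w - x)  # move the estimate toward the new sample
    return w

# With alpha_k = 1/k this reduces to the incremental sample mean.
rng = np.random.default_rng(0)
print(rm_mean_estimate(rng.normal(loc=3.0, scale=1.0, size=10_000)))  # ≈ 3.0
```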
Second: A more complex problem
Estimate the mean of a function $v(X)$: $w = \mathbb{E}[v(X)]$, based on some i.i.d. random samples $\{x\}$ of $X$.
- To solve this problem, we define
$$g(w) = w - \mathbb{E}[v(X)],$$
$$\tilde{g}(w,\eta) = w - v(x) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta.$$
- Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k [w_k - v(x_k)].$$
Problem background
Instead of estimating the mean of $X$ itself, we now want to estimate the expectation of a function $v(X)$:
$$w = \mathbb{E}[v(X)],$$
where $v(\cdot)$ is a known function and $X$ is a random variable. We can obtain i.i.d. samples $\{x\}$ of $X$, but we cannot compute $\mathbb{E}[v(X)]$ directly.
Reformulation as root finding
As in the mean estimation problem, we rewrite the goal as a root-finding problem:
$$g(w) = w - \mathbb{E}[v(X)] = 0.$$
Noisy observation
We cannot compute $\mathbb{E}[v(X)]$ directly, but we can observe it through samples $x$. Define the noisy observation
$$\tilde{g}(w, \eta) = w - v(x).$$
Expanding it,
$$\tilde{g}(w, \eta) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta,$$
where $\eta = \mathbb{E}[v(X)] - v(x)$ is zero-mean noise.
RM update rule
The Robbins-Monro algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[v(X)]$. The update rule is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$$
Substituting $\tilde{g}(w,\eta)$ gives
$$w_{k+1} = w_k - \alpha_k [w_k - v(x_k)].$$
Intuition
- The current estimate is $w_k$.
- We draw a new sample $x_k$ and evaluate $v(x_k)$.
- The update gradually pulls $w_k$ toward $v(x_k)$.
- After many iterations, $w_k$ converges to the average over samples, i.e., $\mathbb{E}[v(X)]$.
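For instance (a quick check, under the assumed step size $\alpha_k = 1/k$, so that $\alpha_1 = 1$ discards the initial guess at the first step), the update reduces by induction to the running sample average of $v(x_1), \ldots, v(x_k)$:

$$w_{k+1} = w_k - \tfrac{1}{k}\big(w_k - v(x_k)\big) = \tfrac{k-1}{k}\,w_k + \tfrac{1}{k}\,v(x_k) \;\Longrightarrow\; w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} v(x_i),$$

which is exactly the incremental sample-mean estimator of $\mathbb{E}[v(X)]$.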
Third: An even more complex problem
Calculate $w = \mathbb{E}[R + \gamma v(X)]$, where $R, X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function.
- Suppose we can obtain samples $\{x\}$ and $\{r\}$ of $X$ and $R$. We define
$$g(w) = w - \mathbb{E}[R + \gamma v(X)],$$
$$\tilde{g}(w,\eta) = w - [r + \gamma v(x)] = (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]) \doteq g(w) + \eta.$$
- Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k \big[w_k - (r_k + \gamma v(x_k))\big].$$
Problem background
We want to estimate the expectation
$$w = \mathbb{E}[R + \gamma v(X)],$$
where:
- $R, X$ are random variables;
- $\gamma$ is a constant;
- $v(\cdot)$ is a function.
That is, the target combines the random reward $R$ and the discounted function value $\gamma v(X)$ in a single expectation.
Reformulation as root finding
Rewrite the problem as solving $g(w) = 0$ with
$$g(w) = w - \mathbb{E}[R + \gamma v(X)].$$
Clearly, the solution is $w^\star = \mathbb{E}[R + \gamma v(X)]$.
Noisy observation
Since we cannot obtain $\mathbb{E}[R + \gamma v(X)]$ directly, we observe it through samples $(r, x)$:
$$\tilde{g}(w, \eta) = w - [r + \gamma v(x)].$$
Expanding it,
$$\tilde{g}(w, \eta) = (w - \mathbb{E}[R + \gamma v(X)]) + \big(\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]\big),$$
so we can write
$$\tilde{g}(w, \eta) \doteq g(w) + \eta,$$
where $\eta$ is a zero-mean noise term.
RM update rule
The Robbins-Monro algorithm updates $w$ recursively, approaching the solution step by step:
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$$
Substituting the noisy observation gives
$$w_{k+1} = w_k - \alpha_k \Big(w_k - \big(r_k + \gamma v(x_k)\big)\Big).$$
Intuition
- The current estimate is $w_k$;
- the sample $(r_k, x_k)$ gives an approximate target value $r_k + \gamma v(x_k)$;
- the update moves $w_k$ toward this approximate target;
- as the number of iterations grows, $w_k$ converges to $\mathbb{E}[R + \gamma v(X)]$.
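A minimal sketch of this third RM iteration, assuming paired samples `(r, x)` and a known function `v` (the distributions and the function are arbitrary illustrative choices):

```python
import numpy as np

def rm_estimate(r_samples, x_samples, v, gamma=0.9):
    """RM iteration for w = E[R + gamma * v(X)]."""
    w = 0.0
    for k, (r, x) in enumerate(zip(r_samples, x_samples), start=1):
        alpha = 1.0 / k
        target = r + gamma * v(x)      # sample of R + gamma * v(X)
        w = w - alpha * (w - target)   # pull the estimate toward the sample target
    return w

rng = np.random.default_rng(0)
r = rng.normal(1.0, 0.5, size=10_000)          # samples of R
x = rng.uniform(0.0, 1.0, size=10_000)         # samples of X
print(rm_estimate(r, x, v=lambda s: 2.0 * s))  # ≈ E[R] + 0.9 * E[2X] = 1.9
```

This is exactly the structure that TD learning exploits: the target $r + \gamma v(x)$ below becomes the TD target.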
TD Learning of State Values
Algorithm description
- The data/experience required by the algorithm:
  - $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$, generated following the given policy $\pi$.
- The TD learning algorithm is
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$$
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$$
  - where $t = 0,1,2,\ldots$. Here, $v_t(s_t)$ is the estimate of the state value $v_\pi(s_t)$, and $\alpha_t(s_t)$ is the learning rate of $s_t$ at time $t$.
  - At time $t$, only the value of the visited state $s_t$ is updated, whereas the values of the unvisited states $s \neq s_t$ remain unchanged.
Background
In reinforcement learning, we want to estimate the state value function of a given policy $\pi$:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi],$$
where $G_t$ is the discounted return obtained starting from state $s$.
However, we usually have no model of the environment (transition probabilities / reward distribution), so we cannot solve the Bellman equation directly; we can only update the estimates from sampled trajectories.
Data required by the algorithm
- The algorithm only needs trajectories sampled under the policy $\pi$:
  - a full trajectory $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$, or
  - a set of transition triples $\{(s_t, r_{t+1}, s_{t+1})\}_t$.
- This means TD learning can learn online: a single step of experience (state, reward, next state) is enough to perform an update.
TD update rule
Update the visited state:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$$
Leave the unvisited states unchanged:
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$$
Interpretation
The update pulls the value estimate of the current state $s_t$ toward the TD target:
$$\text{TD target} = r_{t+1} + \gamma v_t(s_{t+1}),$$
i.e., "one-step reward + discounted value of the next state".
The size of the update is controlled by the TD error:
$$\delta_t = \big(r_{t+1} + \gamma v_t(s_{t+1})\big) - v_t(s_t),$$
which measures the gap between the prediction and the one-step observation.
The learning rate $\alpha_t(s_t)$ determines the update magnitude:
- a larger $\alpha$ updates faster but less stably;
- a smaller $\alpha$ updates more slowly but more stably.
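A minimal sketch of this tabular TD(0) evaluation loop, assuming transition triples `(s, r, s_next)` already sampled under the given policy (the data format and hyperparameters are illustrative):

```python
from collections import defaultdict

def td0_evaluate(transitions, gamma=0.9, alpha=0.1):
    """Tabular TD(0): update only the visited state toward r + gamma * V(s')."""
    V = defaultdict(float)  # initial guess V(s) = 0 for every state
    for s, r, s_next in transitions:
        td_target = r + gamma * V[s_next]
        td_error = V[s] - td_target
        V[s] = V[s] - alpha * td_error  # equation (1); other states stay unchanged, as in (2)
    return dict(V)

# transitions = [(s0, r1, s1), (s1, r2, s2), ...] collected by running the policy.
```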
Algorithm properties
- The TD algorithm can be annotated as
$$v_{t+1}(s_t) = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\Big[\underbrace{v_t(s_t) - \underbrace{[r_{t+1} + \gamma v_t(s_{t+1})]}_{\text{TD target } \bar v_t}}_{\text{TD error } \delta_t}\Big]. \quad (3)$$
- Here,
$$\bar v_t \doteq r_{t+1} + \gamma v(s_{t+1})$$
  is called the TD target, and
$$\delta_t \doteq v(s_t) - [r_{t+1} + \gamma v(s_{t+1})] = v(s_t) - \bar v_t$$
  is called the TD error.
- It is clear that the new estimate $v_{t+1}(s_t)$ is a combination of the current estimate $v_t(s_t)$ and the TD error.
- First, why is $\bar v_t$ called the TD target?
  - Because the algorithm drives $v(s_t)$ toward $\bar v_t$. To see this,
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$$
$$\implies v_{t+1}(s_t) - \bar v_t = v_t(s_t) - \bar v_t - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$$
$$\implies v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)][v_t(s_t) - \bar v_t]$$
$$\implies |v_{t+1}(s_t) - \bar v_t| = |1 - \alpha_t(s_t)|\,|v_t(s_t) - \bar v_t|$$
  - Since $\alpha_t(s_t)$ is a small positive number, we have
$$0 < 1 - \alpha_t(s_t) < 1.$$
  - Therefore,
$$|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|,$$
  which means $v(s_t)$ is driven toward $\bar v_t$.
- Second, what is the interpretation of the TD error?
$$\delta_t = v(s_t) - [r_{t+1} + \gamma v(s_{t+1})]$$
  - It is a difference between two consecutive time steps.
  - It reflects the discrepancy between $v_t$ and $v_\pi$. To see this, denote
$$\delta_{\pi,t} \doteq v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})].$$
  - Note that
$$\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] = 0.$$
  - If $v_t = v_\pi$, then $\delta_t$ should be zero (in the expectation sense).
  - Hence, if $\delta_t$ is not zero, then $v_t$ is not equal to $v_\pi$.
  - The TD error can be interpreted as an innovation, i.e., new information obtained from the experience $(s_t, r_{t+1}, s_{t+1})$.
- Other properties:
  - The TD algorithm in (3) only estimates the state value of a given policy.
  - It does not estimate the action values.
  - It does not search for optimal policies.
  - Later, we will see how to estimate action values and then search for optimal policies.
  - Nonetheless, the TD algorithm in (3) is fundamental for understanding the core idea.
Explanation of TD Algorithm Properties
Recap of the TD update rule
The core TD update is
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big].$$
- It has three parts:
  - the current estimate: $v_t(s_t)$
  - the TD target: $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$
  - the TD error: $\delta_t = v_t(s_t) - \bar v_t$
Therefore, the essence of the update is
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\delta_t,$$
i.e., the current estimate minus a fraction of its gap to the TD target.
Why is $\bar v_t$ called the TD target?
Intuition
- The algorithm aims to make $v(s_t)$ approach $\bar v_t$.
- Every update shrinks the gap between the two.
Derivation
$$v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)]\,[v_t(s_t) - \bar v_t].$$
Because the learning rate satisfies $0 < \alpha_t(s_t) < 1$, every iteration gives
$$|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|.$$
This shows the estimate converges toward the TD target step by step.
Interpretation of the TD error
Definition:
$$\delta_t = v(s_t) - \big(r_{t+1} + \gamma v(s_{t+1})\big).$$
Meaning:
- it is the difference between two consecutive time steps;
- it reflects the gap between the current estimate $v_t$ and the true value function $v_\pi$.
Key fact:
- $\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = 0$ when $v_t = v_\pi$.
- If $\delta_t = 0$, the estimate is consistent with the one-step observation;
- if $\delta_t \neq 0$, the estimate deviates from the true value.
Intuition:
- The TD error can be viewed as an innovation, i.e., the new information obtained from each piece of experience.
- The update corrects the estimate step by step through the TD error.
Further Explanation: TD Target & TD Error
TD target $\bar v_t$
$$\bar v_t \doteq r_{t+1} + \gamma v(s_{t+1})$$
- Intuitive meaning:
  - It is an estimate of the return from the next moment on, built from the immediate reward $r_{t+1}$ and the value of the next state $v(s_{t+1})$.
  - It can be viewed as a one-step lookahead: start from $s_t$, take one step, collect the immediate reward, and add the estimated value of the next state.
- Analogy:
  - If $v(s_t)$ is our prediction of a house price, then $\bar v_t$ is the corrected target based on "the latest transaction price + the expected future trend".
  - Each update moves the prediction toward this corrected target.
- Why is it a target?
  In the Bellman equation,
$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s],$$
  the TD target $\bar v_t$ is exactly a sample approximation of the right-hand side, so the TD target is a local realization of the Bellman equation.
TD error $\delta_t$
$$\delta_t = v(s_t) - \big(r_{t+1} + \gamma v(s_{t+1})\big) = v(s_t) - \bar v_t$$
- Intuitive meaning:
  - It is the gap between the current estimate and the TD target.
  - If the gap is zero, the prediction matches the target; otherwise, its magnitude and sign tell us how to update.
- Interpretation 1: prediction error
  - $\delta_t$ asks: "How far is my prediction $v(s_t)$ from the target $\bar v_t$ corrected by the actually observed reward?"
- Interpretation 2: learning signal
  $\delta_t$ is what drives the update:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\,\delta_t$$
  - If $\delta_t > 0$, the prediction is too high and should decrease;
  - if $\delta_t < 0$, the prediction is too low and should increase.
- Interpretation 3: innovation
  - In statistics, an innovation is the difference between new information and the existing prediction.
  - In TD, $\delta_t$ is the new information obtained from experience; it measures the difference between "what we learned" and "what we believed".
Relationship between TD target and TD error
- The TD target provides the learning target;
- the TD error measures how far the current prediction is from that target.
The update rule is
$$\text{new estimate} = \text{old estimate} - \text{learning rate} \times \text{TD error}.$$
- In short:
  - TD target = "where I should go"
  - TD error = "how far I am from it"
  - TD update = "take a small step toward the target"
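A tiny worked example with made-up numbers, just to make one update concrete: suppose $\gamma = 0.9$, $\alpha_t(s_t) = 0.1$, $v_t(s_t) = 5$, $v_t(s_{t+1}) = 4$, and the observed reward is $r_{t+1} = 1$. Then

$$\bar v_t = 1 + 0.9 \times 4 = 4.6, \qquad \delta_t = 5 - 4.6 = 0.4, \qquad v_{t+1}(s_t) = 5 - 0.1 \times 0.4 = 4.96,$$

so the estimate of $s_t$ moves slightly downward, toward the TD target.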
The idea of the algorithm
- First, a new expression of the Bellman equation.
- The definition of the state value of $\pi$ is
$$v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S}, \quad (4)$$
  where $G$ is the discounted return. Since
$$\mathbb{E}[G \mid S = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s,a)\, v_\pi(s') = \mathbb{E}[v_\pi(S') \mid S = s],$$
  where $S'$ is the next state, we can rewrite (4) as
$$v_\pi(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid S = s], \quad s \in \mathcal{S}. \quad (5)$$
- Equation (5) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation, an important tool for designing and analyzing TD algorithms.
- It says that the value of the current state can be expressed as "one-step reward plus the expected value of the next state".
- Second, solve the Bellman equation in (5) using the RM algorithm.
- In particular, by defining
$$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s],$$
  we can rewrite (5) as
$$g(v(s)) = 0.$$
- Since we can only obtain the samples $r$ and $s'$ of $R$ and $S'$, the noisy observation we have is
$$\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')] = \underbrace{\big(v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s]\big)}_{g(v(s))} + \underbrace{\big(\mathbb{E}[R + \gamma v_\pi(S') \mid s] - [r + \gamma v_\pi(s')]\big)}_{\eta}.$$
Solving the Bellman equation with the RM algorithm
- The Bellman equation is a fixed-point equation:
$$v(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid s].$$
- We can rewrite it as a root-finding problem:
$$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s] = 0.$$
- Why rewrite it as $g(v(s))=0$?
  - So that we can apply the stochastic approximation (Robbins-Monro, RM) algorithm.
  - RM is designed precisely for root finding in the presence of noise.
- What RM does
  - We cannot compute $\mathbb{E}[R + \gamma v_\pi(S') \mid s]$ directly; we can only approximate it through samples $(r, s')$. So we define
$$\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')].$$
  - The RM update is
$$v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big).$$
  - The meaning of this step:
    - use the sample $(r_k, s'_k)$ to construct the approximate correction term $\tilde g$;
    - then keep updating $v_k(s)$ so that it approaches the solution of the Bellman equation.
- Therefore, the RM algorithm for solving $g(v(s)) = 0$ is
$$v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big), \quad k = 1,2,3,\ldots \quad (6)$$
  - where $v_k(s)$ is the estimate of $v_\pi(s)$ at the $k$th step, and $r_k, s'_k$ are the samples of $R, S'$ obtained at the $k$th step.
- The RM algorithm in (6) relies on two assumptions that deserve special attention:
  - we must have the experience set $\{(s, r, s')\}$ for $k=1,2,3,\ldots$;
  - we assume that $v_\pi(s')$ is already known for any $s'$.
- To remove these two assumptions, we can modify the algorithm:
  - one modification is to change $\{(s,r,s')\}$ to $\{(s_t, r_{t+1}, s_{t+1})\}$ so that the algorithm can utilize the sequential samples in an episode;
  - another modification is to replace $v_\pi(s')$ with an estimate of it, because we do not know it in advance.
Connection with TD learning
- The RM algorithm above has two limitations:
  - it requires $v_\pi(s')$, which we do not actually know;
  - it requires samples $(s, r, s')$, preferably the data of whole episodes.
- To remove these limitations:
  - replace the true value $v_\pi(s')$ with the current estimate $v(s')$;
  - use the sequential samples $(s_t, r_{t+1}, s_{t+1})$ for step-by-step updates.
- This yields the TD update rule:
$$v_{t+1}(s_t) = v_t(s_t) + \alpha_t \big[ r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \big],$$
  where
  - TD target: $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$
  - TD error: $\delta_t = \bar v_t - v_t(s_t)$
Intuition
- For every state $s$ along the trajectory, keep applying the TD update until convergence.
- In effect, "all the states help each other converge gradually."
Algorithm convergence
- By the TD algorithm (1), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$, provided that $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.
- Remarks:
  - This theorem says that the state value can be found by the TD algorithm for a given policy $\pi$.
  - $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must hold for all $s \in \mathcal{S}$. At time step $t$, if $s = s_t$, meaning that $s$ is visited at time $t$, then $\alpha_t(s) > 0$; otherwise, $\alpha_t(s) = 0$ for all other $s \neq s_t$. This requires every state to be visited an infinite (or sufficiently large) number of times.
  - The learning rate $\alpha$ is often selected as a small constant. In this case, the condition $\sum_t \alpha_t^2(s) < \infty$ no longer holds. When $\alpha$ is constant, it can still be shown that the algorithm converges in the expectation sense.
Statement of the theorem
- Under suitable conditions, TD learning converges to the true state value function $v_\pi(s)$.
- The conditions are:
  - $\sum_t \alpha_t(s) = \infty$ (the learning rates must sum to infinity, so there are effectively infinitely many updates)
  - $\sum_t \alpha_t^2(s) < \infty$ (the learning rates must shrink, so the updates settle down instead of oscillating)
Explanation
- The first condition ensures that TD keeps absorbing new information and does not stop learning too early.
- The second condition ensures that TD does not oscillate around the solution because of overly large update steps.
Choice of learning rate
- In practice, $\alpha_t(s)$ is often set to a small constant (e.g., $0.1$).
- Strictly speaking, this violates $\sum_t \alpha_t^2(s) < \infty$, but convergence to a neighborhood of $v_\pi(s)$ can still be guaranteed in the expectation sense.
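As a small illustrative sketch (not from the lecture), one common way to satisfy both conditions per state is a visit-count-based step size $\alpha_t(s) = 1/N_t(s)$, where $N_t(s)$ counts the visits to $s$; the class name below is a made-up helper:

```python
from collections import defaultdict

class VisitCountLR:
    """Per-state step size alpha_t(s) = 1 / N_t(s): the harmonic series diverges,
    while the sum of its squares is finite, matching the two conditions above."""
    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, s):
        self.counts[s] += 1
        return 1.0 / self.counts[s]

# Usage inside a TD loop:
#   lr = VisitCountLR()
#   V[s] -= lr(s) * (V[s] - (r + gamma * V[s_next]))
```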
Algorithm properties
| TD/Sarsa learning | MC learning |
| --- | --- |
| Online: TD learning is online. It can update the state/action values immediately after receiving a reward. | Offline: MC learning is offline. It has to wait until an episode has been completely collected. |
| Continuing tasks: since TD learning is online, it can handle both episodic and continuing tasks. | Episodic tasks: since MC learning is offline, it can only handle episodic tasks that have terminal states. |
| Bootstrapping: TD bootstraps because the update of a value relies on the previous estimate of that value. Hence, it requires initial guesses. | Non-bootstrapping: MC does not bootstrap, because it can directly estimate state/action values without any initial guess. |
| Low estimation variance: TD has lower variance than MC because it involves fewer random variables. For instance, Sarsa only requires $R_{t+1}, S_{t+1}, A_{t+1}$. | High estimation variance: to estimate $q_\pi(s_t, a_t)$, we need samples of $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$, which involves many random variables over the length $L$ of each episode. |

TD (Temporal-Difference) learning and MC (Monte Carlo) learning are both model-free methods, but they differ significantly.
- Online vs. offline
  - TD updates online: as soon as the reward $r_{t+1}$ is received, $v(s_t)$ can be updated immediately.
  - MC updates offline: it must wait until the whole episode ends before computing the return and updating.
  - Implication: TD suits real-time learning scenarios, while MC suits settings where complete trajectories are collected.
- Task types
  - TD can handle both episodic tasks (with terminal states) and continuing tasks (without terminal states).
  - MC can only handle episodic tasks, because it needs the complete return as the update target.
  - Implication: TD is more flexible and suits long-running systems (e.g., an agent learning over an infinite horizon).
- Bootstrapping
  - TD uses bootstrapping: the update relies on the current estimates (e.g., $v(s_{t+1})$).
  - MC is non-bootstrapping: it updates directly with the complete return $G_t$.
  - Implication: TD can learn faster because it does not have to wait for a whole episode, but it depends on the initial values.
- Estimation variance
  - TD: low variance, because each update depends only on one immediate reward and the next-state estimate.
  - MC: high variance, because the return contains many random variables, making the estimate less stable (though unbiased).
  - Implication: TD usually converges faster and more stably than MC, but bootstrapping introduces bias because it relies on approximate estimates.
Sarsa
Base Sarsa
Sarsa algorithm
- First, our aim is to estimate the action values of a given policy $\pi$.
- Suppose we have some experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$.
- We can use the following Sarsa algorithm to estimate the action values:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big) \Big],$$
$$q_{t+1}(s,a) = q_t(s,a), \quad \forall (s,a) \neq (s_t, a_t),$$
  - where $t = 0,1,2,\ldots$;
  - $q_t(s_t,a_t)$ is an estimate of $q_\pi(s_t, a_t)$;
  - $\alpha_t(s_t,a_t)$ is the learning rate depending on $(s_t,a_t)$.
Sarsa update rule
- Current estimate: $q_t(s_t,a_t)$
- TD target: $r_{t+1} + \gamma q_t(s_{t+1},a_{t+1})$
  - $r_{t+1}$: the immediate reward received after taking action $a_t$.
  - $\gamma q_t(s_{t+1}, a_{t+1})$: the estimated long-term return of continuing from $(s_{t+1}, a_{t+1})$.
- TD error: $q_t(s_t,a_t) - \text{TD target}$
At each update, Sarsa pulls $q(s_t,a_t)$ a little closer to the target "one-step reward + estimated value of the next action".
Intuition
- We have an old estimate $q_t(s_t,a_t)$ of how good $(s_t,a_t)$ is.
- Now we observe one real reward $r_{t+1}$, together with the estimate $q_t(s_{t+1},a_{t+1})$ of the next state-action pair.
- Together they provide a new target (the TD target).
- Sarsa does not replace the old estimate outright; it gradually pulls $q(s_t,a_t)$ toward this target.
Relationship between Sarsa and TD
- Replacing $v(s)$ in the TD algorithm with $q(s,a)$ gives Sarsa.
- Sarsa is the action-value version of TD learning.
Mathematical expression
- The Sarsa algorithm solves
$$q_\pi(s,a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid s,a], \quad \forall s,a.$$
- This is another expression of the Bellman equation, written in terms of action values.
Theorem (Convergence of Sarsa learning)
- By the Sarsa algorithm, $q_t(s,a)$ converges with probability $1$ to the action value $q_\pi(s,a)$ as $t \to \infty$, for all $(s,a)$, if $\sum_t \alpha_t(s,a) = \infty$ and $\sum_t \alpha_t^2(s,a) < \infty$.
- Remarks:
  - This theorem says that the action value can be found by Sarsa for a given policy $\pi$.
Pseudocode: Policy searching by Sarsa
- For each episode, do
  - If the current $s_t$ is not the target state, do
    - Collect the experience $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$:
      - take action $a_t \sim \pi_t(s_t)$;
      - observe $r_{t+1}, s_{t+1}$;
      - take action $a_{t+1} \sim \pi_t(s_{t+1})$.
    - Update the q-value:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \big) \Big]$$
    - Update the policy ($\epsilon$-greedy):
$$\pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}|} (|\mathcal{A}| - 1), \quad \text{if } a = \arg\max_a q_{t+1}(s_t, a)$$
$$\pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}|}, \quad \text{otherwise}$$
The Sarsa pseudocode boils down to three steps:
- Sample an interaction: $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$
- Update the Q-value: pull it toward "immediate reward + next-step Q"
- Update the policy: adjust the action probabilities according to the latest Q-values
This gives an online reinforcement learning procedure that "learns the Q-values and improves the policy at the same time"; a minimal code sketch follows.
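A minimal tabular Sarsa sketch of these three steps, assuming a Gym-style environment interface (`env.reset()`, `env.step(a)`, `env.action_space.n`) with discrete, hashable states; the interface and hyperparameters are illustrative, not from the lecture:

```python
import numpy as np
from collections import defaultdict

def sarsa(env, num_episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Sarsa with an epsilon-greedy policy."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))

    def eps_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)  # explore
        return int(np.argmax(Q[s]))              # exploit

    for _ in range(num_episodes):
        s, _ = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = eps_greedy(s_next)
            td_target = r + gamma * Q[s_next][a_next] * (not terminated)
            Q[s][a] -= alpha * (Q[s][a] - td_target)  # pull Q(s,a) toward the TD target
            s, a = s_next, a_next  # the policy improves implicitly through eps_greedy
    return Q
```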
Remarks about Sarsa
- The policy at $s_t$ is updated immediately after $q(s_t,a_t)$ is updated, following the idea of generalized policy iteration (GPI).
- The policy is $\epsilon$-greedy instead of greedy, which balances exploitation and exploration.
Core idea vs. complication
- Core idea: use an algorithm to solve the Bellman equation of a given policy.
- Complication: it emerges when we try to find optimal policies and to work efficiently.
Expected Sarsa
Algorithm
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma\, \mathbb{E}[q_t(s_{t+1}, A)]\big)\Big],$$
$$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s,a) \neq (s_t,a_t),$$
- where
$$\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$
  is the expected value of $q_t(s_{t+1}, a)$ under policy $\pi_t$.
Compared to Sarsa:
- The TD target is changed from $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ (as in Sarsa) to $r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]$ (as in Expected Sarsa).
- It needs more computation, but it is beneficial in that it reduces the estimation variance: the random variables involved shrink from $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$ in Sarsa to $\{s_t, a_t, r_{t+1}, s_{t+1}\}$.
What does the algorithm do mathematically?
- Expected Sarsa is a stochastic approximation algorithm for solving the following equation:
$$q_\pi(s,a) = \mathbb{E}\Big[ R_{t+1} + \gamma\, \mathbb{E}_{A_{t+1} \sim \pi(S_{t+1})}\big[q_\pi(S_{t+1}, A_{t+1})\big] \,\Big|\, S_t=s, A_t=a \Big], \quad \forall s,a.$$
- The above equation is another expression of the Bellman equation:
$$q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a].$$
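A small sketch of how the Expected Sarsa target differs from the Sarsa target, assuming a tabular `Q[s]` array and an explicit policy distribution `pi[s]` over actions (both names are illustrative):

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    # uses the single sampled next action -> more randomness in the target
    return r + gamma * Q[s_next][a_next]

def expected_sarsa_target(Q, pi, r, s_next, gamma=0.9):
    # averages over all next actions under the current policy pi(.|s_next)
    expected_q = np.dot(pi[s_next], Q[s_next])  # sum_a pi(a|s') q(s',a) = v(s')
    return r + gamma * expected_q

# Either target is then plugged into:  Q[s][a] -= alpha * (Q[s][a] - target)
```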
n-step Sarsa
Introduction
- The definition of the action value is
$$q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a].$$
- The discounted return $G_t$ can be written in different forms:
  - Sarsa:
$$G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}),$$
$$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}),$$
$$\vdots$$
  - n-step Sarsa:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}),$$
$$\vdots$$
  - MC:
$$G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$
- It should be noted that
$$G_t = G_t^{(1)} = G_t^{(2)} = \cdots = G_t^{(n)} = \cdots = G_t^{(\infty)},$$
  where the superscripts merely indicate the different decomposition structures of $G_t$.
Algorithm analysis
- Sarsa aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(1)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s,a].$$
- MC learning aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(\infty)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s,a].$$
- An intermediate algorithm called n-step Sarsa aims to solve
$$q_\pi(s,a) = \mathbb{E}[G_t^{(n)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s,a].$$
- The algorithm of n-step Sarsa is
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big].$$
- $n$-step Sarsa is more general: it becomes the (one-step) Sarsa algorithm when $n=1$ and the MC learning algorithm when $n=\infty$.
Properties
- $n$-step Sarsa needs the experience
$$(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n}).$$
- Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we cannot implement $n$-step Sarsa at step $t$. However, we can wait until time $t+n$ to update the q-value of $(s_t,a_t)$:
$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t,a_t)\Big[q_{t+n-1}(s_t,a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\Big].$$
- Since $n$-step Sarsa includes Sarsa and MC learning as two extreme cases, its performance is a blend of the two:
  - if $n$ is large, its performance is close to MC learning and hence has a large variance but a small bias;
  - if $n$ is small, its performance is close to Sarsa and hence has a relatively large bias (due to the initial guess) but a relatively low variance.
- Finally, $n$-step Sarsa is also a policy evaluation algorithm. It can be combined with a policy improvement step to search for optimal policies. A small sketch of the $n$-step target follows.
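A minimal sketch of computing the $n$-step TD target once the rewards $r_{t+1},\ldots,r_{t+n}$ and the tail estimate $q(s_{t+n}, a_{t+n})$ have been collected at time $t+n$ (the function name and argument layout are illustrative):

```python
def n_step_target(rewards, q_tail, gamma=0.9):
    """Compute r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n}).

    rewards: [r_{t+1}, ..., r_{t+n}] collected between step t and step t+n.
    q_tail:  current estimate q(s_{t+n}, a_{t+n}) that bootstraps the rest of the return.
    """
    target = 0.0
    for k, r in enumerate(rewards):  # k = 0 corresponds to r_{t+1}
        target += (gamma ** k) * r
    target += (gamma ** len(rewards)) * q_tail
    return target

# n = 1 recovers the Sarsa target; letting n grow toward the episode length approaches the MC return.
```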
Summary
Sarsa is on-policy: it learns and evaluates the policy it is currently executing. Q-learning is off-policy: it samples with an arbitrary behavior policy yet converges toward the optimal policy. TD methods are the unifying core framework: through the TD target and the TD error, the value estimates are pushed step by step toward the solution of the Bellman equation.