RL【7-1】：Temporal-difference Learning

Series contents

Contents

Series contents
- [Fundamental Tools](#Fundamental Tools)
- Algorithm
- Method
Preface
[Stochastic Algorithms](#Stochastic Algorithms)
[TD Learning of State Values](#TD Learning of State Values)
Sarsa
- [Base Sarsa](#Base Sarsa)
- [Expected Sarsa](#Expected Sarsa)
- [n-step Sarsa](#n-step Sarsa)
Summary

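Preface

This series records my study notes on the Bilibili course 【强化学习的数学原理】 (Mathematical Foundations of Reinforcement Learning) taught by Prof. Shiyu Zhao (赵世钰). For the course itself, see:

Bilibili video: 【【强化学习的数学原理】课程：从零开始到透彻理解（完结）】

GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning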

Stochastic Algorithms

First: Simple mean estimation problem

Calculate $w = \mathbb{E}[X]$ based on some i.i.d. samples $\{x\}$ of $X$.

By writing $g(w) = w - \mathbb{E}[X]$, we can reformulate the problem as a root-finding problem: $g(w) = 0$.
Since we can only obtain samples $\{x\}$ of $X$, the noisy observation is

$$\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta.$$

Then, according to the Robbins-Monro (RM) algorithm, $g(w) = 0$ can be solved by

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k (w_k - x_k).$$

Problem background

We want to compute the expectation of a random variable $X$:

$$w = \mathbb{E}[X],$$

but $\mathbb{E}[X]$ is not directly available; we can only draw i.i.d. samples $\{x\}$ of $X$.

Converting to a root-finding problem

We write the problem as

$$g(w) = w - \mathbb{E}[X] = 0.$$

In other words, if we can find the $w$ that satisfies $g(w) = 0$, that solution is $\mathbb{E}[X]$.

Noisy observation

Because we only observe samples $x$, what we actually obtain is not $g(w)$ but a noisy observation of it:

$$\tilde{g}(w,\eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta,$$

where $\eta = \mathbb{E}[X] - x$ is the noise, whose expectation is $0$.

RM update rule

The Robbins-Monro algorithm updates $w$ iteratively so that it gradually converges to $\mathbb{E}[X]$. The update is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k),$$

and substituting the noisy observation above gives

$$w_{k+1} = w_k - \alpha_k (w_k - x_k).$$

Intuition

The current estimate is $w_k$.

We receive a new sample $x_k$.

The difference $w_k - x_k$ corrects the estimate: if $w_k$ is larger than the sample $x_k$, the update moves it down; otherwise it moves it up.

$\alpha_k$ is the step size. It usually decreases with the iteration count (for example $\alpha_k = 1/k$), which is what ensures convergence.
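The update $w_{k+1} = w_k - \alpha_k (w_k - x_k)$ can be sketched in a few lines of Python. This is a minimal sketch, not code from the course: the function name `rm_mean`, the Gaussian samples and the seed are assumptions made for illustration. With $\alpha_k = 1/k$ the iterate coincides with the running sample mean, which the last line makes visible.

```python
import numpy as np

def rm_mean(samples):
    """Robbins-Monro estimate of E[X] with step size alpha_k = 1/k."""
    w = 0.0  # initial guess; it is overwritten at k = 1 because alpha_1 = 1
    for k, x in enumerate(samples, start=1):
        alpha = 1.0 / k
        w = w - alpha * (w - x)  # w_{k+1} = w_k - alpha_k (w_k - x_k)
    return w

# Hypothetical samples of X, drawn from a Gaussian with mean 5
rng = np.random.default_rng(0)
xs = rng.normal(loc=5.0, scale=2.0, size=10_000)
w_hat = rm_mean(xs)
gap_to_sample_mean = abs(w_hat - xs.mean())  # tiny: only floating-point error
```

A constant step size would also work as a tracker, but the iterate would then keep fluctuating around $\mathbb{E}[X]$ instead of settling.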

Second: A more complex problem

Estimate the mean of a function $v(X)$: $w = \mathbb{E}[v(X)]$, based on some i.i.d. random samples $\{x\}$ of $X$.

To solve this problem, we define

$$g(w) = w - \mathbb{E}[v(X)],$$

$$\tilde{g}(w,\eta) = w - v(x) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta.$$

Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k [w_k - v(x_k)].$$

Problem background

We no longer estimate the mean of the random variable $X$ itself; instead we want the expectation of a function $v(X)$:

$$w = \mathbb{E}[v(X)],$$

where $v(\cdot)$ is a known function and $X$ is a random variable. We can draw i.i.d. samples $\{x\}$ of $X$, but $\mathbb{E}[v(X)]$ is not directly available.

Converting to a root-finding problem

As in the mean estimation problem, we rewrite the goal as a root-finding problem:

$$g(w) = w - \mathbb{E}[v(X)] = 0.$$

Noisy observation

We cannot compute $\mathbb{E}[v(X)]$ directly, but we can observe it through a sample $x$. We therefore define the noisy observation

$$\tilde{g}(w, \eta) = w - v(x).$$

Expanding it:

$$\tilde{g}(w, \eta) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta,$$

where $\eta = \mathbb{E}[v(X)] - v(x)$ is zero-mean noise.

RM update rule

The Robbins-Monro algorithm updates $w$ iteratively so that it converges to $\mathbb{E}[v(X)]$. The update is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$$

Substituting $\tilde{g}(w,\eta)$ gives

$$w_{k+1} = w_k - \alpha_k [w_k - v(x_k)].$$

Intuition

The current estimate is $w_k$.

We take a new sample $x_k$ and compute the function value $v(x_k)$.

The update moves $w_k$ a step toward $v(x_k)$.

After many iterations, $w_k$ converges to $\mathbb{E}[v(X)]$; with $\alpha_k = 1/k$ the iterate is exactly the running average of the observed values $v(x_1), \ldots, v(x_k)$.

Third: An even more complex problem

Calculate $w = \mathbb{E}[R + \gamma v(X)]$, where $R, X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function.

Suppose we can obtain samples $\{x\}$ and $\{r\}$ of $X$ and $R$. We define

$$g(w) = w - \mathbb{E}[R + \gamma v(X)],$$

$$\tilde{g}(w,\eta) = w - [r + \gamma v(x)] = (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]) \doteq g(w) + \eta.$$

Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k,\eta_k) = w_k - \alpha_k \big[w_k - (r_k + \gamma v(x_k))\big].$$

Problem background

We want to estimate the expectation

$$w = \mathbb{E}[R + \gamma v(X)],$$

where:

$R, X$ are random variables;

$\gamma$ is a constant;

$v(\cdot)$ is a function.

In other words, the goal is the expectation of the random reward $R$ plus the function value $v(X)$ weighted by $\gamma$.

Converting to a root-finding problem

Rewrite the problem as solving $g(w) = 0$ with

$$g(w) = w - \mathbb{E}[R + \gamma v(X)].$$

The solution is clearly $w^\star = \mathbb{E}[R + \gamma v(X)]$.

Noisy observation

Since $\mathbb{E}[R + \gamma v(X)]$ is not directly available, we can only observe it through a sample $(r, x)$:

$$\tilde{g}(w, \eta) = w - [r + \gamma v(x)].$$

Expanding it:

$$\tilde{g}(w, \eta) = (w - \mathbb{E}[R + \gamma v(X)]) + \big(\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]\big).$$

So we can write

$$\tilde{g}(w, \eta) \doteq g(w) + \eta,$$

where $\eta = \mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]$ is a zero-mean noise term.

RM iteration

The Robbins-Monro algorithm updates $w$ recursively so that it gradually approaches the solution:

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k).$$

Substituting the noisy observation gives

$$w_{k+1} = w_k - \alpha_k \Big(w_k - (r_k + \gamma v(x_k))\Big).$$

Intuition

The current estimate is $w_k$;

the sample $(r_k, x_k)$ gives the approximate target $r_k + \gamma v(x_k)$;

the update moves $w_k$ toward this approximate target;

as the number of iterations grows, $w_k$ converges to $\mathbb{E}[R + \gamma v(X)]$.
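The third problem already has the shape of a TD update, so a small sketch may help. It assumes a known function `v`, paired samples `rs` and `xs`, and the step size $\alpha_k = 1/k$; the function name and the toy numbers are illustrative, not taken from the course.

```python
def rm_expected_target(rs, xs, v, gamma):
    """Robbins-Monro estimate of E[R + gamma * v(X)] from paired samples."""
    w = 0.0
    for k, (r, x) in enumerate(zip(rs, xs), start=1):
        target = r + gamma * v(x)        # noisy sample of R + gamma * v(X)
        w = w - (1.0 / k) * (w - target)  # move w toward the sampled target
    return w

# Toy data (assumed): v(x) = x^2, two samples of (R, X)
w_hat = rm_expected_target(rs=[1.0, 2.0], xs=[0.0, 2.0],
                           v=lambda x: x * x, gamma=0.5)
# The two sampled targets are 1.0 and 4.0, so w_hat is their average, 2.5
```

Replacing the sample $x$ by the next state and $w$ by the value of the current state turns this iteration into the TD update discussed next.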

TD Learning of State Values

Algorithm description

The data/experience required by the algorithm:
- $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$ generated following the given policy $\pi$.
The TD learning algorithm is

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big], \quad (1)$$

$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t, \quad (2)$$
- where $t = 0,1,2,\ldots$. Here, $v_t(s_t)$ is the estimate of the state value $v_\pi(s_t)$; $\alpha_t(s_t)$ is the learning rate of $s_t$ at time $t$.
- At time $t$, only the value of the visited state $s_t$ is updated, whereas the values of the unvisited states $s \neq s_t$ remain unchanged.

Background

In reinforcement learning we want to estimate the state-value function of a policy $\pi$:

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi],$$

where $G_t$ is the discounted return collected from state $s$ onward.

However, we usually have no model of the environment (transition probabilities and reward distribution), so we cannot evaluate the Bellman equation directly; we can only update the estimate from sampled trajectory data.

Data required by the algorithm

The algorithm only needs trajectories sampled under the policy $\pi$:

Full trajectory: $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$, or

Set of triples: $\{(s_t, r_{t+1}, s_{t+1})\}_t$

This means TD learning can run online: a single step of experience (state, reward, next state) is enough to perform an update.

TD update rule

Update the visited state:

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \Big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \Big] \quad (1)$$

Leave the unvisited states unchanged:

$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \quad (2)$$

Meaning

The update moves the value estimate of the current state $s_t$ toward the TD Target:

$$\text{TD Target} = r_{t+1} + \gamma v_t(s_{t+1}),$$

that is, "one-step reward plus the discounted value of the next state".

The size of the update is governed by the TD Error, the bracket in (1):

$$\delta_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big).$$

It measures the gap between the "prediction" and the "actual one-step observation".

The learning rate $\alpha_t(s_t)$ determines the magnitude of the update.

If $\alpha$ is large, the estimate changes faster but is less stable.

If $\alpha$ is small, the estimate changes more slowly but is more stable.
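Equations (1) and (2) amount to changing one entry of a value table. The sketch below assumes a tabular setting with states indexed `0..n-1` and values stored in a NumPy array; `td0_update` and the numbers are hypothetical names and data chosen for illustration. The TD error follows the sign of the bracket in equation (1), current estimate minus target.

```python
import numpy as np

def td0_update(v, s, r, s_next, alpha, gamma):
    """Apply equations (1)-(2) in place: only v[s] changes."""
    td_target = r + gamma * v[s_next]   # r_{t+1} + gamma * v_t(s_{t+1})
    td_error = v[s] - td_target         # bracket of equation (1)
    v[s] = v[s] - alpha * td_error
    return td_target, td_error

v = np.zeros(3)
v[1] = 2.0
target, delta = td0_update(v, s=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
# target = 1.0 + 0.9 * 2.0 = 2.8, delta = 0.0 - 2.8 = -2.8, v[0] becomes 0.28
```

The entries `v[1]` and `v[2]` are untouched, which is exactly equation (2).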

Algorithm properties

The TD algorithm can be annotated as

$$v_{t+1}(s_t) = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\Big[\overbrace{v_t(s_t) - \big(\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar v_t}\big)}^{\text{TD error } \delta_t}\Big]. \quad (3)$$
- Here,

  $$\bar v_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$$
  - is called the TD Target, and

  $$\delta_t \doteq v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] = v_t(s_t) - \bar v_t$$
  - is called the TD error.
- It is clear that the new estimate $v_{t+1}(s_t)$ is a combination of the current estimate $v_t(s_t)$ and the TD error.
First, why is $\bar v_t$ called the TD Target?
- That is because the algorithm drives $v(s_t)$ towards $\bar v_t$.
- To see that,

  $$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$$

  $$\implies v_{t+1}(s_t) - \bar v_t = v_t(s_t) - \bar v_t - \alpha_t(s_t)[v_t(s_t) - \bar v_t]$$

  $$\implies v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)][v_t(s_t) - \bar v_t]$$

  $$\implies |v_{t+1}(s_t) - \bar v_t| = |1 - \alpha_t(s_t)|\,|v_t(s_t) - \bar v_t|$$
- Since $\alpha_t(s_t)$ is a small positive number, we have

  $$0 < 1 - \alpha_t(s_t) < 1$$
  - Therefore,

    $$|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|$$
  - which means $v(s_t)$ is driven towards $\bar v_t$!
Second, what is the interpretation of the TD error?

$$\delta_t = v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})]$$
- It is a difference between two consecutive time steps.
- It reflects the discrepancy between $v_t$ and $v_\pi$.

  To see that, denote

  $$\delta_{\pi,t} \doteq v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})]$$
- Note that

  $$\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] = 0.$$
  - If $v_t = v_\pi$, then $\delta_t$ should be zero (in the expectation sense).
  - Hence, if $\delta_t$ is not zero in expectation, then $v_t$ is not equal to $v_\pi$.
- The TD error can be interpreted as an innovation, which means new information obtained from the experience $(s_t, r_{t+1}, s_{t+1})$.
Other properties:
- The TD algorithm in (3) only estimates the state value of a given policy.
  - It does not estimate the action values.
  - It does not search for optimal policies.
- Later, we will see how to estimate action values and then search for optimal policies.
- Nonetheless, the TD algorithm in (3) is fundamental for understanding the core idea.

Explanation of TD Algorithm Properties

Recap of the TD update

The core TD update is

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big].$$

It has three parts:

Current estimate: $v_t(s_t)$

TD Target: $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$

TD Error: $\delta_t = v_t(s_t) - \bar v_t$

So the update is essentially

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\delta_t,$$

that is, the current estimate minus a fraction $\alpha_t(s_t)$ of its gap to the TD Target.

Why is $\bar v_t$ called the TD Target?

Intuition

The algorithm drives $v(s_t)$ toward $\bar v_t$.

Each update shrinks the gap between the two.

Derivation

$$v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)]\,[v_t(s_t) - \bar v_t].$$

Because the learning rate satisfies $0 < \alpha_t(s_t) < 1$, every iteration gives

$$|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|.$$

So the estimate moves toward the TD Target at every step.

Interpreting the TD Error

Definition:

$$\delta_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big).$$

Meaning:

It is a difference between two consecutive time steps;

it reflects the gap between the current estimate $v_t$ and the true value function $v_\pi$.

Key result:

$$\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = 0, \quad \text{where } \delta_{\pi,t} \doteq v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})].$$

If $v_t = v_\pi$, then $\delta_t$ is zero in expectation;

if $\delta_t$ is nonzero in expectation, the estimate $v_t$ deviates from the true value $v_\pi$.

Intuitively:

The TD Error can be read as an innovation, the new information obtained from each piece of experience.

The update uses the TD Error to correct the estimate step by step.
Further Explanation: TD Target & TD Error

TD Target $\bar v_t$

$$\bar v_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$$

Intuitive meaning:

It is a one-step estimate of the return, built from the immediate reward $r_{t+1}$ and the discounted value $v_t(s_{t+1})$ of the next state.

It can be seen as a one-step lookahead: start from the current state $s_t$, take one step, collect the immediate reward, and add the estimated value of the next state.

Analogy:

If $v(s_t)$ is our forecast of a house price, then $\bar v_t$ is the revised target based on "the latest transaction price plus the expected future trend".

Each update moves the forecast a little closer to this revised target.

Why is it a target?

Because in the Bellman equation

$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s],$$

the TD Target $\bar v_t$ is a sample-based approximation of the right-hand side.

So the TD Target is a one-sample realization of the Bellman equation.

TD Error $\delta_t$

$$\delta_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big) = v_t(s_t) - \bar v_t$$

Intuitive meaning:

It is the gap between the current estimate and the TD Target.

If the gap is zero, the prediction agrees with the target; otherwise its size and sign tell us how far and in which direction to update.

Interpretation 1: prediction error

$\delta_t$ asks: "how far is my prediction $v(s_t)$ from the prediction $\bar v_t$ revised with the reward actually observed?"

Interpretation 2: learning signal

$\delta_t$ is what drives the update:

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\,\delta_t$$

If $\delta_t > 0$, the prediction is too high and is lowered;

if $\delta_t < 0$, the prediction is too low and is raised.

Interpretation 3: innovation

In statistics, an innovation is the difference between new information and the existing prediction.

In TD learning, $\delta_t$ is the new information obtained from experience; it measures the difference between "what we just learned" and "what we believed".

How TD Target and TD Error relate

The TD Target provides the learning target;

the TD Error measures the gap between the current prediction and that target;

the update rule is

$$\text{New Estimate} = \text{Old Estimate} - \text{Learning Rate} \times \text{TD Error}.$$

That is:

TD Target = "where I should go"

TD Error = "how far I am from the target"

TD Update = "take a small step toward the target"
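A worked numeric instance may make the three quantities concrete. The numbers are made up for illustration: $v_t(s_t) = 5$, $r_{t+1} = 1$, $\gamma = 0.9$, $v_t(s_{t+1}) = 10$, $\alpha_t(s_t) = 0.1$, with the TD error taken as current estimate minus TD target.

```latex
\begin{aligned}
\bar v_t &= r_{t+1} + \gamma v_t(s_{t+1}) = 1 + 0.9 \times 10 = 10, \\
\delta_t &= v_t(s_t) - \bar v_t = 5 - 10 = -5, \\
v_{t+1}(s_t) &= v_t(s_t) - \alpha_t(s_t)\,\delta_t = 5 - 0.1 \times (-5) = 5.5, \\
|v_{t+1}(s_t) - \bar v_t| &= 4.5 = (1 - 0.1) \times |v_t(s_t) - \bar v_t|.
\end{aligned}
```

The negative TD error says the estimate was too low, and the gap to the target shrinks by exactly the factor $1 - \alpha_t(s_t)$.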

The idea of the algorithm

First, a new expression of the Bellman equation
- The definition of the state value of $\pi$ is

  $$v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S}, \quad (4)$$
- where $G$ is the discounted return. Since

  $$\mathbb{E}[G \mid S = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s,a) v_\pi(s') = \mathbb{E}[v_\pi(S') \mid S = s],$$
- where $S'$ is the next state, we can rewrite (4) as

  $$v_\pi(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid S = s], \quad s \in \mathcal{S}. \quad (5)$$
- Equation (5) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation, an important tool for designing and analyzing TD algorithms.
This shows that the value of the current state can be expressed as the expectation of the one-step reward plus the discounted value of the next state.
Second, solve the Bellman equation in (5) using the RM algorithm
- In particular, by defining

  $$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s],$$
- we can rewrite (5) as

  $$g(v(s)) = 0.$$
- Since we can only obtain the samples $r$ and $s'$ of $R$ and $S'$, the noisy observation we have is

  $$\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')] = \underbrace{\big(v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s]\big)}_{g(v(s))} + \underbrace{\big(\mathbb{E}[R + \gamma v_\pi(S') \mid s] - [r + \gamma v_\pi(s')]\big)}_{\eta}.$$
Solving the Bellman equation with the RM algorithm
- Formally, the Bellman equation is a fixed-point equation:

  $$v(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid s].$$
- We can rewrite it as a root-finding problem:

  $$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s] = 0.$$
- Why rewrite it as $g(v(s)) = 0$?
  - So that we can apply the stochastic approximation (Robbins-Monro, RM) algorithm.
  - RM is designed for solving root-finding problems when only noisy observations are available.
- What RM does
  - We cannot compute $\mathbb{E}[R + \gamma v_\pi(S') \mid s]$ directly and can only approximate it with a sample $(r, s')$. So we define

    $$\tilde g(v(s)) = v(s) - [r + \gamma v_\pi(s')].$$
  - The RM update is

    $$v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big).$$
  - What this step means:
    - The sample $(r_k, s'_k)$ is used to build the noisy observation $\tilde g$.
    - $v_k(s)$ is then updated repeatedly so that it gradually approaches the solution of the Bellman equation.
Therefore, the RM algorithm for solving $g(v(s)) = 0$ is

$$v_{k+1}(s) = v_k(s) - \alpha_k \tilde g(v_k(s)) = v_k(s) - \alpha_k \Big( v_k(s) - [r_k + \gamma v_\pi(s'_k)] \Big), \quad k = 1,2,3,\ldots \quad (6)$$
- where $v_k(s)$ is the estimate of $v_\pi(s)$ at the $k$th step; $r_k, s'_k$ are the samples of $R, S'$ obtained at the $k$th step.
- The RM algorithm in (6) has two assumptions that deserve special attention:
  - We must have the experience set $\{(s, r, s')\}$ for $k=1,2,3,\ldots$.
  - We assume that $v_\pi(s')$ is already known for any $s'$.
- To remove the two assumptions in the RM algorithm, we can modify it.
  - One modification is that $\{(s,r,s')\}$ is changed to $\{(s_t, r_{t+1}, s_{t+1})\}$ so that the algorithm can utilize the sequential samples in an episode.
  - Another modification is that $v_\pi(s')$ is replaced by an estimate of it because we don't know it in advance.
Connection to TD learning
- The RM algorithm has two limitations:
  1. It needs $v_\pi(s')$, which we do not actually know;
  2. It needs a set of samples $\{(s, r, s')\}$ for each state $s$, whereas in practice the data arrives as a sequence within an episode.
- To resolve them:
  - we replace the true value $v_\pi(s')$ with the current estimate $v(s')$;
  - we update step by step with the sequential samples $(s_t, r_{t+1}, s_{t+1})$.
- This yields the TD update:

  $$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big) \big].$$
  - Here:
    - TD Target: $\bar v_t = r_{t+1} + \gamma v_t(s_{t+1})$
    - TD Error: $\delta_t = v_t(s_t) - \bar v_t$
Intuition
- Every state $s$ visited along the trajectory is updated with the TD rule, and this is repeated until the estimates converge.
- "All states help one another and converge slowly together", because each update bootstraps from the estimate of the next state.
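Putting the pieces together, a tabular TD policy-evaluation loop might look like the sketch below. It is an illustration under assumptions, not the course's implementation: the policy is folded into a user-supplied `step(s)` function that returns `(reward, next_state)`, the learning rate is a constant, and the two-state cycle is a made-up deterministic toy chosen because its true values are easy to compute by hand ($v(0) = 4/3$, $v(1) = 2/3$ for $\gamma = 0.5$).

```python
import numpy as np

def td0_evaluate(step, n_states, s0, n_steps, alpha=0.1, gamma=0.9):
    """Estimate v_pi from one stream of (s_t, r_{t+1}, s_{t+1}) samples."""
    v = np.zeros(n_states)   # initial guess, needed because TD bootstraps
    s = s0
    for _ in range(n_steps):
        r, s_next = step(s)  # one step of experience under the policy
        v[s] -= alpha * (v[s] - (r + gamma * v[s_next]))  # TD update
        s = s_next
    return v

def two_state_cycle(s):
    # Hypothetical toy chain: 0 -> 1 with reward 1, 1 -> 0 with reward 0
    return (1.0, 1) if s == 0 else (0.0, 0)

v_hat = td0_evaluate(two_state_cycle, n_states=2, s0=0,
                     n_steps=5000, gamma=0.5)
# Bellman equations: v0 = 1 + 0.5 * v1 and v1 = 0.5 * v0
```

Neither state is ever given its true value directly; each one improves because the other one's estimate improves, which is the "states help one another" picture above.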

Algorithm

Algorithm convergence
- By the TD algorithm (1), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$ if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.
- Remarks:
  - This theorem says the state value can be found by the TD algorithm for a given policy $\pi$.
  - $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must be valid for all $s \in \mathcal{S}$. At time step $t$, if $s = s_t$, which means that $s$ is visited at time $t$, then $\alpha_t(s) > 0$; otherwise, $\alpha_t(s) = 0$ for all the other $s \ne s_t$. This requires every state to be visited an infinite (or sufficiently large) number of times.
  - The learning rate $\alpha$ is often selected as a small constant. In this case, the condition $\sum_t \alpha_t^2(s) < \infty$ no longer holds. When $\alpha$ is constant, it can still be shown that the algorithm converges in the sense of expectation.
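What the theorem says
- Under suitable conditions, TD learning converges to the true state-value function $v_\pi(s)$.
- The conditions are:
  - $\sum_t \alpha_t(s) = \infty$ (the learning rates must not shrink too fast, so that the total amount of updating is unbounded)
  - $\sum_t \alpha_t^2(s) < \infty$ (the learning rates must eventually become small, so that the updates settle down instead of oscillating)
Explanation
- The first condition ensures that the TD algorithm keeps absorbing new information and does not stop learning too early.
- The second condition ensures that the TD algorithm does not keep oscillating around the limit because of overly large updates.
Choosing the learning rate
- In practice, $\alpha_t(s)$ is usually set to a small constant (such as $0.1$).
- Strictly speaking this violates $\sum_t \alpha_t^2(s) < \infty$, but the estimate still converges, in the expectation sense, to a value close to $v_\pi(s)$.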
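To see what the two conditions allow, it may help to check three common schedules. This is a side calculation under one simplifying assumption: $k = 1, 2, \ldots$ counts only the visits to a fixed state $s$, so that $\alpha_k$ denotes the learning rate used at the $k$th visit.

```latex
\begin{aligned}
\alpha_k = \tfrac{1}{k}: &\quad \sum_{k=1}^{\infty} \tfrac{1}{k} = \infty, \quad \sum_{k=1}^{\infty} \tfrac{1}{k^2} = \tfrac{\pi^2}{6} < \infty
  && \text{(both conditions hold)} \\
\alpha_k = \tfrac{1}{k^2}: &\quad \sum_{k=1}^{\infty} \tfrac{1}{k^2} = \tfrac{\pi^2}{6} < \infty
  && \text{(first condition fails: learning stops too early)} \\
\alpha_k = \alpha > 0: &\quad \sum_{k=1}^{\infty} \alpha = \infty, \quad \sum_{k=1}^{\infty} \alpha^2 = \infty
  && \text{(second condition fails: noise never averages out)}
\end{aligned}
```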

Algorithm properties

| TD/Sarsa learning | MC learning |
| --- | --- |
| Online: TD learning is online. It can update the state/action values immediately after receiving a reward. | Offline: MC learning is offline. It has to wait until an episode has been completely collected. |
| Continuing tasks: Since TD learning is online, it can handle both episodic and continuing tasks. | Episodic tasks: Since MC learning is offline, it can only handle episodic tasks that have terminal states. |
| Bootstrapping: TD bootstraps because the update of a value relies on the previous estimate of this value. Hence, it requires initial guesses. | Non-bootstrapping: MC is not bootstrapping, because it can directly estimate state/action values without any initial guess. |
| Low estimation variance: TD has lower estimation variance than MC because there are fewer random variables. For instance, Sarsa requires $R_{t+1}, S_{t+1}, A_{t+1}$. | High estimation variance: To estimate $q_\pi(s_t, a_t)$, we need samples of $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$. Suppose the length of each episode is $L$. |

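TD (Temporal Difference) learning and MC (Monte Carlo) learning are both model-free methods, but they differ in significant ways.

Online vs offline

TD updates online: as soon as the reward $r_{t+1}$ is received at a step, $v(s_t)$ can be updated.

MC updates offline: it must wait until the whole episode has ended before it can compute the return and update.

Implication: TD suits real-time learning, while MC suits settings where complete trajectories are collected first.

Task type

TD: handles both episodic tasks (with terminal states) and continuing tasks (without terminal states).

MC: handles only episodic tasks, because it needs the complete return as the update target.

Implication: TD is more flexible and fits long-running systems (for example, an agent learning in an environment with an unbounded horizon).

Bootstrapping

TD bootstraps: the update relies on current estimates (such as $v(s_{t+1})$).

MC does not bootstrap: it updates directly with the complete return $G_t$.

Implication: TD can update sooner because it does not wait for a complete episode, but it depends on the initial values.

Estimation variance

TD: low variance, because the update depends only on one immediate reward and the estimate of the next state.

MC: high variance, because the return involves many random variables, which makes the estimate less stable (although it is unbiased).

Implication: TD learning usually converges faster and more stably than MC, but it may introduce bias, because bootstrapping relies on approximate estimates.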
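The difference between the two update targets can be shown on a single toy episode. The sketch below uses made-up rewards and a made-up next-state estimate `v_next`; the helper names are hypothetical.

```python
def mc_return(rewards, gamma):
    """Discounted return r_1 + gamma * r_2 + ...; needs the whole episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(r_next, v_next, gamma):
    """One-step target r_{t+1} + gamma * v(s_{t+1}); needs only one step."""
    return r_next + gamma * v_next

rewards = [1.0, 0.0, 2.0]  # hypothetical r_1, r_2, r_3 of one episode
g0 = mc_return(rewards, gamma=0.5)                 # 1 + 0.5*0 + 0.25*2 = 1.5
y0 = td_target(rewards[0], v_next=0.8, gamma=0.5)  # 1 + 0.5*0.8 = 1.4
```

The MC target depends on every reward in the episode (more randomness, no bias), while the TD target depends on one reward and on `v_next`, which is only an estimate (less randomness, possible bias).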

Sarsa

Base Sarsa

Sarsa algorithm

First, our aim is to estimate the action values of a given policy $\pi$.
Suppose we have some experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$.
We can use the following Sarsa algorithm to estimate the action values:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - [r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})] \Big],$$

$$q_{t+1}(s,a) = q_t(s,a), \quad \forall (s,a) \neq (s_t, a_t),$$
- where $t = 0,1,2,\ldots$
  - $q_t(s_t,a_t)$ is an estimate of $q_\pi(s_t, a_t)$
  - $\alpha_t(s_t,a_t)$ is the learning rate depending on $(s_t,a_t)$
The Sarsa update
- Current estimate: $q_t(s_t,a_t)$
- TD Target: $r_{t+1} + \gamma q_t(s_{t+1},a_{t+1})$
  - $r_{t+1}$: the immediate reward received after taking action $a_t$.
  - $\gamma q_t(s_{t+1}, a_{t+1})$: the discounted estimate of the long-term return obtained by continuing from $(s_{t+1}, a_{t+1})$.
- TD Error: $q_t(s_t,a_t) - \text{TD Target}$
At each update, Sarsa pulls $q(s_t,a_t)$ a little closer to the target "next reward plus the discounted estimate for the next state-action pair".

Intuition
- We have an old estimate $q_t(s_t,a_t)$ of how good $(s_t,a_t)$ is.
- We then observe one step of actual reward $r_{t+1}$, together with the estimate $q_t(s_{t+1},a_{t+1})$ for the next pair $(s_{t+1},a_{t+1})$.
- Together these give a new target (the TD Target).
- Sarsa does not replace the old estimate outright; it moves $q(s_t,a_t)$ gradually toward this target.
Relationship between Sarsa and TD
- Replace $v(s)$ in the TD algorithm with $q(s,a)$ and we obtain Sarsa.
- Sarsa is the action-value version of TD learning.
Mathematical Expression
- The Sarsa algorithm solves:

  $$q_\pi(s,a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid s,a], \quad \forall s,a.$$
- This is another expression of the Bellman equation expressed in terms of action values.
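A single Sarsa update touches one entry of a Q-table. The sketch below assumes a tabular setting with `q` stored as a NumPy array of shape `(n_states, n_actions)`; the function name and the numbers are hypothetical.

```python
import numpy as np

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa step on the experience (s, a, r, s_next, a_next)."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] -= alpha * (q[s, a] - td_target)  # only q[s, a] changes
    return td_target

q = np.zeros((2, 2))
q[1, 0] = 1.0
target = sarsa_update(q, s=0, a=1, r=0.5, s_next=1, a_next=0,
                      alpha=0.5, gamma=0.9)
# target = 0.5 + 0.9 * 1.0 = 1.4, so q[0, 1] moves from 0.0 to 0.7
```

The five arguments `s, a, r, s_next, a_next` are the quintuple that gives Sarsa its name.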

Theorem (Convergence of Sarsa learning)

By the Sarsa algorithm, $q_t(s,a)$ converges with probability $1$ to the action value $q_\pi(s,a)$ as $t \to \infty$, for all $(s,a)$, if $\sum_t \alpha_t(s,a) = \infty$ and $\sum_t \alpha_t^2(s,a) < \infty$.

Remarks:
- This theorem says the action value can be found by Sarsa for a given policy $\pi$.

Pseudocode: Policy searching by Sarsa

For each episode, do
- If the current $s_t$ is not the target state, do
  - Collect experience $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$:
    - Take action $a_t \sim \pi_t(s_t)$
    - Generate $r_{t+1}, s_{t+1}$
    - Take action $a_{t+1} \sim \pi_t(s_{t+1})$
    - Update q-value:

      $$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \big) \Big]$$
    - Update policy:

      $$\pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}|} (|\mathcal{A}| - 1), \quad \text{if } a = \arg\max_a q_{t+1}(s_t, a)$$

      $$\pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}|}, \quad \text{otherwise}$$

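The Sarsa pseudocode boils down to three steps:

Sample an interaction: $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$

Update the Q value: pull it toward "immediate reward plus the discounted Q value of the next step"

Update the policy: adjust the action probabilities according to the latest Q values

This gives an online reinforcement learning process that learns the Q values and improves the policy at the same time.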
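The pseudocode can be turned into a short runnable sketch. Several things here are assumptions for illustration rather than part of the course material: the environment is passed in as a function `step(s, a)` returning `(reward, next_state)`, the ε-greedy policy is recomputed from the current Q-table instead of being stored (which has the same effect as the explicit policy update), the learning rate is constant, and the 4-state corridor is a made-up toy task.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state."""
    n = len(q_row)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_row)] = 1.0 - epsilon / n * (n - 1)
    return probs

def sarsa(step, n_states, n_actions, start, target, episodes,
          alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = start
        a = rng.choice(n_actions, p=epsilon_greedy_probs(q[s], epsilon))
        while s != target:
            r, s_next = step(s, a)
            # a_{t+1} is sampled from the current policy before q is updated
            a_next = rng.choice(n_actions,
                                p=epsilon_greedy_probs(q[s_next], epsilon))
            q[s, a] -= alpha * (q[s, a] - (r + gamma * q[s_next, a_next]))
            s, a = s_next, a_next
    return q

def corridor(s, a):
    # Hypothetical corridor with states 0..3; action 0 = left, 1 = right
    s_next = min(max(s + (1 if a == 1 else -1), 0), 3)
    return (1.0 if s_next == 3 else -0.1), s_next

q = sarsa(corridor, n_states=4, n_actions=2, start=0, target=3, episodes=200)
```

Because the policy is always read off the latest `q`, each Q update immediately changes how the next action is chosen, which is the "learn Q and improve the policy together" behaviour described above.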

Remarks about Sarsa

The policy of $s_t$ is updated immediately after $q(s_t,a_t)$ is updated, which follows the idea of Generalized Policy Iteration (GPI).
The policy is $\epsilon$-greedy instead of greedy, which balances exploitation and exploration.

Core Idea vs Complication

Core idea: use an algorithm to solve the Bellman equation of a given policy.
Complication : emerges when we try to find optimal policies and work efficiently.

Expected Sarsa

Algorithm

q t + 1 ( s t , a t ) = q t ( s t , a t ) − α t ( s t , a t ) [ q t ( s t , a t ) − ( r t + 1 + γ E [ q t ( s t + 1 , A ) ] ) ] , q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]\big)\Big], qt+1(st,at)=qt(st,at)−αt(st,at)[qt(st,at)−(rt+1+γE[qt(st+1,A)])],

q t + 1 ( s , a ) = q t ( s , a ) , ∀ ( s , a ) ≠ ( s t , a t ) , q_{t+1}(s, a) = q_t(s, a), \quad \forall (s,a) \neq (s_t,a_t), qt+1(s,a)=qt(s,a),∀(s,a)=(st,at),

where

E [ q t ( s t + 1 , A ) ) ] = ∑ a π t ( a ∣ s t + 1 ) q t ( s t + 1 , a ) ≐ v t ( s t + 1 ) \mathbb{E}[q_t(s_{t+1}, A))] = \sum_a \pi_t(a \mid s_{t+1}) q_t(s_{t+1}, a) \doteq v_t(s_{t+1}) E[qt(st+1,A))]=∑aπt(a∣st+1)qt(st+1,a)≐vt(st+1)
is the expected value of q t ( s t + 1 , a ) q_t(s_{t+1}, a) qt(st+1,a) under policy π t \pi_t πt.

Compared to Sarsa:

The TD target is changed from $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ (as in Sarsa) to $r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]$ (as in Expected Sarsa).
Expected Sarsa needs more computation per update. In return, it reduces the estimation variance, because the random variables involved shrink from $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$ in Sarsa to $\{s_t, a_t, r_{t+1}, s_{t+1}\}$.
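The difference from Sarsa can be made concrete with a small sketch. The function name `expected_sarsa_step` and the nested-list tables `q[s][a]` and `pi[s][a]` are assumptions made for illustration; the only change relative to a Sarsa update is that the sampled $q_t(s_{t+1}, a_{t+1})$ is replaced by the policy-weighted average over all actions, so no $a_{t+1}$ is needed.

```python
def expected_sarsa_step(q, pi, s, a, r, s_next, alpha, gamma):
    """One Expected Sarsa update of q[s][a]; returns the TD target used."""
    # Expectation of q(s_next, A) under the current policy pi(.|s_next).
    expected_q = sum(p * qv for p, qv in zip(pi[s_next], q[s_next]))
    td_target = r + gamma * expected_q
    q[s][a] = q[s][a] - alpha * (q[s][a] - td_target)
    return td_target


# Tiny example with hand-picked numbers (2 states, 2 actions).
q = [[0.0, 0.0], [1.0, 3.0]]
pi = [[0.5, 0.5], [0.25, 0.75]]
target = expected_sarsa_step(q, pi, s=0, a=0, r=1.0, s_next=1,
                             alpha=0.5, gamma=0.5)
print(target)   # 1 + 0.5 * (0.25 * 1 + 0.75 * 3) = 2.25
print(q[0][0])  # 0 - 0.5 * (0 - 2.25) = 1.125
```

The extra cost is the sum over all actions at $s_{t+1}$, which is the "more computation" mentioned above.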

What does the algorithm do mathematically?

Expected Sarsa is a stochastic approximation algorithm for solving the following equation:

$$q_\pi(s,a) = \mathbb{E}\Big[ R_{t+1} + \gamma \mathbb{E}_{A_{t+1} \sim \pi(S_{t+1})}[q_\pi(S_{t+1}, A_{t+1})] \,\Big|\, S_t=s, A_t=a \Big], \quad \forall s,a.$$

The above equation is another expression of the Bellman equation:

$$q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a].$$

$n$-step Sarsa

Introduction

The definition of action value is

$$q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a].$$

The discounted return $G_t$ can be written in different forms as
- Sarsa
  
  $$G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}),$$
  
  $$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}),$$
  
  $$\vdots$$
- $n$-step Sarsa
  
  $$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}),$$
  
  $$\vdots$$
- MC
  
  $$G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

It should be noted that

$$G_t = G_t^{(1)} = G_t^{(2)} = G_t^{(n)} = G_t^{(\infty)},$$

where the superscripts merely indicate the different decomposition structures of $G_t$.

Algorithm analysis

Sarsa aims to solve

$$q_\pi(s,a) = \mathbb{E}[G_t^{(1)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s,a].$$

MC learning aims to solve

$$q_\pi(s,a) = \mathbb{E}[G_t^{(\infty)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s,a].$$

An intermediate algorithm called $n$-step Sarsa aims to solve

$$q_\pi(s,a) = \mathbb{E}[G_t^{(n)} \mid s,a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s,a].$$

The algorithm of $n$-step Sarsa is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big].$$

$n$-step Sarsa is more general because it becomes the (one-step) Sarsa algorithm when $n=1$ and the MC learning algorithm when $n=\infty$.
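To see how the $n$-step TD target is assembled, here is a small sketch. The helper names `n_step_target` and `n_step_sarsa_update` are hypothetical; `rewards` is assumed to hold $r_{t+1}, \ldots, r_{t+n}$ in order, and `q_boot` stands for the bootstrapped value $q_t(s_{t+n}, a_{t+n})$.

```python
def n_step_target(rewards, q_boot, gamma):
    """r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * q_boot."""
    n = len(rewards)
    discounted_rewards = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma ** n * q_boot


def n_step_sarsa_update(q_sa, rewards, q_boot, alpha, gamma):
    """Return the updated value of q(s_t, a_t) after one n-step Sarsa update."""
    return q_sa - alpha * (q_sa - n_step_target(rewards, q_boot, gamma))


# n = 1 recovers the ordinary Sarsa target r + gamma * q(s', a').
one_step = n_step_target([1.0], q_boot=4.0, gamma=0.5)
print(one_step)    # 1 + 0.5 * 4 = 3.0

# n = 3 uses three sampled rewards before bootstrapping.
three_step = n_step_target([1.0, 2.0, 4.0], q_boot=8.0, gamma=0.5)
print(three_step)  # 1 + 1 + 1 + 0.125 * 8 = 4.0
```

As the reward list grows, the weight $\gamma^n$ on the bootstrapped term shrinks, which is how the target moves from Sarsa toward the MC return.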

Properties

$n$-step Sarsa needs

$$(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n}).$$

Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we are not able to implement $n$-step Sarsa at step $t$. However, we can wait until time $t+n$ to update the q-value of $(s_t,a_t)$:

$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t,a_t)\Big[q_{t+n-1}(s_t,a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\Big].$$

Since $n$-step Sarsa includes Sarsa and MC learning as two extreme cases, its performance is a blend of Sarsa and MC learning:
- If $n$ is large, its performance is close to MC learning and hence has a large variance but a small bias.
- If $n$ is small, its performance is close to Sarsa and hence has a relatively large bias due to the initial guess and relatively low variance.

Finally, $n$-step Sarsa is also for policy evaluation. It can be combined with the policy improvement step to search for optimal policies.
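The "wait until time $t+n$" idea can be sketched with a sliding window over the most recent $n$ transitions. Everything here is an illustrative assumption rather than the course's implementation: the function name `n_step_sarsa_online`, the dictionary `q` keyed by `(state, action)`, and the trajectory format `(s, a, r)` where `r` is the reward received after taking `a` in `s`. For simplicity, the last $n$ pairs of the trajectory are never updated, since the source does not discuss how to handle the end of an episode.

```python
from collections import deque


def n_step_sarsa_online(trajectory, q, n, alpha, gamma):
    """Update q[(s_t, a_t)] once (s_{t+n}, a_{t+n}) has been observed."""
    window = deque()  # the most recent (s, a, r) tuples, at most n of them
    for s, a, r in trajectory:
        if len(window) == n:
            # (s, a) is now (s_{t+n}, a_{t+n}) for the oldest pair in the window.
            s0, a0, _ = window[0]
            target = sum(gamma ** k * rk for k, (_, _, rk) in enumerate(window))
            target += gamma ** n * q[(s, a)]
            q[(s0, a0)] -= alpha * (q[(s0, a0)] - target)
            window.popleft()
        window.append((s, a, r))
    return q


# Tiny example: a 3-step trajectory with n = 2.
trajectory = [("A", 0, 1.0), ("B", 0, 2.0), ("C", 0, 0.0)]
q = {("A", 0): 0.0, ("B", 0): 0.0, ("C", 0): 8.0}
n_step_sarsa_online(trajectory, q, n=2, alpha=0.5, gamma=0.5)
print(q[("A", 0)])  # target = 1 + 0.5*2 + 0.25*8 = 4.0, so q becomes 2.0
print(q[("B", 0)])  # 0.0, because (s_{t+2}, a_{t+2}) was never observed for B
```

The window makes the delay explicit: each pair is updated exactly $n$ steps after it was visited, using the q-table as it stands at that later time.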

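Summary

Sarsa is on-policy: it learns and evaluates the policy that is currently being executed. Q-learning is off-policy: it can sample with an arbitrary behavior policy, yet it still converges toward the optimal policy. TD methods are the unifying core framework: through the TD target and the TD error, the value estimates gradually approach the solution of the Bellman equation.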