Series Contents
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning
Contents
- Series Contents
- Preface
- Algorithm for state value estimation
  - Objective function
  - Optimization algorithms
  - Selection of function approximators
  - Theoretical analysis
- Sarsa & Q-learning with function approximation
  - Sarsa with function approximation
  - Q-learning with function approximation
  - Deep Q-learning
- Summary
Preface
This series records my study notes for Prof. Zhao Shiyu's course "Mathematical Foundations of Reinforcement Learning" on Bilibili. For the course itself, please refer to:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Algorithm for state value estimation
Objective function
Formal introduction
- Let $v_\pi(s)$ and $\hat v(s,w)$ be the true state value and a function for approximation.
- Our goal is to find an optimal $w$ so that $\hat v(s,w)$ can best approximate $v_\pi(s)$ for every $s$.
- This is a policy evaluation problem. Later we will extend it to policy improvement.
- To find the optimal $w$, we need two steps.
  - The first step is to define an objective function.
  - The second step is to derive algorithms optimizing the objective function.
Problem background: value estimation with function approximation
- In practical RL, the state space can be very large (or even continuous), so we cannot store a separate $v_\pi(s)$ for every state.
- We therefore use a function approximator (e.g., a linear function or a neural network) $\hat v(s,w)$ to approximate the true $v_\pi(s)$.
- Goal: find parameters $w$ such that $\hat v(s,w)$ is as close as possible to $v_\pi(s)$.
Objective function
$$J(w)=\mathbb{E}\left[(v_\pi(S)-\hat v(S,w))^2\right].$$
- Our goal is to find the best $w$ that can minimize $J(w)$.
- The expectation is with respect to the random variable $S\in\mathcal{S}$.
Several ways to define the probability distribution of $S$:
- The first way is to use a uniform distribution.
  - That is, treat all states as equally important by setting the probability of each state to $1/|\mathcal{S}|$.
  - In this case, the objective function becomes
    $$J(w) = \mathbb{E}[(v_\pi(S) - \hat v(S, w))^2] = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} (v_\pi(s) - \hat v(s, w))^2.$$
  - Drawback: the states may not be equally important. For example, some states may rarely be visited by a policy. Hence, this way does not reflect the real dynamics of the Markov process under the given policy.
- The second way is to use the stationary distribution.
  - The stationary distribution is an important concept that will be used frequently in this course. In short, it describes the long-run behavior of a Markov process.
  - Let $\{d_\pi(s)\}_{s \in \mathcal{S}}$ denote the stationary distribution of the Markov process under policy $\pi$. By definition, $d_\pi(s) \ge 0$ and $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$.
  - The objective function can be rewritten as
    $$J(w) = \mathbb{E}[(v_\pi(S) - \hat v(S, w))^2] = \sum_{s \in \mathcal{S}} d_\pi(s)(v_\pi(s) - \hat v(s, w))^2.$$
  - This function is a weighted squared error.
  - Since more frequently visited states have higher values of $d_\pi(s)$, their weights in the objective function are also higher than those of rarely visited states.
Choice of the state distribution: two options
Key question: **with respect to which distribution of $S$ is the expectation $\mathbb{E}$ taken?** This determines which states the trained approximator will be more accurate on.
- Uniform distribution
  Approach: assume all states are equally important and assign each one the same probability:
  $$P(S=s) = \frac{1}{|\mathcal{S}|}.$$
  The objective function becomes:
  $$J(w) = \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} \big(v_\pi(s) - \hat v(s,w)\big)^2.$$
  Pros:
  - Simple and intuitive; every state is treated equally.
  Cons:
  - Unrealistic: some states occur rarely in practice (e.g., rare scenes in a game), and forcing the approximator to fit them well wastes model capacity.
  - It ignores the real dynamics of the Markov process under policy $\pi$.
- Stationary distribution
  Approach: use the long-run probability $d_\pi(s)$ of visiting each state under policy $\pi$ (i.e., the stationary distribution).
  The objective function becomes:
  $$J(w) = \sum_{s \in \mathcal{S}} d_\pi(s)\,\big(v_\pi(s) - \hat v(s,w)\big)^2.$$
  Pros:
  - Closer to reality: when the agent actually runs, it visits some states frequently and others almost never.
  - Estimates are more accurate on these high-frequency states, which improves actual performance.
  Cons:
  - Low-frequency states may be fitted poorly, which matters when some rarely visited states are nevertheless important.
Stationary Distribution:
- Distribution: Distribution of the state
- Stationary: Long-run behavior
- Summary: after the agent runs a long time following a policy, the probability that the agent is at any state can be described by this distribution.
Basic concepts related to the stationary distribution (steady-state distribution)
1. Definition of the stationary distribution
In a Markov process or Markov decision process (MDP), the agent keeps transitioning between states over time.
The stationary distribution is the probability distribution over states after the agent has run for a sufficiently long time.
Mathematically, if $\{d_\pi(s)\}_{s \in \mathcal{S}}$ is the stationary distribution under policy $\pi$, then
$$d_\pi(s') = \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \pi(a|s) P(s'|s,a),$$
and
$$\sum_{s \in \mathcal{S}} d_\pi(s) = 1, \quad d_\pi(s) \ge 0.$$
- Related basic concepts
  - Distribution
    - Literally: the probability distribution of some variable.
    - Here it is the state distribution: the probability that the agent is in each state $s \in \mathcal{S}$.
  - Stationary
    - Refers to long-run stability.
    - As $t \to \infty$, the state distribution converges to a fixed value and no longer fluctuates over time.
    - In other words, the state distribution has converged to an equilibrium.
  - Steady-state distribution / limiting distribution
    - Synonyms: the stationary distribution is also called the steady-state distribution or the limiting distribution.
    - The emphasis is that it is a stable distribution in the long-run limit.
- Significance in reinforcement learning
  Value function approximation
  In value function approximation, the objective function is defined as an expectation:
  $$J(w) = \mathbb{E}_{s \sim d_\pi} \big[(v_\pi(s) - \hat v(s,w))^2\big],$$
  - where $d_\pi$ is the stationary distribution.
  This gives frequently visited states larger weights, which better matches the policy's actual behavior.
  Policy gradient
  In policy gradient methods, the performance objective is usually written as
  $$J(\pi) = \sum_s d_\pi(s) \sum_a \pi(a|s) q_\pi(s,a).$$
  Here $d_\pi(s)$ is the long-run probability of being in state $s$ under policy $\pi$.
  The stationary distribution is therefore a core ingredient of policy gradient methods.
  Intuition
  - After the agent has followed a policy $\pi$ for a long time:
    - frequently visited states have higher probability under $d_\pi(s)$;
    - rarely visited states have probability close to $0$.
  - So $d_\pi(s)$ reflects which states "really matter in practice" under that policy.
  - Summary:
    - The stationary distribution describes the long-run probability of visiting each state under policy $\pi$.
    - It is also called the steady-state distribution or limiting distribution.
    - It plays a key role in value function approximation and policy gradient methods, because it determines the weight of each state in the optimization.
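As a quick numerical illustration of the definition above, the stationary distribution can be obtained by iterating $d_{k+1} = d_k P_\pi$ until convergence. The following is a minimal sketch in Python; the 3-state transition matrix `P_pi` is a made-up example, not one from the course.

```python
import numpy as np

# Hypothetical state-transition matrix P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
# for a small 3-state Markov chain induced by some policy pi.
P_pi = np.array([
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
])

# Power iteration: d_{k+1} = d_k P_pi, starting from an arbitrary distribution.
d = np.ones(3) / 3
for _ in range(1000):
    d_next = d @ P_pi
    if np.max(np.abs(d_next - d)) < 1e-10:
        break
    d = d_next

print("stationary distribution d_pi:", d)   # satisfies d = d P_pi (up to tolerance)
```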
Optimization algorithms
Gradient Descent for Value Function Approximation
- While we have the objective function, the next step is to optimize it.
- To minimize the objective function $J(w)$, we can use the gradient-descent algorithm:
  $$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k)$$
- The true gradient is:
  $$\begin{aligned}
  \nabla_w J(w) &= \nabla_w \mathbb{E}[(v_\pi(S) - \hat v(S,w))^2] \\
  &= \mathbb{E}[\nabla_w (v_\pi(S) - \hat v(S,w))^2] \\
  &= 2\mathbb{E}[(v_\pi(S) - \hat v(S,w))(-\nabla_w \hat v(S,w))] \\
  &= -2\mathbb{E}[(v_\pi(S) - \hat v(S,w))\nabla_w \hat v(S,w)]
  \end{aligned}$$
- The true gradient above involves the calculation of an expectation.
Stochastic Gradient
- We can use the stochastic gradient to replace the true gradient:
  $$w_{t+1} = w_t + \alpha_t (v_\pi(s_t) - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t),$$
  where $s_t$ is a sample of $S$. Here, the factor $2$ has been merged into $\alpha_t$.
- This algorithm is not implementable because it requires the true state value $v_\pi$, which is the unknown to be estimated.
- We can replace $v_\pi(s_t)$ with an approximation so that the algorithm becomes implementable.
Monte Carlo and TD Learning with Function Approximation
- First, Monte Carlo learning with function approximation.
  Let $g_t$ be the discounted return starting from $s_t$ in the episode. Then $g_t$ can be used to approximate $v_\pi(s_t)$. The algorithm becomes:
  $$w_{t+1} = w_t + \alpha_t (g_t - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t).$$
- Second, TD learning with function approximation.
  In the spirit of TD learning, $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$ can be viewed as an approximation of $v_\pi(s_t)$. Then the algorithm becomes:
  $$w_{t+1} = w_t + \alpha_t [r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)] \nabla_w \hat v(s_t, w_t).$$
In-depth explanation
Why use gradient descent?
We have an objective function:
$$J(w)=\mathbb{E}[(v_\pi(S)-\hat v(S,w))^2].$$
The goal is to minimize the error between the true state value $v_\pi(S)$ and the approximation $\hat v(S,w)$.
This is essentially a regression problem: fit a function $\hat v$ that approximates the true $v_\pi$.
We can therefore use the most common optimization method, gradient descent:
$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k).$$
- That is, each step updates the parameter $w$ so that $J(w)$ gradually decreases.
What the true gradient means
Expanding with the chain rule:
$$\nabla_w J(w) = -2\mathbb{E}[(v_\pi(S) - \hat v(S,w)) \nabla_w \hat v(S,w)].$$
Interpretation:
- The error term $(v_\pi(S) - \hat v(S,w))$ measures the gap between the prediction and the true value.
- Multiplying by $\nabla_w \hat v(S,w)$ tells us how to adjust $w$ to shrink that gap.
- The sign works out so that if the prediction is smaller than the true value, gradient descent increases $\hat v$; otherwise it decreases $\hat v$.
This is exactly the same as standard supervised regression.
Why use a stochastic gradient?
The problem is:
- the true gradient $\nabla_w J(w)$ involves an expectation over all states $S$;
- this is usually infeasible, because the state space is huge and $v_\pi(S)$ is unknown.
We therefore use SGD (stochastic gradient descent) instead:
$$w_{t+1} = w_t + \alpha_t (v_\pi(s_t) - \hat v(s_t,w_t)) \nabla_w \hat v(s_t,w_t),$$
- where $s_t$ is a sampled state.
- Benefit: a single sample suffices for an update, so the cost is low.
- Problem: it still requires $v_\pi(s_t)$, which is exactly the unknown we want to estimate.
How do we replace $v_\pi(s_t)$?
Since $v_\pi(s_t)$ cannot be obtained directly, we need quantities that approximate it:
- Monte Carlo learning with function approximation
  Use the full episode return $g_t$ as an unbiased estimate of $v_\pi(s_t)$.
  Update rule:
  $$w_{t+1} = w_t + \alpha_t (g_t - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t).$$
  Intuition:
  - $g_t$ is the cumulative (discounted) reward obtained by running from $s_t$ to the end of the episode.
  - Substitute $g_t$ for $v_\pi(s_t)$, then take a gradient step.
  - Drawback: we must wait until the whole trajectory ends before updating, and the variance is large.
- TD learning with function approximation
  Use the TD target $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$ to approximate $v_\pi(s_t)$.
  Update rule:
  $$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)\big] \nabla_w \hat v(s_t, w_t).$$
  Intuition:
  - No need to wait for the whole trajectory; only the one-step reward $r_{t+1}$ plus the prediction at the next state is needed.
  - This is bootstrapping: existing estimates are used to assist the update.
  - Pros: online learning, fast updates, low variance; cons: it may introduce bias.
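For contrast with the TD pseudocode that follows, here is a minimal sketch of the Monte Carlo update described above, assuming linear features. The environment interface (`env.reset()`, `env.step(a)` returning `(state, reward, done)`), the `policy` callable, and the feature map `phi` are placeholders for this sketch, not code from the course.

```python
import numpy as np

# Monte Carlo learning with linear function approximation: the full discounted
# return g_t replaces v_pi(s_t) in the stochastic-gradient update.
def mc_value_approx(env, policy, phi, num_features, gamma=0.9, alpha=0.01,
                    num_episodes=500):
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        # Generate one full episode following pi.
        states, rewards = [], []
        s, done = env.reset(), False
        while not done:
            states.append(s)
            s, r, done = env.step(policy(s))
            rewards.append(r)
        # Walk backwards: g_t = r_{t+1} + gamma * g_{t+1}.
        g = 0.0
        for s_t, r_t1 in zip(reversed(states), reversed(rewards)):
            g = r_t1 + gamma * g
            w += alpha * (g - phi(s_t) @ w) * phi(s_t)   # SGD step toward g_t
    return w
```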
Pseudocode: TD learning with function approximation
- Initialization: a function $\hat v(s,w)$ that is differentiable in $w$; an initial parameter $w_0$.
- Aim: approximate the true state values of a given policy $\pi$.
- For each episode generated following the policy $\pi$, do
  - For each step $(s_t, r_{t+1}, s_{t+1})$, do
    - In the general case,
      $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$$
    - In the linear case,
      $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t)$$
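A minimal sketch of this pseudocode for the linear case, under the same assumed environment interface and feature map as the Monte Carlo sketch above:

```python
import numpy as np

# Semi-gradient TD(0) with linear function approximation, v_hat(s, w) = phi(s)^T w.
# The gradient of v_hat with respect to w is simply phi(s).
def semi_gradient_td0(env, policy, phi, num_features, gamma=0.9, alpha=0.01,
                      num_episodes=500):
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            v_s = phi(s) @ w
            v_next = 0.0 if done else phi(s_next) @ w     # no bootstrap at terminal states
            td_error = r + gamma * v_next - v_s           # TD target minus prediction
            w += alpha * td_error * phi(s)
            s = s_next
    return w
```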
Selection of function approximators
- Function selection
  - The first approach, which was widely used before, is to use a linear function
    $$\hat v(s, w) = \phi^T(s) w$$
    Here, $\phi(s)$ is the feature vector, which can be a polynomial basis, a Fourier basis, etc.
  - The second approach, which is widely used nowadays, is to use a neural network as a nonlinear function approximator. The input of the NN is the state, the output is $\hat v(s,w)$, and the network parameter is $w$.
- TD-Linear
  - In the linear case where $\hat v(s, w) = \phi^T(s) w$, we have
    $$\nabla_w \hat v(s, w) = \phi(s).$$
  - Substituting the gradient into the TD algorithm
    $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$$
  - yields
    $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t),$$
  - which is the algorithm of TD learning with linear function approximation (TD-Linear).
- Disadvantages and advantages of linear function approximation
  - Disadvantages:
    - It is difficult to select appropriate feature vectors.
  - Advantages:
    - The theoretical properties of the TD algorithm in the linear case are much better understood than in the nonlinear case.
    - Linear function approximation is still powerful in the sense that the tabular representation is merely a special case of linear function approximation.
- Tabular representation as a special case of linear function approximation
  We next show that the tabular representation is a special case of linear function approximation.
  - First, consider the special feature vector for state $s$:
    $$\phi(s) = e_s \in \mathbb{R}^{|\mathcal{S}|},$$
    where $e_s$ is the vector whose $s$-th entry is $1$ and all other entries are $0$.
  - In this case,
    $$\hat v(s, w) = e_s^T w = w(s),$$
    where $w(s)$ is the $s$-th entry of $w$.
- Connection with tabular TD
  - Recall that the TD-Linear algorithm is
    $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t).$$
  - When $\phi(s_t) = e_{s_t}$, the above algorithm becomes
    $$w_{t+1} = w_t + \alpha_t \big( r_{t+1} + \gamma w_t(s_{t+1}) - w_t(s_t) \big) e_{s_t}.$$
    This is a vector equation that merely updates the $s_t$-th entry of $w_t$.
  - Multiplying both sides by $e_{s_t}^T$ gives
    $$w_{t+1}(s_t) = w_t(s_t) + \alpha_t \big( r_{t+1} + \gamma w_t(s_{t+1}) - w_t(s_t) \big),$$
    which is exactly the tabular TD algorithm.
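A tiny numerical check of this equivalence (toy numbers, not from the notes): with one-hot features the TD-Linear vector update changes exactly one entry of $w$, which is the tabular TD(0) update.

```python
import numpy as np

num_states, gamma, alpha = 4, 0.9, 0.1
w = np.array([0.0, 1.0, 2.0, 3.0])      # current parameter vector, w(s) = v_hat(s)
s_t, s_next, r = 1, 2, 0.5              # one observed transition (s_t, r_{t+1}, s_{t+1})
e = np.eye(num_states)                  # e[s] is the one-hot feature phi(s)

td_error = r + gamma * (e[s_next] @ w) - (e[s_t] @ w)
w_new = w + alpha * td_error * e[s_t]   # vector update from TD-Linear

print(w_new - w)   # nonzero only at index s_t: exactly the tabular TD(0) update
```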
Choosing an approximator
Linear or neural network?
Linear approximation: $\hat v(s,w)=\phi(s)^\top w$
You map the state to hand-designed features $\phi(s)$ (polynomial, Fourier, tile coding, one-hot, ...) and then learn a weight vector $w$.
- When to prefer it:
  - the state space is small or can be represented well by features;
  - you want interpretability and convergence guarantees (especially in the on-policy case);
  - compute or data is limited and you need a robust, low-variance learner.
Nonlinear approximation (NN): learn $\hat v(s,w)=\text{NN}(s;w)$ directly.
- When to prefer it:
  - the raw state is high-dimensional or nonlinear (images, text, complex sensors);
  - the goal is end-to-end deep RL;
  - you have enough data and compute, and can accept the tuning cost that comes with occasionally unstable training.
The essence of TD-Linear: semi-gradient + projected fixed point
In the linear case, $\nabla_w \hat v(s,w)=\phi(s)$. Substituting into the TD(0) update:
$$w_{t+1}=w_t+\alpha_t\Big[r_{t+1}+\gamma\,\hat v(s_{t+1},w_t)-\hat v(s_t,w_t)\Big]\,\phi(s_t).$$
This is a semi-gradient method: the TD target $r_{t+1}+\gamma \hat v(s_{t+1},w_t)$ is treated as a constant when differentiating with respect to $w$ (no gradient flows into $\hat v$ at the next state). Differentiating through the target instead would give a "full-gradient" algorithm, which in practice tends to be less stable.
From a geometric viewpoint, TD-Linear solves the projected Bellman equation
$$\Phi w \approx \Pi_{d_\pi}\, \mathcal T_\pi(\Phi w),$$
- where the column space of $\Phi$ is the function class spanned by the features, and $\Pi_{d_\pi}$ is the least-squares projection weighted by the stationary distribution $d_\pi$.
- Meaning: the environment pushes you toward $\mathcal T_\pi v$, but you can only stay inside the "representable" subspace, so the update is projected back into it.
Convergence (on-policy, linear, suitable step sizes): $\mathcal T_\pi$ is a contraction mapping, and under fairly general conditions semi-gradient TD(0) converges to the projected fixed point above.
Why is the tabular case a special case of linear approximation?
Take one-hot features $\phi(s)=e_s$. Then
$$\hat v(s,w)=e_s^\top w = w(s),$$
- i.e., one parameter per state.
Substituting back into TD-Linear:
$$w_{t+1}=w_t+\alpha_t\big(r_{t+1}+\gamma w_t(s_{t+1})-w_t(s_t)\big)e_{s_t}.$$
Only the $s_t$-th entry is updated; multiplying on the left by $e_{s_t}^\top$ gives
$$w_{t+1}(s_t)=w_t(s_t)+\alpha_t\big(r_{t+1}+\gamma w_t(s_{t+1})-w_t(s_t)\big),$$
which is exactly tabular TD(0).
Conclusion: tabular = linear approximation + one-hot features.
Relation to the objective function $J(w)$
In the linear on-policy setting, semi-gradient TD does not directly minimize
$$J(w)=\mathbb E_{S\sim d_\pi}\big[(v_\pi(S)-\hat v(S,w))^2\big],$$
- but instead approaches the projected Bellman solution. The two generally differ, yet on many problems this solution is both computable and effective.
If you really want to minimize $J(w)$ itself, you need access to $v_\pi$ or must approximate it with MC returns, which brings you back to the MC-with-function-approximation update
$$w_{t+1}=w_t+\alpha_t\,(g_t-\hat v(s_t,w_t))\,\phi(s_t),$$
- which has larger variance but targets the right objective.
Theoretical analysis
- The algorithm
  $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$$
  does not minimize the following objective function:
  $$J(w) = \mathbb{E}\big[ ( v_\pi(S) - \hat v(S, w) )^2 \big]$$
- Different objective functions
  - Objective function 1: true value error
    $$J_E(w) = \mathbb{E}\big[ ( v_\pi(S) - \hat v(S, w) )^2 \big] = \| \hat v(w) - v_\pi \|^2_D$$
  - Objective function 2: Bellman error
    $$J_{BE}(w) = \| \hat v(w) - (r_\pi + \gamma P_\pi \hat v(w)) \|^2_D \doteq \| \hat v(w) - T_\pi(\hat v(w)) \|^2_D$$
    where
    $$T_\pi(x) \doteq r_\pi + \gamma P_\pi x$$
  - Objective function 3: projected Bellman error
    $$J_{PBE}(w) = \| \hat v(w) - M T_\pi(\hat v(w)) \|^2_D$$
    where $M$ is a projection matrix.
Differences between the objectives
True value error
$$J_E(w) = \mathbb{E}\big[(v_\pi(S) - \hat v(S, w))^2\big] = \| \hat v(w) - v_\pi \|^2_D$$
- Meaning: directly minimize the gap between the approximate value function $\hat v(s,w)$ and the true value function $v_\pi(s)$.
- Ideal target: this is the most natural, most intuitive objective (analogous to supervised learning).
- Problem: we do not know $v_\pi(s)$; we can only approximate it indirectly via samples and the Bellman equation, so this objective cannot be minimized directly.
Bellman error
$$J_{BE}(w) = \| \hat v(w) - T_\pi(\hat v(w)) \|^2_D$$
where
$$T_\pi(x) = r_\pi + \gamma P_\pi x$$
Meaning: measures how much $\hat v(s,w)$ violates the Bellman equation.
- The fixed point of the Bellman equation is $v_\pi$.
- If $\hat v(w)$ lies in a "perfect" function space, minimizing the Bellman error recovers the true value function.
- Problem: when the function approximator (e.g., a linear function or a neural network) cannot represent $v_\pi$ exactly, directly minimizing the Bellman error may produce a poor or even divergent solution.
Projected Bellman error
$$J_{PBE}(w) = \| \hat v(w) - M T_\pi(\hat v(w)) \|^2_D$$
- Here $M$ is a projection matrix that projects the Bellman update $T_\pi(\hat v)$ back onto the approximating function space.
- Meaning: since the approximating space (e.g., the space of linear functions) is limited, we cannot guarantee that $\hat v(w)$ satisfies the Bellman equation exactly. Instead, we require the projected Bellman update to be as close as possible to $\hat v(w)$.
- Essence: find the approximate solution within the function space that is closest to the Bellman fixed point.
- Importance: this is the objective that TD-Linear actually optimizes. The TD update implicitly performs this projection, so the convergence point minimizes the projected Bellman error rather than the true value error.
Why does TD-Linear correspond to the projected Bellman error?
The TD update is
$$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big]\nabla_w \hat v(s_t, w_t)$$
It uses the TD target $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$, which effectively projects the Bellman update $T_\pi(\hat v)$ back onto the approximating function space.
Hence TD does not directly minimize $J_E(w)$ or $J_{BE}(w)$; it minimizes the projected Bellman error.
Summary:
- True value error: the ideal objective, but it cannot be computed directly.
- Bellman error: measures inconsistency with the Bellman equation, but can be unstable under function approximation.
- Projected Bellman error: the objective TD actually minimizes; it yields the most reasonable solution within the approximating space and guarantees convergence.
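For linear features, the projected Bellman solution has a known closed form (not derived in these notes): $w^*$ solves $\Phi^\top D (I - \gamma P_\pi)\Phi\, w = \Phi^\top D\, r_\pi$ with $D = \mathrm{diag}(d_\pi)$, while the minimizer of the true value error is the $D$-weighted least-squares fit of $v_\pi$. The sketch below compares the two on a made-up 3-state chain; the transition matrix, rewards, and features are illustrative assumptions.

```python
import numpy as np

# Hypothetical 3-state chain under some policy pi (not from the course notes).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.2, 0.5, 0.3],
                 [0.4, 0.4, 0.2]])
r_pi = np.array([1.0, 0.0, -1.0])      # expected immediate reward per state
gamma = 0.9

# True state values: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Stationary distribution d_pi (left eigenvector of P_pi for eigenvalue 1).
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d_pi = d_pi / d_pi.sum()
D = np.diag(d_pi)

# A two-dimensional linear feature map (fewer features than states).
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])

# TD fixed point: Phi^T D (I - gamma P_pi) Phi w = Phi^T D r_pi.
A = Phi.T @ D @ (np.eye(3) - gamma * P_pi) @ Phi
b = Phi.T @ D @ r_pi
w_td = np.linalg.solve(A, b)

# Minimizer of the true value error J_E(w) = ||Phi w - v_pi||_D^2.
w_je = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ v_pi)

print("v_pi           :", v_pi)
print("TD fixed point :", Phi @ w_td)   # projected Bellman solution
print("min J_E(w)     :", Phi @ w_je)   # best fit of v_pi in the feature space
```

Running it shows the two solutions generally differ, which is exactly the gap between $J_{PBE}$ and $J_E$ discussed above.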
Sarsa & Q-learning with function approximation
Sarsa with function approximation
Sarsa algorithm
- So far, we have only considered the problem of state value estimation. That is, we hope
  $$\hat v \approx v_\pi$$
- To search for optimal policies, we need to estimate action values.
- The Sarsa algorithm with value function approximation is
  $$w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)\Big] \nabla_w \hat q(s_t, a_t, w_t).$$
- This is the same as the algorithm introduced previously in this lecture except that $\hat v$ is replaced by $\hat q$.
Pseudocode: Sarsa with function approximation
- Aim: search for a policy that can lead the agent to the target from an initial state-action pair $(s_0, a_0)$.
- For each episode, do
  - If the current $s_t$ is not the target state, do
    - Take action $a_t$ following $\pi_t(s_t)$, generate $r_{t+1}, s_{t+1}$, and then take action $a_{t+1}$ following $\pi_t(s_{t+1})$.
    - Value update (parameter update):
      $$w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)\Big]\nabla_w \hat q(s_t, a_t, w_t)$$
    - Policy update ($\varepsilon$-greedy):
      $$\pi_{t+1}(a|s_t) = 1 - \frac{\varepsilon}{|\mathcal{A}(s_t)|} (|\mathcal{A}(s_t)| - 1) \quad \text{if } a = \arg\max_{a' \in \mathcal{A}(s_t)} \hat q(s_t, a', w_{t+1})$$
      $$\pi_{t+1}(a|s_t) = \frac{\varepsilon}{|\mathcal{A}(s_t)|} \quad \text{otherwise}$$
Sarsa with function approximation
Formula:
$$w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t).$$
Meaning:
Here $\hat q(s, a, w)$ is the approximate action-value function, represented by parameters $w$ (e.g., a linear function or a neural network).
The TD target uses the next action $a_{t+1}$ that is actually executed:
$$r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t)$$
This means Sarsa is an on-policy algorithm:
- the behavior policy $\pi$ is used both to generate data (choosing $a_t$, $a_{t+1}$),
- and to update the value function.
The update direction is determined by the TD error:
$$\delta_t = r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)$$
- and the parameters of $\hat q(s_t, a_t, w_t)$ are then adjusted along the gradient.
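A minimal sketch of Sarsa with linear function approximation and an $\varepsilon$-greedy behavior policy. The feature map `phi(s, a)`, the action count `num_actions`, and the environment interface are assumptions of this sketch rather than the course's implementation.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def sarsa_fa(env, phi, num_features, num_actions, gamma=0.9, alpha=0.01,
             epsilon=0.1, num_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(num_features)
    q = lambda s, a: phi(s, a) @ w                      # \hat q(s, a, w)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy([q(s, b) for b in range(num_actions)], epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                              # no bootstrap at terminal states
            else:
                a_next = epsilon_greedy(
                    [q(s_next, b) for b in range(num_actions)], epsilon, rng)
                target = r + gamma * q(s_next, a_next)  # uses the action actually taken
            w += alpha * (target - q(s, a)) * phi(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```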
Q-learning with function approximation
Q-learning algorithm
- Similar to Sarsa, tabular Q-learning can also be extended to the case of value function approximation.
- The q-value update rule is
  $$w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t),$$
- which is the same as Sarsa except that $\hat q(s_{t+1}, a_{t+1}, w_t)$ is replaced by $\max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t)$.
Pseudocode: Q-learning with function approximation (on-policy version)
- Initialization: initial parameter vector $w_0$; initial policy $\pi_0$; a small $\varepsilon > 0$.
- Aim: search for a good policy that can lead the agent to the target from an initial state-action pair $(s_0, a_0)$.
- For each episode, do
  - If the current $s_t$ is not the target state, do
    - Take action $a_t$ following $\pi_t(s_t)$, and generate $r_{t+1}, s_{t+1}$.
    - Value update (parameter update):
      $$w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t)$$
    - Policy update ($\varepsilon$-greedy):
      $$\pi_{t+1}(a|s_t) = 1 - \frac{\varepsilon}{|\mathcal{A}(s_t)|} (|\mathcal{A}(s_t)| - 1) \quad \text{if } a = \arg\max_{a' \in \mathcal{A}(s_t)} \hat q(s_t, a', w_{t+1})$$
      $$\pi_{t+1}(a|s_t) = \frac{\varepsilon}{|\mathcal{A}(s_t)|} \quad \text{otherwise}$$
Q-learning with function approximation
Formula:
$$w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t).$$
Meaning:
As before, $\hat q(s, a, w)$ is a parameterized Q function.
The TD target uses the maximum over all possible actions at the next state:
$$r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t)$$
This means Q-learning is an off-policy algorithm:
- the behavior policy can be exploratory, e.g., $\epsilon$-greedy;
- but the update assumes the agent always chooses the best action, because of the $\max$.
The update is still based on the TD error:
$$\delta_t = r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t)$$
- followed by a parameter update.
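Relative to the Sarsa sketch above, only the TD target changes: the greedy (max) bootstrap replaces the value of the action actually taken. A hedged sketch of just that piece, reusing the same assumed `phi` and `num_actions`:

```python
# Q-learning with linear function approximation: the only difference from the
# Sarsa sketch above is the TD target, which bootstraps from the greedy action.
def q_learning_target(r, s_next, done, phi, w, num_actions, gamma=0.9):
    if done:
        return r
    q_next = [phi(s_next, b) @ w for b in range(num_actions)]
    return r + gamma * max(q_next)        # max over a in A(s_{t+1})

# Inside the learning loop the parameter update itself is unchanged:
#   w += alpha * (q_learning_target(...) - phi(s, a) @ w) * phi(s, a)
```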
Deep Q-learning
Objective function
- Definition
  - Deep Q-learning aims to minimize the objective (loss) function
    $$J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right],$$
    where $(S, A, R, S')$ are random variables.
  - This is actually the Bellman optimality error.
  - That is because
    $$q(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a \Big], \quad \forall s, a$$
  - The value of
    $$R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)$$
    should be zero in the expectation sense.
- How to minimize the objective function? Gradient descent!
  - In this objective function
    $$J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right],$$
    the parameter $w$ appears not only in $\hat{q}(S, A, w)$ but also in
    $$y \doteq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w).$$
  - For the sake of simplicity, we can assume that the $w$ in $y$ is fixed (at least for a while) when we calculate the gradient.
The relation between the deep Q-learning objective and the Bellman optimality error:
The deep Q-learning objective is
$$J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right]$$
- where:
  - $R$: the reward received after taking action $A$ in the current state;
  - $S'$: the next state;
  - $\hat{q}(S, A, w)$: the $Q$-value approximation produced by a neural network with parameters $w$;
  - $\max_{a \in \mathcal{A}(S')}$: the $Q$-value of the best action at the next state.
- This objective is a mean squared error (MSE): it measures the gap between the network output $\hat{q}(S, A, w)$ and the target value $R + \gamma \max_{a} \hat{q}(S', a, w)$.
Why does it correspond to the Bellman optimality error?
The Bellman optimality equation defines the recursion satisfied by the optimal $Q$-values:
$$q(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a' \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big]$$
In other words, if $\hat{q}$ were optimal, then
$$R + \gamma \max_{a} \hat{q}(S', a, w) - \hat{q}(S, A, w) = 0$$
- would hold in the expectation sense.
In practice, however, $\hat{q}$ is an approximation and cannot satisfy the Bellman equation exactly.
- We therefore define this residual as the Bellman optimality error and minimize it to approach the optimal Q-values.
Why is the optimization tricky?
In the loss function
$$J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right]$$
- the parameter $w$ appears both in the current prediction $\hat{q}(S, A, w)$ and in the target value $R + \gamma \max_{a} \hat{q}(S', a, w)$.
- This makes the gradient computation awkward, because we would have to differentiate through both the prediction and the target.
To simplify the computation, DQN uses a fixed target network:
- the parameters $w$ inside the target $R + \gamma \max_{a} \hat{q}(S', a, w)$ are held fixed for a while;
- only the current Q network's parameters are updated.
This avoids the complication of propagating gradients through the target.
Two networks
- Introduction
  - One is a main network representing $\hat q(s,a,w)$.
  - The other is a target network $\hat q(s,a,w_T)$.
  - The objective function in this case degenerates to
    $$J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)^2\Big],$$
    where $w_T$ is the target network parameter.
- Gradient with a fixed target network
  - When $w_T$ is fixed, the gradient of $J$ can be easily obtained as
    $$\nabla_w J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)\nabla_w \hat q(S,A,w)\Big].$$
  - The basic idea of deep Q-learning is to use the gradient-descent algorithm to minimize the objective function.
In DQN, if a single network $\hat q(s,a,w)$ is used both to estimate the $Q$-values and to update the parameters, training becomes unstable: the target value (TD target) and the prediction depend on the same network, so parameter updates interfere with each other.
Solution:
- Introduce two networks:
  - Main network: $\hat q(s,a,w)$, used for learning and parameter updates.
  - Target network: $\hat q(s,a,w_T)$, used to produce relatively stable target values.
  - The parameters $w_T$ are synchronized from $w$ periodically (e.g., copied every $C$ steps).
- The target value then does not change at every step as $w$ is updated, which reduces training oscillation.
Objective function:
$$J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)^2\Big]$$
- Current Q-value estimate: $\hat q(S,A,w)$
- Target Q-value (TD target): $R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T)$
Gradient-descent update:
$$\nabla_w J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)\nabla_w \hat q(S,A,w)\Big]$$
- With $w_T$ fixed, the gradient computation is clean and is not perturbed by the target changing at the same time.
Summary
- The two networks (main and target) address the problem of unstable target values.
Two techniques
- First technique: two networks, a main network and a target network.
  - Why is it used?
    - The mathematical reason was explained when we calculated the gradient.
  - Implementation details:
    - Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively. They are initially set to be the same.
    - In every iteration, we draw a mini-batch of samples $\{(s,a,r,s')\}$ from the replay buffer (explained below).
    - The inputs of the networks include the state $s$ and the action $a$.
    - The target output is
      $$y_T \doteq r + \gamma \max_{a\in \mathcal{A}(s')} \hat q(s',a,w_T).$$
    - Then we directly minimize the TD error, also called the loss function,
      $$(y_T - \hat q(s,a,w))^2$$
      over the mini-batch $\{(s,a,y_T)\}$.
- Another technique: experience replay
  - Question: what is experience replay?
  - Answer:
    - After we have collected some experience samples, we do NOT use them in the order they were collected.
    - Instead, we store them in a set called the replay buffer $\mathcal{B} \doteq \{(s,a,r,s')\}$.
    - Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer.
    - The drawing of samples, called experience replay, should follow a uniform distribution (why?).
  - Question: why is experience replay necessary in deep Q-learning? Why must the replay follow a uniform distribution?
  - Answer: the answers lie in the objective function
    $$J = \mathbb{E}\left[ \left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2 \right]$$
    - $(S, A) \sim d$: the pair $(S, A)$ is an index and is treated as a single random variable.
    - $R \sim p(R|S,A)$, $S' \sim p(S'|S,A)$: $R$ and $S'$ are determined by the system model.
    - The distribution of the state-action pair $(S, A)$ is assumed to be uniform.
    - However, the samples are not collected uniformly, because they are generated consecutively by certain policies.
    - To break the correlation between consecutive samples, we can use the experience replay technique, drawing samples uniformly from the replay buffer.
    - This is the mathematical reason why experience replay is necessary and why the replay must be uniform.
Experience replay
Problem:
- If we update the network with interaction data in the order it was collected, the samples are strongly correlated (e.g., $s_t, s_{t+1}, s_{t+2}$), which violates the random-sampling assumption and can make training unstable or even divergent.
Solution:
- Introduce a replay buffer $\mathcal{B}$ that stores past experiences $(s,a,r,s')$.
- At each training step, instead of using the most recent data directly, draw a random mini-batch to break the correlation between samples.
Mathematical explanation:
The objective function is
$$J = \mathbb{E}\left[ \left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2 \right]$$
- $(S,A) \sim d$: $(S,A)$ is treated as a single random variable.
- In theory, the distribution of $(S,A)$ should be uniform.
- In practice, however, the collected data is generated by the current policy and is not uniformly distributed (it may concentrate on certain regions).
What experience replay does:
- Breaks sample correlation (avoiding bias in the gradient updates).
- Approximates uniform sampling, so that the empirical distribution of $(S,A)$ is closer to the uniform distribution assumed in theory.
- Improves data efficiency (the same sample can be reused many times).
Summary
- Experience replay addresses the problems of sample correlation and distribution mismatch.
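A minimal replay buffer is just a bounded container with uniform random sampling; the sketch below is an illustration of the idea, not the implementation used in the course.

```python
import random
from collections import deque

# A minimal replay buffer: store (s, a, r, s', done) tuples and sample
# uniformly at random to break the correlation between consecutive samples.
class ReplayBuffer:
    def __init__(self, capacity=10000, seed=0):
        self.buffer = deque(maxlen=capacity)   # old samples are dropped automatically
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling (without replacement within one mini-batch).
        return self.rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```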
Revisiting the tabular case:
- Question: why does tabular Q-learning not require experience replay?
  - Answer: there is no uniform-distribution requirement.
- Question: why does deep Q-learning involve a distribution?
  - Answer: the objective function in the deep case is a scalar average over all $(S, A)$.
    The tabular case does not involve any distribution of $S$ or $A$.
    The algorithm in the tabular case aims to solve a set of equations, one for each $(s,a)$ (the Bellman optimality equation).
- Question: can we use experience replay in tabular Q-learning?
  - Answer: yes, we can, and it is more sample-efficient (why?).
Why tabular Q-learning does not need experience replay while deep Q-learning does
Characteristics of tabular Q-learning
Storage: every state-action pair $(s,a)$ has its own table entry $Q(s,a)$.
Update: the update is local and affects only the current $(s,a)$:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]$$
No distribution requirement:
- In the tabular method we are essentially solving a system of equations (the Bellman optimality equations). As long as every $(s,a)$ is visited, the iteration converges to the optimal $Q^*$ regardless of whether the sampling distribution is uniform.
- Hence there is no need to make the sampling distribution uniform, and no need for experience replay to break sample correlation.
Characteristics of deep Q-learning
Storage: the $Q$-values are not stored in a table but approximated by a neural network:
$$Q(s,a;w) \approx Q^*(s,a)$$
- The parameters $w$ are shared, so a single update affects the estimates of all $(s,a)$, not just one table entry.
Objective function:
The deep Q-learning objective is a mean squared error (MSE):
$$J(w) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s',a';w) - Q(s,a;w)\big)^2\Big]$$
Note: the expectation here is taken with respect to the distribution of the state-action pair $(s,a)$.
Distribution issues:
- If the training samples are highly correlated (e.g., drawn consecutively from the same episode), the network overfits the local trajectory and the gradient estimates are heavily biased.
- The objective implicitly assumes $(S,A)$ is i.i.d., but in an actual RL environment the samples are sequentially correlated.
Why deep Q-learning needs experience replay
- Experience replay does two things:
  - Breaks correlation: uniform sampling from the replay buffer shuffles away the sequential correlation, approximately satisfying the i.i.d. assumption.
  - Improves sample efficiency: a sample can be used for multiple updates instead of being discarded after one.
- Mathematical explanation:
  Without experience replay, the expectation is effectively taken with
  $$(S,A) \sim d_\pi$$
  and this distribution $d_\pi$ depends strongly on the current policy and trajectory, making the gradient estimates unstable.
  With a replay buffer and uniform sampling, the sampling distribution becomes approximately uniform, which stabilizes training.
Comparison summary
- Tabular Q-learning: updates are local; no uniformity of the sampling distribution is needed; as long as all $(s,a)$ are covered, it converges.
- Deep Q-learning: updates are global and depend on the expectation distribution in the objective; experience replay is needed to keep the sample distribution approximately uniform and avoid biased gradients.
Pseudocode: Deep Q-learning (off-policy version)
- Aim: learn an optimal target network to approximate the optimal action values from the experience samples generated by a behavior policy $\pi_b$.
- Store the experience samples generated by $\pi_b$ in a replay buffer $\mathcal{B} = \{(s,a,r,s')\}$.
- For each iteration, do
  - Uniformly draw a mini-batch of samples from $\mathcal{B}$.
  - For each sample $(s,a,r,s')$, calculate the target value as
    $$y_T = r + \gamma \max_{a \in \mathcal{A}(s')} \hat q(s',a,w_T),$$
    where $w_T$ is the parameter of the target network.
  - Update the main network to minimize
    $$(y_T - \hat q(s,a,w))^2$$
    using the mini-batch $\{(s,a,y_T)\}$.
  - Set $w_T = w$ every $C$ iterations.
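Below is a minimal sketch of one iteration of this pseudocode. To stay dependency-free, $\hat q(s,a,w)$ is taken to be linear in an assumed feature vector `phi(s, a)` instead of a neural network; `buffer` is the `ReplayBuffer` sketch from the experience-replay section, and `num_actions` is likewise an assumption.

```python
import numpy as np

# One deep Q-learning iteration: sample a mini-batch, compute targets with the
# target parameters w_T held fixed, and update only the main parameters w.
def dqn_iteration(buffer, w, w_T, phi, num_actions, gamma=0.9, alpha=0.01,
                  batch_size=32):
    q = lambda s, a, weights: phi(s, a) @ weights
    batch = buffer.sample(batch_size)            # uniform mini-batch from B
    for s, a, r, s_next, done in batch:
        if done:
            y_T = r
        else:                                    # target computed with w_T fixed
            y_T = r + gamma * max(q(s_next, b, w_T) for b in range(num_actions))
        # Semi-gradient step on (y_T - q(s, a, w))^2 w.r.t. the main network only.
        w += alpha * (y_T - q(s, a, w)) * phi(s, a)
    return w

# Every C iterations, synchronize the target network: w_T = w.copy()
```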
Summary
From tabular methods (tabular Q-learning) to function approximation (Sarsa/Q-learning with function approximation) and on to deep reinforcement learning (DQN), the core idea is always to minimize a Bellman error; the differences are:
- tabular methods solve the equations directly and do not depend on the sample distribution;
- function approximation introduces gradient descent and projection;
- deep Q-learning additionally stabilizes the neural-network approximator with a target network and experience replay.