Series Contents
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning
Contents
- Series Contents
- Preface
- Algorithm for state value estimation
  - Objective function
  - Optimization algorithms
  - Selection of function approximators
  - Theoretical analysis
- Sarsa & Q-learning with function approximation
  - Sarsa with function approximation
  - Q-learning with function approximation
  - Deep Q-learning
- Summary
Preface
This series records my study notes for Prof. Zhao Shiyu's course "Mathematical Foundations of Reinforcement Learning" on Bilibili. For the course itself, please refer to:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Algorithm for state value estimation
Objective function
Formal introduction
- Let $v_\pi(s)$ and $\hat v(s,w)$ be the true state value and a function for approximation.
- Our goal is to find an optimal $w$ so that $\hat v(s,w)$ can best approximate $v_\pi(s)$ for every $s$.
- This is a policy evaluation problem. Later we will extend it to policy improvement.
- To find the optimal $w$, we need two steps.
  - The first step is to define an objective function.
  - The second step is to derive algorithms optimizing the objective function.
Problem background: value estimation with function approximation
- In practical RL, the state space can be very large (or even continuous), so we cannot store a separate $v_\pi(s)$ for every state.
- We therefore use a function approximator (e.g., a linear function or a neural network) $\hat v(s,w)$ to approximate the true $v_\pi(s)$.
- Goal: find parameters $w$ such that $\hat v(s,w)$ is as close as possible to $v_\pi(s)$.
Objective function
$$J(w)=\mathbb{E}\left[(v_\pi(S)-\hat v(S,w))^2\right].$$
- Our goal is to find the best $w$ that can minimize $J(w)$.
- The expectation is with respect to the random variable $S\in\mathcal{S}$.
Several ways to define the probability distribution of $S$:
- The first way is to use a uniform distribution.
  - That is, treat all states as equally important by setting the probability of each state to $1/|\mathcal{S}|$.
  - In this case, the objective function becomes
    $$J(w) = \mathbb{E}[(v_\pi(S) - \hat v(S, w))^2] = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} (v_\pi(s) - \hat v(s, w))^2.$$
  - Drawback: the states may not be equally important. For example, some states may rarely be visited by a policy. Hence, this way does not reflect the real dynamics of the Markov process under the given policy.
- The second way is to use the stationary distribution.
  - The stationary distribution is an important concept that will be used frequently in this course. In short, it describes the long-run behavior of a Markov process.
  - Let $\{d_\pi(s)\}_{s \in \mathcal{S}}$ denote the stationary distribution of the Markov process under policy $\pi$. By definition, $d_\pi(s) \ge 0$ and $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$.
  - The objective function can be rewritten as
    $$J(w) = \mathbb{E}[(v_\pi(S) - \hat v(S, w))^2] = \sum_{s \in \mathcal{S}} d_\pi(s)(v_\pi(s) - \hat v(s, w))^2.$$
  - This function is a weighted squared error.
  - Since more frequently visited states have higher values of $d_\pi(s)$, their weights in the objective function are also higher than those of rarely visited states.
Choice of the state distribution: two options
Key question: **with respect to which distribution of $S$ is the expectation $\mathbb{E}$ taken?** This determines which states the trained approximator will be more accurate on.
- Uniform distribution
  Approach: assume all states are equally important and assign each one the same probability:
  $$P(S=s) = \frac{1}{|\mathcal{S}|}.$$
  The objective function becomes:
  $$J(w) = \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} \big(v_\pi(s) - \hat v(s,w)\big)^2.$$
  Pros:
  - Simple and intuitive; every state is treated equally.
  Cons:
  - Unrealistic: some states occur rarely in practice (e.g., rare scenes in a game), and forcing the approximator to fit them well wastes model capacity.
  - It ignores the real dynamics of the Markov process under policy $\pi$.
- Stationary distribution
  Approach: use the long-run probability $d_\pi(s)$ of visiting each state under policy $\pi$ (i.e., the stationary distribution).
  The objective function becomes:
  $$J(w) = \sum_{s \in \mathcal{S}} d_\pi(s)\,\big(v_\pi(s) - \hat v(s,w)\big)^2.$$
  Pros:
  - Closer to reality: when the agent actually runs, it visits some states frequently and others almost never.
  - Estimates are more accurate on these high-frequency states, which improves actual performance.
  Cons:
  - Low-frequency states may be fitted poorly, which matters when some rarely visited states are nevertheless important.
Stationary Distribution:
- Distribution: Distribution of the state
- Stationary: Long-run behavior
- Summary: after the agent runs a long time following a policy, the probability that the agent is at any state can be described by this distribution.
Basic concepts related to the stationary distribution (steady-state distribution)
1. Definition of the stationary distribution
In a Markov process or Markov decision process (MDP), the agent keeps transitioning between states over time.
The stationary distribution is the probability distribution over states after the agent has run for a sufficiently long time.
Mathematically, if $\{d_\pi(s)\}_{s \in \mathcal{S}}$ is the stationary distribution under policy $\pi$, then
$$d_\pi(s') = \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \pi(a|s) P(s'|s,a),$$
and
$$\sum_{s \in \mathcal{S}} d_\pi(s) = 1, \quad d_\pi(s) \ge 0.$$
- Related basic concepts
  - Distribution
    - Literally: the probability distribution of some variable.
    - Here it is the state distribution: the probability that the agent is in each state $s \in \mathcal{S}$.
  - Stationary
    - Refers to long-run stability.
    - As $t \to \infty$, the state distribution converges to a fixed value and no longer fluctuates over time.
    - In other words, the state distribution has converged to an equilibrium.
  - Steady-state distribution / limiting distribution
    - Synonyms: the stationary distribution is also called the steady-state distribution or the limiting distribution.
    - The emphasis is that it is a stable distribution in the long-run limit.
- Significance in reinforcement learning
  Value function approximation
  In value function approximation, the objective function is defined as an expectation:
  $$J(w) = \mathbb{E}_{s \sim d_\pi} \big[(v_\pi(s) - \hat v(s,w))^2\big],$$
  - where $d_\pi$ is the stationary distribution.
  This gives frequently visited states larger weights, which better matches the policy's actual behavior.
  Policy gradient
  In policy gradient methods, the performance objective is usually written as
  $$J(\pi) = \sum_s d_\pi(s) \sum_a \pi(a|s) q_\pi(s,a).$$
  Here $d_\pi(s)$ is the long-run probability of being in state $s$ under policy $\pi$.
  The stationary distribution is therefore a core ingredient of policy gradient methods.
  Intuition
  - After the agent has followed a policy $\pi$ for a long time:
    - frequently visited states have higher probability under $d_\pi(s)$;
    - rarely visited states have probability close to $0$.
  - So $d_\pi(s)$ reflects which states "really matter in practice" under that policy.
  - Summary:
    - The stationary distribution describes the long-run probability of visiting each state under policy $\pi$.
    - It is also called the steady-state distribution or limiting distribution.
    - It plays a key role in value function approximation and policy gradient methods, because it determines the weight of each state in the optimization.
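As a quick numerical illustration of the definition above, the stationary distribution can be obtained by iterating $d_{k+1} = d_k P_\pi$ until convergence. The following is a minimal sketch in Python; the 3-state transition matrix `P_pi` is a made-up example, not one from the course.

```python
import numpy as np

# Hypothetical state-transition matrix P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
# for a small 3-state Markov chain induced by some policy pi.
P_pi = np.array([
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
])

# Power iteration: d_{k+1} = d_k P_pi, starting from an arbitrary distribution.
d = np.ones(3) / 3
for _ in range(1000):
    d_next = d @ P_pi
    if np.max(np.abs(d_next - d)) < 1e-10:
        break
    d = d_next

print("stationary distribution d_pi:", d)   # satisfies d = d P_pi (up to tolerance)
```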
Optimization algorithms
Gradient Descent for Value Function Approximation
- While we have the objective function, the next step is to optimize it.
- To minimize the objective function $J(w)$, we can use the gradient-descent algorithm:
  $$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k)$$
- The true gradient is:
  $$\begin{aligned}
  \nabla_w J(w) &= \nabla_w \mathbb{E}[(v_\pi(S) - \hat v(S,w))^2] \\
  &= \mathbb{E}[\nabla_w (v_\pi(S) - \hat v(S,w))^2] \\
  &= 2\mathbb{E}[(v_\pi(S) - \hat v(S,w))(-\nabla_w \hat v(S,w))] \\
  &= -2\mathbb{E}[(v_\pi(S) - \hat v(S,w))\nabla_w \hat v(S,w)]
  \end{aligned}$$
- The true gradient above involves the calculation of an expectation.
Stochastic Gradient
- We can use the stochastic gradient to replace the true gradient:
  $$w_{t+1} = w_t + \alpha_t (v_\pi(s_t) - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t),$$
  where $s_t$ is a sample of $S$. Here, the factor $2$ has been merged into $\alpha_t$.
- This algorithm is not implementable because it requires the true state value $v_\pi$, which is the unknown to be estimated.
- We can replace $v_\pi(s_t)$ with an approximation so that the algorithm becomes implementable.
Monte Carlo and TD Learning with Function Approximation
- First, Monte Carlo learning with function approximation.
  Let $g_t$ be the discounted return starting from $s_t$ in the episode. Then $g_t$ can be used to approximate $v_\pi(s_t)$. The algorithm becomes:
  $$w_{t+1} = w_t + \alpha_t (g_t - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t).$$
- Second, TD learning with function approximation.
  In the spirit of TD learning, $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$ can be viewed as an approximation of $v_\pi(s_t)$. Then the algorithm becomes:
  $$w_{t+1} = w_t + \alpha_t [r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)] \nabla_w \hat v(s_t, w_t).$$
In-depth explanation
Why use gradient descent?
We have an objective function:
$$J(w)=\mathbb{E}[(v_\pi(S)-\hat v(S,w))^2].$$
The goal is to minimize the error between the true state value $v_\pi(S)$ and the approximation $\hat v(S,w)$.
This is essentially a regression problem: fit a function $\hat v$ that approximates the true $v_\pi$.
We can therefore use the most common optimization method, gradient descent:
$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k).$$
- That is, each step updates the parameter $w$ so that $J(w)$ gradually decreases.
What the true gradient means
Expanding with the chain rule:
$$\nabla_w J(w) = -2\mathbb{E}[(v_\pi(S) - \hat v(S,w)) \nabla_w \hat v(S,w)].$$
Interpretation:
- The error term $(v_\pi(S) - \hat v(S,w))$ measures the gap between the prediction and the true value.
- Multiplying by $\nabla_w \hat v(S,w)$ tells us how to adjust $w$ to shrink that gap.
- The sign works out so that if the prediction is smaller than the true value, gradient descent increases $\hat v$; otherwise it decreases $\hat v$.
This is exactly the same as standard supervised regression.
Why use a stochastic gradient?
The problem is:
- the true gradient $\nabla_w J(w)$ involves an expectation over all states $S$;
- this is usually infeasible, because the state space is huge and $v_\pi(S)$ is unknown.
We therefore use SGD (stochastic gradient descent) instead:
$$w_{t+1} = w_t + \alpha_t (v_\pi(s_t) - \hat v(s_t,w_t)) \nabla_w \hat v(s_t,w_t),$$
- where $s_t$ is a sampled state.
- Benefit: a single sample suffices for an update, so the cost is low.
- Problem: it still requires $v_\pi(s_t)$, which is exactly the unknown we want to estimate.
How do we replace $v_\pi(s_t)$?
Since $v_\pi(s_t)$ cannot be obtained directly, we need quantities that approximate it:
- Monte Carlo learning with function approximation
  Use the full episode return $g_t$ as an unbiased estimate of $v_\pi(s_t)$.
  Update rule:
  $$w_{t+1} = w_t + \alpha_t (g_t - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t).$$
  Intuition:
  - $g_t$ is the cumulative (discounted) reward obtained by running from $s_t$ to the end of the episode.
  - Substitute $g_t$ for $v_\pi(s_t)$, then take a gradient step.
  - Drawback: we must wait until the whole trajectory ends before updating, and the variance is large.
- TD learning with function approximation
  Use the TD target $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$ to approximate $v_\pi(s_t)$.
  Update rule:
  $$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)\big] \nabla_w \hat v(s_t, w_t).$$
  Intuition:
  - No need to wait for the whole trajectory; only the one-step reward $r_{t+1}$ plus the prediction at the next state is needed.
  - This is bootstrapping: existing estimates are used to assist the update.
  - Pros: online learning, fast updates, low variance; cons: it may introduce bias.
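For contrast with the TD pseudocode that follows, here is a minimal sketch of the Monte Carlo update described above, assuming linear features. The environment interface (`env.reset()`, `env.step(a)` returning `(state, reward, done)`), the `policy` callable, and the feature map `phi` are placeholders for this sketch, not code from the course.

```python
import numpy as np

# Monte Carlo learning with linear function approximation: the full discounted
# return g_t replaces v_pi(s_t) in the stochastic-gradient update.
def mc_value_approx(env, policy, phi, num_features, gamma=0.9, alpha=0.01,
                    num_episodes=500):
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        # Generate one full episode following pi.
        states, rewards = [], []
        s, done = env.reset(), False
        while not done:
            states.append(s)
            s, r, done = env.step(policy(s))
            rewards.append(r)
        # Walk backwards: g_t = r_{t+1} + gamma * g_{t+1}.
        g = 0.0
        for s_t, r_t1 in zip(reversed(states), reversed(rewards)):
            g = r_t1 + gamma * g
            w += alpha * (g - phi(s_t) @ w) * phi(s_t)   # SGD step toward g_t
    return w
```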
Pseudocode: TD learning with function approximation
- Initialization: a function $\hat v(s,w)$ that is differentiable in $w$; an initial parameter $w_0$.
- Aim: approximate the true state values of a given policy $\pi$.
- For each episode generated following the policy $\pi$, do
  - For each step $(s_t, r_{t+1}, s_{t+1})$, do
    - In the general case,
      $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$$
    - In the linear case,
      $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t)$$
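A minimal sketch of this pseudocode for the linear case, under the same assumed environment interface and feature map as the Monte Carlo sketch above:

```python
import numpy as np

# Semi-gradient TD(0) with linear function approximation, v_hat(s, w) = phi(s)^T w.
# The gradient of v_hat with respect to w is simply phi(s).
def semi_gradient_td0(env, policy, phi, num_features, gamma=0.9, alpha=0.01,
                      num_episodes=500):
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            v_s = phi(s) @ w
            v_next = 0.0 if done else phi(s_next) @ w     # no bootstrap at terminal states
            td_error = r + gamma * v_next - v_s           # TD target minus prediction
            w += alpha * td_error * phi(s)
            s = s_next
    return w
```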
Selection of function approximators
- Function selection
  - The first approach, which was widely used before, is to use a linear function
    $$\hat v(s, w) = \phi^T(s) w$$
    Here, $\phi(s)$ is the feature vector, which can be a polynomial basis, a Fourier basis, etc.
  - The second approach, which is widely used nowadays, is to use a neural network as a nonlinear function approximator. The input of the NN is the state, the output is $\hat v(s,w)$, and the network parameter is $w$.
- TD-Linear
  - In the linear case where $\hat v(s, w) = \phi^T(s) w$, we have
    $$\nabla_w \hat v(s, w) = \phi(s).$$
  - Substituting the gradient into the TD algorithm
    $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$$
  - yields
    $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t),$$
  - which is the algorithm of TD learning with linear function approximation (TD-Linear).
- Disadvantages and advantages of linear function approximation
  - Disadvantages:
    - It is difficult to select appropriate feature vectors.
  - Advantages:
    - The theoretical properties of the TD algorithm in the linear case are much better understood than in the nonlinear case.
    - Linear function approximation is still powerful in the sense that the tabular representation is merely a special case of linear function approximation.
- Tabular representation as a special case of linear function approximation
  We next show that the tabular representation is a special case of linear function approximation.
  - First, consider the special feature vector for state $s$:
    $$\phi(s) = e_s \in \mathbb{R}^{|\mathcal{S}|},$$
    where $e_s$ is the vector whose $s$-th entry is $1$ and all other entries are $0$.
  - In this case,
    $$\hat v(s, w) = e_s^T w = w(s),$$
    where $w(s)$ is the $s$-th entry of $w$.
- Connection with tabular TD
  - Recall that the TD-Linear algorithm is
    $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t).$$
  - When $\phi(s_t) = e_{s_t}$, the above algorithm becomes
    $$w_{t+1} = w_t + \alpha_t \big( r_{t+1} + \gamma w_t(s_{t+1}) - w_t(s_t) \big) e_{s_t}.$$
    This is a vector equation that merely updates the $s_t$-th entry of $w_t$.
  - Multiplying both sides by $e_{s_t}^T$ gives
    $$w_{t+1}(s_t) = w_t(s_t) + \alpha_t \big( r_{t+1} + \gamma w_t(s_{t+1}) - w_t(s_t) \big),$$
    which is exactly the tabular TD algorithm.
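A tiny numerical check of this equivalence (toy numbers, not from the notes): with one-hot features the TD-Linear vector update changes exactly one entry of $w$, which is the tabular TD(0) update.

```python
import numpy as np

num_states, gamma, alpha = 4, 0.9, 0.1
w = np.array([0.0, 1.0, 2.0, 3.0])      # current parameter vector, w(s) = v_hat(s)
s_t, s_next, r = 1, 2, 0.5              # one observed transition (s_t, r_{t+1}, s_{t+1})
e = np.eye(num_states)                  # e[s] is the one-hot feature phi(s)

td_error = r + gamma * (e[s_next] @ w) - (e[s_t] @ w)
w_new = w + alpha * td_error * e[s_t]   # vector update from TD-Linear

print(w_new - w)   # nonzero only at index s_t: exactly the tabular TD(0) update
```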
Choosing an approximator
Linear or neural network?
Linear approximation: $\hat v(s,w)=\phi(s)^\top w$
You map the state to hand-designed features $\phi(s)$ (polynomial, Fourier, tile coding, one-hot, ...) and then learn a weight vector $w$.
- When to prefer it:
  - the state space is small or can be represented well by features;
  - you want interpretability and convergence guarantees (especially in the on-policy case);
  - compute or data is limited and you need a robust, low-variance learner.
Nonlinear approximation (NN): learn $\hat v(s,w)=\text{NN}(s;w)$ directly.
- When to prefer it:
  - the raw state is high-dimensional or nonlinear (images, text, complex sensors);
  - the goal is end-to-end deep RL;
  - you have enough data and compute, and can accept the tuning cost that comes with occasionally unstable training.
The essence of TD-Linear: semi-gradient + projected fixed point
In the linear case, $\nabla_w \hat v(s,w)=\phi(s)$. Substituting into the TD(0) update:
$$w_{t+1}=w_t+\alpha_t\Big[r_{t+1}+\gamma\,\hat v(s_{t+1},w_t)-\hat v(s_t,w_t)\Big]\,\phi(s_t).$$
This is a semi-gradient method: the TD target $r_{t+1}+\gamma \hat v(s_{t+1},w_t)$ is treated as a constant when differentiating with respect to $w$ (no gradient flows into $\hat v$ at the next state). Differentiating through the target instead would give a "full-gradient" algorithm, which in practice tends to be less stable.
From a geometric viewpoint, TD-Linear solves the projected Bellman equation
$$\Phi w \approx \Pi_{d_\pi}\, \mathcal T_\pi(\Phi w),$$
- where the column space of $\Phi$ is the function class spanned by the features, and $\Pi_{d_\pi}$ is the least-squares projection weighted by the stationary distribution $d_\pi$.
- Meaning: the environment pushes you toward $\mathcal T_\pi v$, but you can only stay inside the "representable" subspace, so the update is projected back into it.
Convergence (on-policy, linear, suitable step sizes): $\mathcal T_\pi$ is a contraction mapping, and under fairly general conditions semi-gradient TD(0) converges to the projected fixed point above.
Why is the tabular case a special case of linear approximation?
Take one-hot features $\phi(s)=e_s$. Then
$$\hat v(s,w)=e_s^\top w = w(s),$$
- i.e., one parameter per state.
Substituting back into TD-Linear:
$$w_{t+1}=w_t+\alpha_t\big(r_{t+1}+\gamma w_t(s_{t+1})-w_t(s_t)\big)e_{s_t}.$$
Only the $s_t$-th entry is updated; multiplying on the left by $e_{s_t}^\top$ gives
$$w_{t+1}(s_t)=w_t(s_t)+\alpha_t\big(r_{t+1}+\gamma w_t(s_{t+1})-w_t(s_t)\big),$$
which is exactly tabular TD(0).
Conclusion: tabular = linear approximation + one-hot features.
Relation to the objective function $J(w)$
In the linear on-policy setting, semi-gradient TD does not directly minimize
$$J(w)=\mathbb E_{S\sim d_\pi}\big[(v_\pi(S)-\hat v(S,w))^2\big],$$
- but instead approaches the projected Bellman solution. The two generally differ, yet on many problems this solution is both computable and effective.
If you really want to minimize $J(w)$ itself, you need access to $v_\pi$ or must approximate it with MC returns, which brings you back to the MC-with-function-approximation update
$$w_{t+1}=w_t+\alpha_t\,(g_t-\hat v(s_t,w_t))\,\phi(s_t),$$
- which has larger variance but targets the right objective.
Theoretical analysis
- The algorithm
  $$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$$
  does not minimize the following objective function:
  $$J(w) = \mathbb{E}\big[ ( v_\pi(S) - \hat v(S, w) )^2 \big]$$
- Different objective functions
  - Objective function 1: true value error
    $$J_E(w) = \mathbb{E}\big[ ( v_\pi(S) - \hat v(S, w) )^2 \big] = \| \hat v(w) - v_\pi \|^2_D$$
  - Objective function 2: Bellman error
    $$J_{BE}(w) = \| \hat v(w) - (r_\pi + \gamma P_\pi \hat v(w)) \|^2_D \doteq \| \hat v(w) - T_\pi(\hat v(w)) \|^2_D$$
    where
    $$T_\pi(x) \doteq r_\pi + \gamma P_\pi x$$
  - Objective function 3: projected Bellman error
    $$J_{PBE}(w) = \| \hat v(w) - M T_\pi(\hat v(w)) \|^2_D$$
    where $M$ is a projection matrix.
Differences between the objectives
True value error
$$J_E(w) = \mathbb{E}\big[(v_\pi(S) - \hat v(S, w))^2\big] = \| \hat v(w) - v_\pi \|^2_D$$
- Meaning: directly minimize the gap between the approximate value function $\hat v(s,w)$ and the true value function $v_\pi(s)$.
- Ideal target: this is the most natural, most intuitive objective (analogous to supervised learning).
- Problem: we do not know $v_\pi(s)$; we can only approximate it indirectly via samples and the Bellman equation, so this objective cannot be minimized directly.
Bellman error
$$J_{BE}(w) = \| \hat v(w) - T_\pi(\hat v(w)) \|^2_D$$
where
$$T_\pi(x) = r_\pi + \gamma P_\pi x$$
Meaning: measures how much $\hat v(s,w)$ violates the Bellman equation.
- The fixed point of the Bellman equation is $v_\pi$.
- If $\hat v(w)$ lies in a "perfect" function space, minimizing the Bellman error recovers the true value function.
- Problem: when the function approximator (e.g., a linear function or a neural network) cannot represent $v_\pi$ exactly, directly minimizing the Bellman error may produce a poor or even divergent solution.
Projected Bellman error
$$J_{PBE}(w) = \| \hat v(w) - M T_\pi(\hat v(w)) \|^2_D$$
- Here $M$ is a projection matrix that projects the Bellman update $T_\pi(\hat v)$ back onto the approximating function space.
- Meaning: since the approximating space (e.g., the space of linear functions) is limited, we cannot guarantee that $\hat v(w)$ satisfies the Bellman equation exactly. Instead, we require the projected Bellman update to be as close as possible to $\hat v(w)$.
- Essence: find the approximate solution within the function space that is closest to the Bellman fixed point.
- Importance: this is the objective that TD-Linear actually optimizes. The TD update implicitly performs this projection, so the convergence point minimizes the projected Bellman error rather than the true value error.
Why does TD-Linear correspond to the projected Bellman error?
The TD update is
$$w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big]\nabla_w \hat v(s_t, w_t)$$
It uses the TD target $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$, which effectively projects the Bellman update $T_\pi(\hat v)$ back onto the approximating function space.
Hence TD does not directly minimize $J_E(w)$ or $J_{BE}(w)$; it minimizes the projected Bellman error.
Summary:
- True value error: the ideal objective, but it cannot be computed directly.
- Bellman error: measures inconsistency with the Bellman equation, but can be unstable under function approximation.
- Projected Bellman error: the objective TD actually minimizes; it yields the most reasonable solution within the approximating space and guarantees convergence.
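For linear features, the projected Bellman solution has a known closed form (not derived in these notes): $w^*$ solves $\Phi^\top D (I - \gamma P_\pi)\Phi\, w = \Phi^\top D\, r_\pi$ with $D = \mathrm{diag}(d_\pi)$, while the minimizer of the true value error is the $D$-weighted least-squares fit of $v_\pi$. The sketch below compares the two on a made-up 3-state chain; the transition matrix, rewards, and features are illustrative assumptions.

```python
import numpy as np

# Hypothetical 3-state chain under some policy pi (not from the course notes).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.2, 0.5, 0.3],
                 [0.4, 0.4, 0.2]])
r_pi = np.array([1.0, 0.0, -1.0])      # expected immediate reward per state
gamma = 0.9

# True state values: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Stationary distribution d_pi (left eigenvector of P_pi for eigenvalue 1).
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d_pi = d_pi / d_pi.sum()
D = np.diag(d_pi)

# A two-dimensional linear feature map (fewer features than states).
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])

# TD fixed point: Phi^T D (I - gamma P_pi) Phi w = Phi^T D r_pi.
A = Phi.T @ D @ (np.eye(3) - gamma * P_pi) @ Phi
b = Phi.T @ D @ r_pi
w_td = np.linalg.solve(A, b)

# Minimizer of the true value error J_E(w) = ||Phi w - v_pi||_D^2.
w_je = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ v_pi)

print("v_pi           :", v_pi)
print("TD fixed point :", Phi @ w_td)   # projected Bellman solution
print("min J_E(w)     :", Phi @ w_je)   # best fit of v_pi in the feature space
```

Running it shows the two solutions generally differ, which is exactly the gap between $J_{PBE}$ and $J_E$ discussed above.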
Sarsa & Q-learning with function approximation
Sarsa with function approximation
Sarsa algorithm
- So far, we have only considered the problem of state value estimation. That is, we hope
  $$\hat v \approx v_\pi$$
- To search for optimal policies, we need to estimate action values.
- The Sarsa algorithm with value function approximation is
  $$w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)\Big] \nabla_w \hat q(s_t, a_t, w_t).$$
- This is the same as the algorithm introduced previously in this lecture except that $\hat v$ is replaced by $\hat q$.
Pseudocode: Sarsa with function approximation
- Aim: search for a policy that can lead the agent to the target from an initial state-action pair $(s_0, a_0)$.
- For each episode, do
  - If the current $s_t$ is not the target state, do
    - Take action $a_t$ following $\pi_t(s_t)$, generate $r_{t+1}, s_{t+1}$, and then take action $a_{t+1}$ following $\pi_t(s_{t+1})$.
    - Value update (parameter update):
      $$w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)\Big]\nabla_w \hat q(s_t, a_t, w_t)$$
    - Policy update ($\varepsilon$-greedy):
      $$\pi_{t+1}(a|s_t) = 1 - \frac{\varepsilon}{|\mathcal{A}(s_t)|} (|\mathcal{A}(s_t)| - 1) \quad \text{if } a = \arg\max_{a' \in \mathcal{A}(s_t)} \hat q(s_t, a', w_{t+1})$$
      $$\pi_{t+1}(a|s_t) = \frac{\varepsilon}{|\mathcal{A}(s_t)|} \quad \text{otherwise}$$
Sarsa with function approximation
Formula:
$$w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t).$$
Meaning:
Here $\hat q(s, a, w)$ is the approximate action-value function, represented by parameters $w$ (e.g., a linear function or a neural network).
The TD target uses the next action $a_{t+1}$ that is actually executed:
$$r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t)$$
This means Sarsa is an on-policy algorithm:
- the behavior policy $\pi$ is used both to generate data (choosing $a_t$, $a_{t+1}$),
- and to update the value function.
The update direction is determined by the TD error:
$$\delta_t = r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)$$
- and the parameters of $\hat q(s_t, a_t, w_t)$ are then adjusted along the gradient.
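A minimal sketch of Sarsa with linear function approximation and an $\varepsilon$-greedy behavior policy. The feature map `phi(s, a)`, the action count `num_actions`, and the environment interface are assumptions of this sketch rather than the course's implementation.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def sarsa_fa(env, phi, num_features, num_actions, gamma=0.9, alpha=0.01,
             epsilon=0.1, num_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(num_features)
    q = lambda s, a: phi(s, a) @ w                      # \hat q(s, a, w)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy([q(s, b) for b in range(num_actions)], epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                              # no bootstrap at terminal states
            else:
                a_next = epsilon_greedy(
                    [q(s_next, b) for b in range(num_actions)], epsilon, rng)
                target = r + gamma * q(s_next, a_next)  # uses the action actually taken
            w += alpha * (target - q(s, a)) * phi(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```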
Q-learning with function approximation
Q-learning algorithm
- Similar to Sarsa, tabular Q-learning can also be extended to the case of value function approximation.
- The q-value update rule is
  $$w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t),$$
- which is the same as Sarsa except that $\hat q(s_{t+1}, a_{t+1}, w_t)$ is replaced by $\max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t)$.
Pseudocode: Q-learning with function approximation (on-policy version)
- Initialization: initial parameter vector $w_0$; initial policy $\pi_0$; a small $\varepsilon > 0$.
- Aim: search for a good policy that can lead the agent to the target from an initial state-action pair $(s_0, a_0)$.
- For each episode, do
  - If the current $s_t$ is not the target state, do
    - Take action $a_t$ following $\pi_t(s_t)$, and generate $r_{t+1}, s_{t+1}$.
    - Value update (parameter update):
      $$w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t)$$
    - Policy update ($\varepsilon$-greedy):
      $$\pi_{t+1}(a|s_t) = 1 - \frac{\varepsilon}{|\mathcal{A}(s_t)|} (|\mathcal{A}(s_t)| - 1) \quad \text{if } a = \arg\max_{a' \in \mathcal{A}(s_t)} \hat q(s_t, a', w_{t+1})$$
      $$\pi_{t+1}(a|s_t) = \frac{\varepsilon}{|\mathcal{A}(s_t)|} \quad \text{otherwise}$$
Q-learning with function approximation
Formula:
$$w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t).$$
Meaning:
As before, $\hat q(s, a, w)$ is a parameterized Q function.
The TD target uses the maximum over all possible actions at the next state:
$$r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t)$$
This means Q-learning is an off-policy algorithm:
- the behavior policy can be exploratory, e.g., $\epsilon$-greedy;
- but the update assumes the agent always chooses the best action, because of the $\max$.
The update is still based on the TD error:
$$\delta_t = r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t)$$
- followed by a parameter update.
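Relative to the Sarsa sketch above, only the TD target changes: the greedy (max) bootstrap replaces the value of the action actually taken. A hedged sketch of just that piece, reusing the same assumed `phi` and `num_actions`:

```python
# Q-learning with linear function approximation: the only difference from the
# Sarsa sketch above is the TD target, which bootstraps from the greedy action.
def q_learning_target(r, s_next, done, phi, w, num_actions, gamma=0.9):
    if done:
        return r
    q_next = [phi(s_next, b) @ w for b in range(num_actions)]
    return r + gamma * max(q_next)        # max over a in A(s_{t+1})

# Inside the learning loop the parameter update itself is unchanged:
#   w += alpha * (q_learning_target(...) - phi(s, a) @ w) * phi(s, a)
```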
Deep Q-learning
Objective function
- Definition
  - Deep Q-learning aims to minimize the objective (loss) function
    $$J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right],$$
    where $(S, A, R, S')$ are random variables.
  - This is actually the Bellman optimality error.
  - That is because
    $$q(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a \Big], \quad \forall s, a$$
  - The value of
    $$R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)$$
    should be zero in the expectation sense.
- How to minimize the objective function? Gradient descent!
  - In this objective function
    $$J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right],$$
    the parameter $w$ appears not only in $\hat{q}(S, A, w)$ but also in
    $$y \doteq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w).$$
  - For the sake of simplicity, we can assume that the $w$ in $y$ is fixed (at least for a while) when we calculate the gradient.
The relation between the deep Q-learning objective and the Bellman optimality error:
The deep Q-learning objective is
$$J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right]$$
- where:
  - $R$: the reward received after taking action $A$ in the current state;
  - $S'$: the next state;
  - $\hat{q}(S, A, w)$: the $Q$-value approximation produced by a neural network with parameters $w$;
  - $\max_{a \in \mathcal{A}(S')}$: the $Q$-value of the best action at the next state.
- This objective is a mean squared error (MSE): it measures the gap between the network output $\hat{q}(S, A, w)$ and the target value $R + \gamma \max_{a} \hat{q}(S', a, w)$.
Why does it correspond to the Bellman optimality error?
The Bellman optimality equation defines the recursion satisfied by the optimal $Q$-values:
$$q(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a' \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big]$$
In other words, if $\hat{q}$ were optimal, then
$$R + \gamma \max_{a} \hat{q}(S', a, w) - \hat{q}(S, A, w) = 0$$
- would hold in the expectation sense.
In practice, however, $\hat{q}$ is an approximation and cannot satisfy the Bellman equation exactly.
- We therefore define this residual as the Bellman optimality error and minimize it to approach the optimal Q-values.
Why is the optimization tricky?
In the loss function
$$J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right]$$
- the parameter $w$ appears both in the current prediction $\hat{q}(S, A, w)$ and in the target value $R + \gamma \max_{a} \hat{q}(S', a, w)$.
- This makes the gradient computation awkward, because we would have to differentiate through both the prediction and the target.
To simplify the computation, DQN uses a fixed target network:
- the parameters $w$ inside the target $R + \gamma \max_{a} \hat{q}(S', a, w)$ are held fixed for a while;
- only the current Q network's parameters are updated.
This avoids the complication of propagating gradients through the target.
Two networks
- Introduction
  - One is a main network representing $\hat q(s,a,w)$.
  - The other is a target network $\hat q(s,a,w_T)$.
  - The objective function in this case degenerates to
    $$J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)^2\Big],$$
    where $w_T$ is the target network parameter.
- Gradient with a fixed target network
  - When $w_T$ is fixed, the gradient of $J$ can be easily obtained as
    $$\nabla_w J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)\nabla_w \hat q(S,A,w)\Big].$$
  - The basic idea of deep Q-learning is to use the gradient-descent algorithm to minimize the objective function.
In DQN, if a single network $\hat q(s,a,w)$ is used both to estimate the $Q$-values and to update the parameters, training becomes unstable: the target value (TD target) and the prediction depend on the same network, so parameter updates interfere with each other.
Solution:
- Introduce two networks:
  - Main network: $\hat q(s,a,w)$, used for learning and parameter updates.
  - Target network: $\hat q(s,a,w_T)$, used to produce relatively stable target values.
  - The parameters $w_T$ are synchronized from $w$ periodically (e.g., copied every $C$ steps).
- The target value then does not change at every step as $w$ is updated, which reduces training oscillation.
Objective function:
$$J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)^2\Big]$$
- Current Q-value estimate: $\hat q(S,A,w)$
- Target Q-value (TD target): $R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T)$
Gradient-descent update:
$$\nabla_w J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)\nabla_w \hat q(S,A,w)\Big]$$
- With $w_T$ fixed, the gradient computation is clean and is not perturbed by the target changing at the same time.
Summary
- The two networks (main and target) address the problem of unstable target values.
Two techniques
- First technique: two networks, a main network and a target network.
  - Why is it used?
    - The mathematical reason was explained when we calculated the gradient.
  - Implementation details:
    - Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively. They are initially set to be the same.
    - In every iteration, we draw a mini-batch of samples $\{(s,a,r,s')\}$ from the replay buffer (explained below).
    - The inputs of the networks include the state $s$ and the action $a$.
    - The target output is
      $$y_T \doteq r + \gamma \max_{a\in \mathcal{A}(s')} \hat q(s',a,w_T).$$
    - Then we directly minimize the TD error, also called the loss function,
      $$(y_T - \hat q(s,a,w))^2$$
      over the mini-batch $\{(s,a,y_T)\}$.
- Another technique: experience replay
  - Question: what is experience replay?
  - Answer:
    - After we have collected some experience samples, we do NOT use them in the order they were collected.
    - Instead, we store them in a set called the replay buffer $\mathcal{B} \doteq \{(s,a,r,s')\}$.
    - Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer.
    - The drawing of samples, called experience replay, should follow a uniform distribution (why?).
  - Question: why is experience replay necessary in deep Q-learning? Why must the replay follow a uniform distribution?
  - Answer: the answers lie in the objective function
    $$J = \mathbb{E}\left[ \left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2 \right]$$
    - $(S, A) \sim d$: the pair $(S, A)$ is an index and is treated as a single random variable.
    - $R \sim p(R|S,A)$, $S' \sim p(S'|S,A)$: $R$ and $S'$ are determined by the system model.
    - The distribution of the state-action pair $(S, A)$ is assumed to be uniform.
    - However, the samples are not collected uniformly, because they are generated consecutively by certain policies.
    - To break the correlation between consecutive samples, we can use the experience replay technique, drawing samples uniformly from the replay buffer.
    - This is the mathematical reason why experience replay is necessary and why the replay must be uniform.
Experience replay
Problem:
- If we update the network with interaction data in the order it was collected, the samples are strongly correlated (e.g., $s_t, s_{t+1}, s_{t+2}$), which violates the random-sampling assumption and can make training unstable or even divergent.
Solution:
- Introduce a replay buffer $\mathcal{B}$ that stores past experiences $(s,a,r,s')$.
- At each training step, instead of using the most recent data directly, draw a random mini-batch to break the correlation between samples.
Mathematical explanation:
The objective function is
$$J = \mathbb{E}\left[ \left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2 \right]$$
- $(S,A) \sim d$: $(S,A)$ is treated as a single random variable.
- In theory, the distribution of $(S,A)$ should be uniform.
- In practice, however, the collected data is generated by the current policy and is not uniformly distributed (it may concentrate on certain regions).
What experience replay does:
- Breaks sample correlation (avoiding bias in the gradient updates).
- Approximates uniform sampling, so that the empirical distribution of $(S,A)$ is closer to the uniform distribution assumed in theory.
- Improves data efficiency (the same sample can be reused many times).
Summary
- Experience replay addresses the problems of sample correlation and distribution mismatch.
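A minimal replay buffer is just a bounded container with uniform random sampling; the sketch below is an illustration of the idea, not the implementation used in the course.

```python
import random
from collections import deque

# A minimal replay buffer: store (s, a, r, s', done) tuples and sample
# uniformly at random to break the correlation between consecutive samples.
class ReplayBuffer:
    def __init__(self, capacity=10000, seed=0):
        self.buffer = deque(maxlen=capacity)   # old samples are dropped automatically
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling (without replacement within one mini-batch).
        return self.rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```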
Revisiting the tabular case:
- Question: why does tabular Q-learning not require experience replay?
  - Answer: there is no uniform-distribution requirement.
- Question: why does deep Q-learning involve a distribution?
  - Answer: the objective function in the deep case is a scalar average over all $(S, A)$.
    The tabular case does not involve any distribution of $S$ or $A$.
    The algorithm in the tabular case aims to solve a set of equations, one for each $(s,a)$ (the Bellman optimality equation).
- Question: can we use experience replay in tabular Q-learning?
  - Answer: yes, we can, and it is more sample-efficient (why?).
Why tabular Q-learning does not need experience replay while deep Q-learning does
Characteristics of tabular Q-learning
Storage: every state-action pair $(s,a)$ has its own table entry $Q(s,a)$.
Update: the update is local and affects only the current $(s,a)$:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]$$
No distribution requirement:
- In the tabular method we are essentially solving a system of equations (the Bellman optimality equations). As long as every $(s,a)$ is visited, the iteration converges to the optimal $Q^*$ regardless of whether the sampling distribution is uniform.
- Hence there is no need to make the sampling distribution uniform, and no need for experience replay to break sample correlation.
Characteristics of deep Q-learning
Storage: the $Q$-values are not stored in a table but approximated by a neural network:
$$Q(s,a;w) \approx Q^*(s,a)$$
- The parameters $w$ are shared, so a single update affects the estimates of all $(s,a)$, not just one table entry.
Objective function:
The deep Q-learning objective is a mean squared error (MSE):
$$J(w) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s',a';w) - Q(s,a;w)\big)^2\Big]$$
Note: the expectation here is taken with respect to the distribution of the state-action pair $(s,a)$.
Distribution issues:
- If the training samples are highly correlated (e.g., drawn consecutively from the same episode), the network overfits the local trajectory and the gradient estimates are heavily biased.
- The objective implicitly assumes $(S,A)$ is i.i.d., but in an actual RL environment the samples are sequentially correlated.
Why deep Q-learning needs experience replay
- Experience replay does two things:
  - Breaks correlation: uniform sampling from the replay buffer shuffles away the sequential correlation, approximately satisfying the i.i.d. assumption.
  - Improves sample efficiency: a sample can be used for multiple updates instead of being discarded after one.
- Mathematical explanation:
  Without experience replay, the expectation is effectively taken with
  $$(S,A) \sim d_\pi$$
  and this distribution $d_\pi$ depends strongly on the current policy and trajectory, making the gradient estimates unstable.
  With a replay buffer and uniform sampling, the sampling distribution becomes approximately uniform, which stabilizes training.
Comparison summary
- Tabular Q-learning: updates are local; no uniformity of the sampling distribution is needed; as long as all $(s,a)$ are covered, it converges.
- Deep Q-learning: updates are global and depend on the expectation distribution in the objective; experience replay is needed to keep the sample distribution approximately uniform and avoid biased gradients.
Pseudocode: Deep Q-learning (off-policy version)
- Aim: learn an optimal target network to approximate the optimal action values from the experience samples generated by a behavior policy $\pi_b$.
- Store the experience samples generated by $\pi_b$ in a replay buffer $\mathcal{B} = \{(s,a,r,s')\}$.
- For each iteration, do
  - Uniformly draw a mini-batch of samples from $\mathcal{B}$.
  - For each sample $(s,a,r,s')$, calculate the target value as
    $$y_T = r + \gamma \max_{a \in \mathcal{A}(s')} \hat q(s',a,w_T),$$
    where $w_T$ is the parameter of the target network.
  - Update the main network to minimize
    $$(y_T - \hat q(s,a,w))^2$$
    using the mini-batch $\{(s,a,y_T)\}$.
  - Set $w_T = w$ every $C$ iterations.
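Below is a minimal sketch of one iteration of this pseudocode. To stay dependency-free, $\hat q(s,a,w)$ is taken to be linear in an assumed feature vector `phi(s, a)` instead of a neural network; `buffer` is the `ReplayBuffer` sketch from the experience-replay section, and `num_actions` is likewise an assumption.

```python
import numpy as np

# One deep Q-learning iteration: sample a mini-batch, compute targets with the
# target parameters w_T held fixed, and update only the main parameters w.
def dqn_iteration(buffer, w, w_T, phi, num_actions, gamma=0.9, alpha=0.01,
                  batch_size=32):
    q = lambda s, a, weights: phi(s, a) @ weights
    batch = buffer.sample(batch_size)            # uniform mini-batch from B
    for s, a, r, s_next, done in batch:
        if done:
            y_T = r
        else:                                    # target computed with w_T fixed
            y_T = r + gamma * max(q(s_next, b, w_T) for b in range(num_actions))
        # Semi-gradient step on (y_T - q(s, a, w))^2 w.r.t. the main network only.
        w += alpha * (y_T - q(s, a, w)) * phi(s, a)
    return w

# Every C iterations, synchronize the target network: w_T = w.copy()
```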
Summary
From tabular methods (tabular Q-learning) to function approximation (Sarsa/Q-learning with function approximation) and on to deep reinforcement learning (DQN), the core idea is always to minimize a Bellman error; the differences are:
- tabular methods solve the equations directly and do not depend on the sample distribution;
- function approximation introduces gradient descent and projection;
- deep Q-learning additionally stabilizes the neural-network approximator with a target network and experience replay.