RL【8】:Value Function Approximation

Series Contents

Fundamental Tools

RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation

Algorithm

RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent

Method

RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning


Article Contents

  • Series Contents
  • Preface
  • Algorithm for state value estimation
    • Objective function
    • Optimization algorithms
    • Selection of function approximators
    • Theoretical analysis
  • Sarsa & Q-learning with function approximation
    • Sarsa with function approximation
    • Q-learning with function approximation
  • Deep Q-learning
  • Summary

Preface

This series records my study notes for Prof. Shiyu Zhao's Bilibili course 【强化学习的数学原理】 (Mathematical Foundations of Reinforcement Learning). For the original course content, see:

Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)

GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning


Algorithm for state value estimation

Objective function

Formal introduction

  • Let $v_\pi(s)$ and $\hat v(s,w)$ be the true state value and a function for approximation, respectively.
  • Our goal is to find an optimal $w$ so that $\hat v(s,w)$ can best approximate $v_\pi(s)$ for every $s$.
  • This is a policy evaluation problem. Later we will extend it to policy improvement.
  • To find the optimal $w$, we need two steps.
    • The first step is to define an objective function.
    • The second step is to derive algorithms that optimize the objective function.

Background: value estimation with function approximation

  • In practice, the state space can be very large (or even continuous), so we cannot store a separate $v_\pi(s)$ for every state.
  • Instead, we use a function approximator (e.g., a linear function or a neural network) $\hat v(s,w)$ to approximate the true $v_\pi(s)$.
  • Goal: find a parameter vector $w$ such that $\hat v(s,w)$ is as close to $v_\pi(s)$ as possible.

Objective function

$J(w)=\mathbb{E}\left[(v_\pi(S)-\hat v(S,w))^2\right].$

  • Our goal is to find the best $w$ that minimizes $J(w)$.
  • The expectation is with respect to the random variable $S\in\mathcal{S}$.

Several ways to define the probability distribution of $S$

  • The first way is to use a uniform distribution.
    • That is, treat all states as equally important by setting the probability of each state to $1/|\mathcal{S}|$.

    • In this case, the objective function becomes

      $J(w) = \mathbb{E}[(v_\pi(S) - \hat v(S, w))^2] = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} (v_\pi(s) - \hat v(s, w))^2.$

    • Drawback:

      • The states may not be equally important. For example, some states may be rarely visited by a policy. Hence, this way does not reflect the real dynamics of the Markov process under the given policy.
  • The second way is to use the stationary distribution.
    • The stationary distribution is an important concept that will be used frequently in this course. In short, it describes the long-run behavior of a Markov process.

    • Let $\{d_\pi(s)\}_{s \in \mathcal{S}}$ denote the stationary distribution of the Markov process under policy $\pi$. By definition, $d_\pi(s) \ge 0$ and $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$.

    • The objective function can be rewritten as

      $J(w) = \mathbb{E}[(v_\pi(S) - \hat v(S, w))^2] = \sum_{s \in \mathcal{S}} d_\pi(s)(v_\pi(s) - \hat v(s, w))^2.$

    • This function is a weighted squared error.

    • Since more frequently visited states have larger values of $d_\pi(s)$, their weights in the objective function are also higher than those of rarely visited states.

Choosing the state distribution: two options

Key question: with respect to which distribution of $S$ is the expectation $\mathbb{E}$ taken? The choice determines which states the trained approximator will be more accurate on.

  1. Uniform distribution
    • Approach: assume all states are equally important and assign each state the same probability:

      $P(S=s) = \frac{1}{|\mathcal{S}|}.$

    • The objective function becomes:

      $J(w) = \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} \big(v_\pi(s) - \hat v(s,w)\big)^2.$

    • Pros

      • Simple and intuitive; every state is treated equally.
    • Cons

      • Unrealistic. Some states are rarely encountered in practice (e.g., rare scenes in a game), and forcing the model to fit them well wastes model capacity.
      • It ignores the actual dynamics of the Markov process under policy $\pi$.
  2. Stationary distribution
    • Approach: use the long-run probability $d_\pi(s)$ of each state under policy $\pi$ (i.e., the stationary distribution).

    • The objective function becomes:

      $J(w) = \sum_{s \in \mathcal{S}} d_\pi(s)\big(v_\pi(s) - \hat v(s,w)\big)^2.$

    • Pros

      • Closer to reality: when the agent actually runs, it visits some states frequently and others almost never.
      • The estimates are more accurate on these frequently visited states, which improves actual performance.
    • Cons

      • Rarely visited states may be fit poorly, which hurts when some of them are important despite being infrequent.

Stationary Distribution:

  • Distribution: Distribution of the state
  • Stationary: Long-run behavior
  • Summary: after the agent runs a long time following a policy, the probability that the agent is at any state can be described by this distribution.

Basic concepts related to the stationary distribution

  1. Definition of the stationary distribution

    • In a Markov process or a Markov decision process (MDP), the agent keeps transitioning between states over time.

    • The stationary distribution is the probability distribution over states after the agent has run for a sufficiently long time.

    • Mathematically, if $\{d_\pi(s)\}_{s \in \mathcal{S}}$ is the stationary distribution under policy $\pi$, then

      $d_\pi(s') = \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \pi(a|s) P(s'|s,a),$

      • and

        $\sum_{s \in \mathcal{S}} d_\pi(s) = 1, \quad d_\pi(s) \ge 0.$

  2. Related concepts
    1. Distribution
      • Literally, the probability distribution of some variable.
      • Here it is the state distribution: the probability that the agent is in each state $s \in \mathcal{S}$.
    2. Stationary
      • Refers to the long-run, stable regime.
      • As $t \to \infty$, the state distribution converges to a fixed value and no longer fluctuates over time.
      • In other words, the state distribution has converged to an equilibrium.
    3. Steady-state distribution / limiting distribution
      • Synonyms: the stationary distribution is also called the steady-state distribution or the limiting distribution.
      • Both names emphasize that it is a stable distribution in the long-run limit.
  3. Role in reinforcement learning
    1. Value function approximation

      • In value function approximation, the objective function is defined as an expectation:

        $J(w) = \mathbb{E}_{S \sim d_\pi} \big[(v_\pi(S) - \hat v(S,w))^2\big],$

        • where $d_\pi$ is exactly the stationary distribution.
      • This assigns larger weights to frequently visited states, which better reflects the policy's actual behavior.

    2. Policy gradient

      • In policy gradient methods, the performance objective is often written as

        $J(\pi) = \sum_s d_\pi(s) \sum_a \pi(a|s) q_\pi(s,a).$

      • Here $d_\pi(s)$ is the probability that the agent is in state $s$ in the long run under policy $\pi$.

      • Hence the stationary distribution is a core ingredient of policy gradient methods.

    3. Intuition

      • After the agent has followed a policy $\pi$ for a long time:
        • frequently visited states have higher probability under $d_\pi(s)$;
        • rarely visited states have probability close to $0$.
      • So $d_\pi(s)$ reflects which states actually matter under that policy.
  4. Summary
    • The stationary distribution describes the long-run probability of visiting each state under policy $\pi$.
    • It is also called the steady-state distribution or the limiting distribution.
    • It plays a key role in both value function approximation and policy gradient methods, because it determines the weight of each state in the optimization objective. A small numerical sketch of how $d_\pi$ can be computed is given below.

Optimization algorithms

Gradient Descent for Value Function Approximation

  • Given the objective function, the next step is to optimize it.

  • To minimize the objective function $J(w)$, we can use the gradient-descent algorithm:

    $w_{k+1} = w_k - \alpha_k \nabla_w J(w_k)$

  • The true gradient is:

    $\nabla_w J(w) = \nabla_w \mathbb{E}[(v_\pi(S) - \hat v(S,w))^2]$

    $= \mathbb{E}[\nabla_w (v_\pi(S) - \hat v(S,w))^2]$

    $= 2\mathbb{E}[(v_\pi(S) - \hat v(S,w))(-\nabla_w \hat v(S,w))]$

    $= -2\mathbb{E}[(v_\pi(S) - \hat v(S,w))\nabla_w \hat v(S,w)]$

  • The true gradient above involves the calculation of an expectation.

Stochastic Gradient

  • We can use the stochastic gradient to replace the true gradient:

    $w_{t+1} = w_t + \alpha_t (v_\pi(s_t) - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t),$

    • where $s_t$ is a sample of $S$. Here, $2\alpha_t$ is merged into $\alpha_t$.
  • This algorithm is not implementable because it requires the true state value $v_\pi$, which is exactly the unknown to be estimated.

  • We can replace $v_\pi(s_t)$ with an approximation so that the algorithm becomes implementable.

Monte Carlo and TD Learning with Function Approximation

  • First, Monte Carlo learning with function approximation.

    Let $g_t$ be the discounted return starting from $s_t$ in the episode. Then $g_t$ can be used to approximate $v_\pi(s_t)$. The algorithm becomes:

    $w_{t+1} = w_t + \alpha_t (g_t - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t).$

  • Second, TD learning with function approximation.

    In the spirit of TD learning, $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$ can be viewed as an approximation of $v_\pi(s_t)$. The algorithm then becomes:

    $w_{t+1} = w_t + \alpha_t [r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)] \nabla_w \hat v(s_t, w_t).$
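To make the first variant concrete, here is a minimal Python sketch of Monte Carlo learning with a linear approximator $\hat v(s,w)=\phi(s)^\top w$ (the linear form is introduced formally below). `sample_episode` and `phi` are hypothetical helpers, so treat this as an illustration under my own assumptions, not the course's reference code; a TD sketch follows the pseudocode further below.

```python
import numpy as np

def mc_value_estimation(sample_episode, phi, dim, gamma=0.9, alpha=0.01, num_episodes=1000):
    """Monte Carlo learning with linear function approximation.

    sample_episode() is assumed to return a list of (s, r_next) pairs
    generated by following the policy pi; phi(s) returns a feature vector.
    """
    w = np.zeros(dim)
    for _ in range(num_episodes):
        episode = sample_episode()                 # [(s_0, r_1), (s_1, r_2), ...]
        g = 0.0
        # Walk backwards so g accumulates the discounted return from each state.
        for s, r_next in reversed(episode):
            g = r_next + gamma * g                 # g_t = r_{t+1} + gamma * g_{t+1}
            v_hat = phi(s) @ w                     # current estimate v_hat(s, w)
            w += alpha * (g - v_hat) * phi(s)      # SGD step toward the MC target g_t
    return w
```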

In-depth explanation

  1. Why gradient descent?

    • We have an objective function:

      $J(w)=\mathbb{E}[(v_\pi(S)-\hat v(S,w))^2].$

    • The goal is to minimize the error between the true state value $v_\pi(S)$ and the approximation $\hat v(S,w)$.

    • This is essentially a regression problem: fit a function $\hat v$ to approximate the true $v_\pi$.

    • So we can use the most common optimization method, gradient descent:

      $w_{k+1} = w_k - \alpha_k \nabla_w J(w_k).$

      • That is, each step updates the parameters $w$ so that $J(w)$ gradually decreases.
  2. The meaning of the true gradient

    • Expanding by the chain rule:

      $\nabla_w J(w) = -2\mathbb{E}[(v_\pi(S) - \hat v(S,w)) \nabla_w \hat v(S,w)].$

    • Interpretation:

      • The error term $(v_\pi(S) - \hat v(S,w))$ measures the gap between the prediction and the true value.
      • Multiplying it by $\nabla_w \hat v(S,w)$ tells us how to adjust the parameters $w$ to shrink that gap.
      • The sign means: if the prediction is smaller than the true value, increase $\hat v$; otherwise decrease it.
    • This is exactly the same as standard supervised regression.

  3. Why stochastic gradients?

    • The problem is:

      • the true gradient $\nabla_w J(w)$ involves an expectation over all states $S$;
      • this is usually infeasible, since the state space is huge and $v_\pi(S)$ is unknown.
    • So we replace it with SGD (stochastic gradient descent):

      $w_{t+1} = w_t + \alpha_t (v_\pi(s_t) - \hat v(s_t,w_t)) \nabla_w \hat v(s_t,w_t),$

      • where $s_t$ is a sampled state.
        • Benefit: a single sample is enough for one update, which is cheap.
        • Problem: it still requires $v_\pi(s_t)$, which is exactly the unknown we want to estimate.
  4. How do we replace $v_\pi(s_t)$?

    Since $v_\pi(s_t)$ is not directly available, we need a quantity that approximates it:

    1. Monte Carlo learning with function approximation
      • Use the full episode return $g_t$ as an unbiased estimate of $v_\pi(s_t)$.

      • Update rule:

        $w_{t+1} = w_t + \alpha_t (g_t - \hat v(s_t, w_t)) \nabla_w \hat v(s_t, w_t).$

      • Intuition:

        • $g_t$ is the cumulative reward obtained by starting from $s_t$ and running to the end of the episode.
        • Replace $v_\pi(s_t)$ with $g_t$ and take a gradient step.
        • Drawback: we must wait until the whole trajectory ends before updating, and the variance is large.
    2. TD learning with function approximation
      • Use the TD target $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$ to approximate $v_\pi(s_t)$.

      • Update rule:

        $w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)\big] \nabla_w \hat v(s_t, w_t).$

      • Intuition:

        • No need to wait for the whole trajectory: use only the one-step reward $r_{t+1}$ plus the prediction at the next state.
        • This is bootstrapping: use the current estimate to help update itself.
        • Pros: online learning, fast updates, low variance. Cons: it may introduce bias.

Pseudocode: TD learning with function approximation

  • Initialization: a function $\hat v(s,w)$ that is differentiable in $w$; an initial parameter $w_0$.
  • Aim: approximate the true state values of a given policy $\pi$.
  • For each episode generated by following the policy $\pi$, do
    • For each step $(s_t, r_{t+1}, s_{t+1})$, do
      • In the general case,

        $w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$

      • In the linear case,

        $w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t)$
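The linear case of this pseudocode maps almost line for line onto code. Below is a minimal sketch under my own assumptions (an `env` whose `step` returns `(s_next, r, done)` and a feature map `phi`); it is an illustration, not the course's implementation.

```python
import numpy as np

def td_linear(env, policy, phi, dim, gamma=0.9, alpha=0.05, num_episodes=500):
    """TD(0) with linear function approximation: v_hat(s, w) = phi(s)^T w."""
    w = np.zeros(dim)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * phi(s_next) @ w)  # TD target
            td_error = target - phi(s) @ w
            w += alpha * td_error * phi(s)      # semi-gradient update
            s = s_next
    return w
```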

Selection of function approximators

  1. Function selection

    • The first approach, widely used in the past, is to use a linear function:

      $\hat v(s, w) = \phi^T(s) w$

      Here, $\phi(s)$ is the feature vector, which can be a polynomial basis, a Fourier basis, etc.

    • The second approach, widely used nowadays, is to use a neural network as a nonlinear function approximator. The input of the network is the state, the output is $\hat v(s,w)$, and the network parameter is $w$.

  2. TD-Linear

    • In the linear case where $\hat v(s, w) = \phi^T(s) w$, we have

      $\nabla_w \hat v(s, w) = \phi(s).$

    • Substituting this gradient into the TD algorithm

      $w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$

    • yields

      $w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t),$

    • which is TD learning with linear function approximation (TD-Linear).

  3. Disadvantages and advantages of linear function approximation

    • Disadvantages:
      • It is difficult to select appropriate feature vectors.
    • Advantages:
      • The theoretical properties of the TD algorithm in the linear case are much better understood than in the nonlinear case.
      • Linear function approximation is still powerful in the sense that the tabular representation is merely a special case of it.
  4. Tabular representation as a special case of linear function approximation

    We next show that the tabular representation is a special case of linear function approximation (a small numerical check is sketched after this list).

    • First, consider the special feature vector for state $s$:

      $\phi(s) = e_s \in \mathbb{R}^{|\mathcal{S}|},$

      • where $e_s$ is a vector whose $s$-th entry is $1$ and whose other entries are $0$.
    • In this case,

      $\hat v(s, w) = e_s^T w = w(s),$

      • where $w(s)$ is the $s$-th entry of $w$.
  5. Connection with tabular TD

    • Recall that the TD-Linear algorithm is

      $w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \phi^T(s_{t+1}) w_t - \phi^T(s_t) w_t \big] \phi(s_t).$

    • When $\phi(s_t) = e_{s_t}$, the above algorithm becomes

      $w_{t+1} = w_t + \alpha_t \big( r_{t+1} + \gamma w_t(s_{t+1}) - w_t(s_t) \big) e_{s_t}.$

      • This is a vector equation that merely updates the $s_t$-th entry of $w_t$.
    • Multiplying both sides by $e_{s_t}^T$ gives

      $w_{t+1}(s_t) = w_t(s_t) + \alpha_t \big( r_{t+1} + \gamma w_t(s_{t+1}) - w_t(s_t) \big),$

    • which is exactly the tabular TD algorithm.
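As a quick numerical check of the claim above, the following tiny sketch (my own example, not from the course) shows that one TD-Linear step with one-hot features changes exactly one entry of $w$, just like tabular TD(0).

```python
import numpy as np

num_states = 4

def phi(s):
    """One-hot feature: phi(s) = e_s."""
    e = np.zeros(num_states)
    e[s] = 1.0
    return e

w = np.zeros(num_states)          # plays the role of the value table
gamma, alpha = 0.9, 0.1

# One TD-Linear step for the transition (s=1, r=2.0, s'=3).
s, r, s_next = 1, 2.0, 3
td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
w += alpha * td_error * phi(s)

print(w)   # only w[1] has changed -- identical to a tabular TD(0) update
```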

Choosing the approximator

  1. Linear function or neural network?

    • Linear approximation: $\hat v(s,w)=\phi(s)^\top w$

      Handcrafted features $\phi(s)$ (polynomial, Fourier, tile coding, one-hot, ...) map the state to a feature vector, and then a weight vector $w$ is learned.

      • When to prefer it
        • The state space is small or can be represented well by features;
        • You want interpretability and convergence guarantees (especially in the on-policy case);
        • Compute or data is limited and you need a robust, low-variance learner.
    • Nonlinear approximation (NN): directly learn $\hat v(s,w)=\text{NN}(s;w)$.

      • When to prefer it
        • The raw state is high-dimensional or highly nonlinear (images, text, complex sensors);
        • The goal is end-to-end deep RL;
        • You have enough data and compute, and can afford the tuning cost when training is unstable.
  2. The essence of TD-Linear: semi-gradient + projected fixed point

    • In the linear case, $\nabla_w \hat v(s,w)=\phi(s)$. Substituting into the TD(0) update gives

      $w_{t+1}=w_t+\alpha_t\Big[r_{t+1}+\gamma\,\hat v(s_{t+1},w_t)-\hat v(s_t,w_t)\Big]\,\phi(s_t).$

    • This is a semi-gradient method: the TD target $r_{t+1}+\gamma \hat v(s_{t+1},w_t)$ is treated as a constant when differentiating with respect to $w$ (no gradient flows through the next-state value $\hat v$). Differentiating through the target would give a "full-gradient" algorithm, which in practice is often less stable.

    • Geometrically, it solves the projected Bellman equation

      $\Phi w \approx \Pi_{d_\pi}\, \mathcal T_\pi(\Phi w),$

      • where the column space of $\Phi$ is the function class spanned by the features, and $\Pi_{d_\pi}$ is the least-squares projection weighted by the stationary distribution $d_\pi$.
      • Interpretation: the environment pushes you toward $\mathcal T_\pi v$, but you can only stay inside the "representable" subspace, so the result is projected back onto it.
    • Convergence (on-policy, linear, suitable step sizes): $\mathcal T_\pi$ is a contraction mapping, and under mild conditions semi-gradient TD(0) converges to the projected fixed point above.

  3. Why is the tabular case a special case of linear approximation?

    • Take one-hot features: $\phi(s)=e_s$. Then

      $\hat v(s,w)=e_s^\top w = w(s),$

      • i.e., one parameter per state.
    • Substituting back into TD-Linear:

      $w_{t+1}=w_t+\alpha_t\big(r_{t+1}+\gamma w_t(s_{t+1})-w_t(s_t)\big)e_{s_t}.$

    • Only the $s_t$-th entry is updated; left-multiplying by $e_{s_t}^\top$ gives

      $w_{t+1}(s_t)=w_t(s_t)+\alpha_t\big(r_{t+1}+\gamma w_t(s_{t+1})-w_t(s_t)\big),$

    • which is exactly tabular TD(0).

    • Conclusion: tabular = linear approximation + one-hot features.

  4. Relation to the objective function $J(w)$

    • In the linear on-policy case, semi-gradient TD does not directly minimize

      $J(w)=\mathbb E_{S\sim d_\pi}\big[(v_\pi(S)-\hat v(S,w))^2\big],$

      • but instead approaches the projected Bellman solution. The two are not the same in general, but on many problems the projected solution is both computable and effective.
    • If you really want to minimize $J(w)$ itself, you need access to $v_\pi$ or an MC return that approximates it, which brings you back to the MC + function approximation update

      $w_{t+1}=w_t+\alpha_t\,(g_t-\hat v(s_t,w_t))\,\phi(s_t),$

      • which has higher variance but targets $J(w)$ directly.

Theoretical analysis

  1. The algorithm

    $w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big] \nabla_w \hat v(s_t, w_t)$

    does not minimize the following objective function:

    $J(w) = \mathbb{E}\big[ ( v_\pi(S) - \hat v(S, w) )^2 \big]$

  2. Different objective functions

    • Objective function 1: true value error

      $J_E(w) = \mathbb{E}\big[ ( v_\pi(S) - \hat v(S, w) )^2 \big] = \| \hat v(w) - v_\pi \|^2_D$

    • Objective function 2: Bellman error

      $J_{BE}(w) = \| \hat v(w) - (r_\pi + \gamma P_\pi \hat v(w)) \|^2_D \doteq \| \hat v(w) - T_\pi(\hat v(w)) \|^2_D$

      • where

        $T_\pi(x) \doteq r_\pi + \gamma P_\pi x$

    • Objective function 3: projected Bellman error

      $J_{PBE}(w) = \| \hat v(w) - M T_\pi(\hat v(w)) \|^2_D$

      • where $M$ is a projection matrix.
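For the linear case $\hat v(w)=\Phi w$ with $D=\mathrm{diag}(d_\pi)$, the projection matrix and the point that TD-Linear converges to can be written explicitly. The expressions below are the standard linear-TD results, added here as a supplement (not copied from the lecture), so the notation may differ slightly from the course's:

$M = \Phi\,(\Phi^{\top} D\, \Phi)^{-1} \Phi^{\top} D,$

and requiring $\Phi w = M\,T_\pi(\Phi w)$ leads to a linear system $A w^{*} = b$ with

$A = \Phi^{\top} D\,(I - \gamma P_\pi)\,\Phi, \qquad b = \Phi^{\top} D\, r_\pi, \qquad w^{*} = A^{-1} b,$

assuming $A$ is invertible.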

Differences between the objectives

  1. True value error

    $J_E(w) = \mathbb{E}\big[(v_\pi(S) - \hat v(S, w))^2\big] = \| \hat v(w) - v_\pi \|^2_D$

    • Meaning: directly minimize the gap between the approximate value function $\hat v(s,w)$ and the true value function $v_\pi(s)$.
    • Ideal objective: this is the most natural and intuitive target (analogous to supervised learning).
    • Problem: we do not know $v_\pi(s)$; we can only approximate it indirectly via sampling and the Bellman equation, so this objective cannot be minimized directly.
  2. Bellman error

    $J_{BE}(w) = \| \hat v(w) - T_\pi(\hat v(w)) \|^2_D$

    • where

      $T_\pi(x) = r_\pi + \gamma P_\pi x$

    • Meaning: measures how inconsistent $\hat v(s,w)$ is with the Bellman equation.

      • The fixed point of the Bellman equation is $v_\pi$.
      • If $\hat v(w)$ lay in a "perfect" function space, minimizing the Bellman error would recover the true value function.
      • Problem: when the approximator (e.g., a linear function or neural network) cannot represent $v_\pi$ exactly, directly minimizing the Bellman error may lead to ill-behaved solutions.
  3. Projected Bellman error

    $J_{PBE}(w) = \| \hat v(w) - M T_\pi(\hat v(w)) \|^2_D$

    • where $M$ is a projection matrix that maps the Bellman update $T_\pi(\hat v)$ back into the function approximation space.
    • Meaning: because the approximation space (e.g., the space of linear functions) is limited, $\hat v(w)$ generally cannot satisfy the Bellman equation exactly. So we only require that the projected Bellman update be as close as possible to $\hat v(w)$.
    • Essence: find the approximate solution within the function space that is "closest to the Bellman fixed point".
    • Importance: this is the objective that TD-Linear actually optimizes. The TD update implicitly performs this projection, so the convergence point minimizes the projected Bellman error, not the true value error.
  4. Why does TD-Linear correspond to the projected Bellman error?

    • TD update:

      $w_{t+1} = w_t + \alpha_t \big[ r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t) \big]\nabla_w \hat v(s_t, w_t)$

    • The update uses the TD target $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$, which effectively projects the Bellman update $T_\pi(\hat v)$ back into the function approximation space.

    • Hence TD does not directly minimize $J_E(w)$ or $J_{BE}(w)$; it minimizes the projected Bellman error.

  5. Summary

    • True value error: the ideal objective, but not directly computable.
    • Bellman error: measures inconsistency with the Bellman equation, but can be unstable under function approximation.
    • Projected Bellman error: the objective TD actually minimizes; it yields the most reasonable solution within the approximation space and comes with convergence guarantees.

Sarsa & Q-learning with function approximation

Sarsa with function approximation

Sarsa algorithm

  • So far, we have only considered the problem of state value estimation, i.e., we hope that

    $\hat v \approx v_\pi$

  • To search for optimal policies, we need to estimate action values.

  • The Sarsa algorithm with value function approximation is

    $w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)\Big] \nabla_w \hat q(s_t, a_t, w_t).$

  • This is the same as the algorithm introduced earlier in this lecture, except that $\hat v$ is replaced by $\hat q$.

Pseudocode: Sarsa with function approximation

  • Aim: search for a policy that can lead the agent to the target from an initial state-action pair $(s_0, a_0)$.
  • For each episode, do
    • If the current $s_t$ is not the target state, do
      • Take action $a_t$ following $\pi_t(s_t)$, generate $r_{t+1}, s_{t+1}$, and then take action $a_{t+1}$ following $\pi_t(s_{t+1})$

      • Value update (parameter update):

        $w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)\Big]\nabla_w \hat q(s_t, a_t, w_t)$

      • Policy update:

        $\pi_{t+1}(a|s_t) = 1 - \frac{\varepsilon}{|\mathcal{A}(s)|} (|\mathcal{A}(s)| - 1) \quad \text{if } a = \arg\max_{a \in \mathcal{A}(s_t)} \hat q(s_t, a, w_{t+1})$

        $\pi_{t+1}(a|s_t) = \frac{\varepsilon}{|\mathcal{A}(s)|} \quad \text{otherwise}$

Sarsa with function approximation

  • Formula:

    $w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t).$

  • Meaning:

    • Here $\hat q(s, a, w)$ is the approximate action-value function, represented by parameters $w$ (e.g., a linear function or a neural network).

    • The TD target uses the next action $a_{t+1}$ that is actually executed:

      $r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t)$

    • This means Sarsa is an on-policy algorithm:

      • the behavior policy $\pi$ is used both to generate data (choosing $a_t$, $a_{t+1}$),
      • and to update the value function.
    • The update direction is determined by the TD error:

      $\delta_t = r_{t+1} + \gamma \hat q(s_{t+1}, a_{t+1}, w_t) - \hat q(s_t, a_t, w_t)$

      • and a gradient correction is then applied to the parameters of the current $\hat q(s_t, a_t, w_t)$. A code sketch of this loop with linear features is given below.
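Below is a minimal sketch of this loop (my own illustration; `env.reset`/`env.step` and the state-action feature map `phi_sa` are assumed helpers), combining the value update with the $\varepsilon$-greedy policy update from the pseudocode.

```python
import numpy as np

def epsilon_greedy(q_values, eps):
    """Greedy action with prob. 1 - eps + eps/|A|; otherwise a uniformly random action."""
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def sarsa_linear(env, phi_sa, dim, num_actions, gamma=0.9, alpha=0.05,
                 eps=0.1, num_episodes=500):
    """Sarsa with linear function approximation: q_hat(s, a, w) = phi_sa(s, a)^T w."""
    w = np.zeros(dim)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy([phi_sa(s, b) @ w for b in range(num_actions)], eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(
                [phi_sa(s_next, b) @ w for b in range(num_actions)], eps)
            target = r if done else r + gamma * phi_sa(s_next, a_next) @ w
            td_error = target - phi_sa(s, a) @ w
            w += alpha * td_error * phi_sa(s, a)   # semi-gradient Sarsa update
            s, a = s_next, a_next
    return w
```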

Q-learning with function approximation

Q-learning algorithm

  • Similar to Sarsa, tabular Q-learning can also be extended to the case of value function approximation.

  • The q-value update rule is

    $w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t),$

  • which is the same as Sarsa except that $\hat q(s_{t+1}, a_{t+1}, w_t)$ is replaced by $\max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t)$.

Pseudocode: Q-learning with function approximation (on-policy version)

  • Initialization: initial parameter vector $w_0$; initial policy $\pi_0$; small $\varepsilon > 0$.
  • Aim: search for a good policy that can lead the agent to the target from an initial state-action pair $(s_0, a_0)$.
  • For each episode, do
    • If the current $s_t$ is not the target state, do
      • Take action $a_t$ following $\pi_t(s_t)$, and generate $r_{t+1}, s_{t+1}$

      • Value update (parameter update):

        $w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t)$

      • Policy update:

        $\pi_{t+1}(a|s_t) = 1 - \frac{\varepsilon}{|\mathcal{A}(s)|} (|\mathcal{A}(s)| - 1) \quad \text{if } a = \arg\max_{a \in \mathcal{A}(s_t)} \hat q(s_t, a, w_{t+1})$

        $\pi_{t+1}(a|s_t) = \frac{\varepsilon}{|\mathcal{A}(s)|} \quad \text{otherwise}$

Q-learning with function approximation

  • Formula:

    $w_{t+1} = w_t + \alpha_t \Big[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t) \Big] \nabla_w \hat q(s_t, a_t, w_t).$

  • Meaning:

    • As before, $\hat q(s, a, w)$ is a parameterized Q-function.

    • The TD target uses the maximum over all possible actions in the next state:

      $r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t)$

    • This means Q-learning is an off-policy algorithm:

      • the behavior policy can be exploratory, e.g., $\epsilon$-greedy;
      • but the update assumes the agent always takes the "optimal action", because of the $\max$.
    • The update is still driven by the TD error:

      $\delta_t = r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat q(s_{t+1}, a, w_t) - \hat q(s_t, a_t, w_t)$

      • followed by the parameter update. Compared with the Sarsa sketch above, only the target changes, as shown below.
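A minimal sketch of that single change relative to the Sarsa code above (same assumed helpers `phi_sa` and `num_actions`): the target takes a max over next-state actions instead of using the executed $a_{t+1}$.

```python
import numpy as np

def q_learning_target(w, phi_sa, s_next, r, done, num_actions, gamma=0.9):
    """TD target for Q-learning with linear function approximation."""
    if done:
        return r
    q_next = [phi_sa(s_next, b) @ w for b in range(num_actions)]
    return r + gamma * max(q_next)      # max over actions, not the executed a_{t+1}

# Inside the interaction loop (cf. the Sarsa sketch above):
#   target = q_learning_target(w, phi_sa, s_next, r, done, num_actions, gamma)
#   w += alpha * (target - phi_sa(s, a) @ w) * phi_sa(s, a)
```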

Deep Q-learning

Objective function

  • Definition

    • Deep Q-learning aims to minimize the objective/loss function:

      $J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right],$

      • where $(S, A, R, S')$ are random variables.
    • This is actually the Bellman optimality error.

    • That is because

      $q(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a \Big], \quad \forall s, a$

      • The value of

        $R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)$

      • should be zero in the expectation sense.

  • How to minimize the objective function? Gradient descent!

    • In this objective function

      $J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right],$

      • the parameter $w$ appears not only in $\hat{q}(S, A, w)$ but also in

        $y \doteq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w).$

    • For the sake of simplicity, we can assume that the $w$ in $y$ is fixed (at least for a while) when we calculate the gradient.

Relation between the deep Q-learning objective and the Bellman optimality error

  1. The deep Q-learning objective

    $J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right]$

    • where:
      • $R$: the reward obtained after taking action $A$ in the current state;
      • $S'$: the next state;
      • $\hat{q}(S, A, w)$: the Q-value approximation given by a neural network with parameters $w$;
      • $\max_{a \in \mathcal{A}(S')}$: the Q-value of the best action in the next state.
    • This objective is a mean squared error (MSE): it measures the gap between the network output $\hat{q}(S, A, w)$ and the target value $R + \gamma \max_{a} \hat{q}(S', a, w)$.
  2. Why does it correspond to the Bellman optimality error?

    • The Bellman optimality equation defines the recursive relation satisfied by the optimal Q-values:

      $q(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a' \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big]$

    • In other words, if $\hat{q}$ were optimal, then

      $R + \gamma \max_{a} \hat{q}(S', a, w) - \hat{q}(S, A, w) = 0$

      • would hold exactly in expectation.
    • In practice, however, $\hat{q}$ is only an approximation and cannot satisfy the Bellman equation exactly.

      • We therefore define this residual as the Bellman optimality error and approach the optimal Q-values by minimizing it.
  3. Why is the optimization tricky?

    • In the loss function

      $J(w) = \mathbb{E}\left[\Big(R + \gamma \max_{a} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\right]$

      • the parameter $w$ appears both in the current estimate $\hat{q}(S, A, w)$ and in the target value $R + \gamma \max_{a} \hat{q}(S', a, w)$.
      • This makes the gradient computation awkward, because we would have to differentiate through both the prediction and the target.
    • To simplify the computation, DQN uses a fixed target network:

      • for a period of time, the parameters $w$ used in the target $R + \gamma \max_{a} \hat{q}(S', a, w)$ are frozen;
      • only the parameters of the current Q-network are updated.
    • This avoids the complication of propagating gradients through the target.

Two networks

  • Introduction
    • One is a main network representing $\hat q(s,a,w)$.

    • The other is a target network $\hat q(s,a,w_T)$.

    • The objective function in this case degenerates to

      $J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)^2\Big],$

      • where $w_T$ is the target network parameter.
  • Gradient with a fixed target network
    • When $w_T$ is fixed, the gradient of $J$ can easily be obtained as

      $\nabla_w J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)\nabla_w \hat q(S,A,w)\Big],$

      • up to a constant coefficient of $-2$, which is absorbed into the step size as before.
    • The basic idea of deep Q-learning is to use the gradient-descent algorithm to minimize this objective function.

In DQN, if a single network $\hat q(s,a,w)$ is used both to produce the Q-value estimates and to be updated, training becomes unstable. The reason is that the target value (TD target) and the prediction depend on the same network, so parameter updates interfere with each other.

Solution:

  • Introduce two networks:
    1. Main network: $\hat q(s,a,w)$, used for learning and parameter updates.
    2. Target network: $\hat q(s,a,w_T)$, used to produce relatively stable target values.
      • The parameters $w_T$ are synchronized from $w$ periodically (e.g., copied every $C$ steps).
  • In this way the target value does not change with every update of $w$, which reduces training oscillation.

Objective function:

$J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)^2\Big]$

  • Current Q-value estimate: $\hat q(S,A,w)$
  • Target Q-value (TD target): $R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T)$

Gradient-descent update:

$\nabla_w J = \mathbb{E}\Big[\Big(R+\gamma \max_{a\in \mathcal{A}(S')} \hat q(S',a,w_T) - \hat q(S,A,w)\Big)\nabla_w \hat q(S,A,w)\Big]$

  • When $w_T$ is fixed, the gradient computation is clean and is not perturbed by simultaneous updates of the target value.

Summary

  • The two networks (main & target) solve the problem of unstable target values.

Two techniques

  1. First technique: two networks, a main network and a target network
    • Why is it used?
      • The mathematical reason was explained above when we calculated the gradient.
    • Implementation details:
      • Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively. They are initialized to be the same.
      • In every iteration, we draw a mini-batch of samples $\{(s,a,r,s')\}$ from the replay buffer (explained below).
      • The inputs of the networks include the state $s$ and the action $a$.
        • The target output is

          $y_T \doteq r + \gamma \max_{a\in \mathcal{A}(s')} \hat q(s',a,w_T).$

        • Then, we directly minimize the TD error, i.e., the loss function

          $(y_T - \hat q(s,a,w))^2$

          • over the mini-batch $\{(s,a,y_T)\}$.
  2. Second technique: experience replay
    • Question: what is experience replay?

    • Answer:

      • After collecting some experience samples, we do NOT use them in the order they were collected.
      • Instead, we store them in a set, called the replay buffer, $\mathcal{B} \doteq \{(s,a,r,s')\}$.
      • Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer.
      • The draw of samples, called experience replay, should follow a uniform distribution (why?).
    • Question: why is experience replay necessary in deep Q-learning? And why must the replay follow a uniform distribution?

    • Answer: the answers lie in the objective function.

      $J = \mathbb{E}\left[ \left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2 \right]$

      • $(S, A) \sim d$: the pair $(S, A)$ is an index and is treated as a single random variable.
      • $R \sim p(R|S,A)$, $S' \sim p(S'|S,A)$: $R$ and $S'$ are determined by the system model.
      • The distribution of the state-action pair $(S, A)$ is assumed to be uniform.
      • However, the samples are not collected uniformly, because they are generated consecutively by certain policies.
      • To break the correlation between consecutive samples, we use the experience replay technique and draw samples uniformly from the replay buffer.
      • This is the mathematical reason why experience replay is necessary and why it must be uniform.

Experience replay

Problem:

  • If we update the network using interaction data in the order it is generated, the samples are strongly correlated (e.g., $s_t, s_{t+1}, s_{t+2}$). This violates the random-sampling assumption and can make training unstable or even divergent.

Solution:

  • Introduce a replay buffer $\mathcal{B}$ to store past experiences $(s,a,r,s')$.
  • At each training step, instead of using only the most recent data, draw a random mini-batch to break the correlation between samples.

Mathematical explanation:

  • The objective function is:

    $J = \mathbb{E}\left[ \left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2 \right]$

    • $(S,A) \sim d$: the pair $(S,A)$ is treated as a single random variable.
    • In theory, the distribution of $(S,A)$ is assumed to be uniform.
    • But the collected data is generated by the current policy and is not uniformly distributed (it may concentrate on certain regions).

What experience replay achieves (a small buffer sketch follows this summary):

  1. It breaks sample correlation (avoiding biased gradient updates).
  2. It approximates uniform sampling, so the empirical distribution of $(S,A)$ is closer to the uniform distribution assumed in theory.
  3. It improves data efficiency (the same sample can be reused multiple times).

Summary

  • Experience replay addresses the problems of sample correlation and distribution mismatch.
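A minimal replay buffer can be just a bounded deque with uniform sampling. The following sketch is my own illustration of the idea, not the implementation used in the course.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s', done) tuples with uniform sampling."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive samples.
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # tuple of columns: (s, a, r, s', done)

    def __len__(self):
        return len(self.buffer)
```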

Revisiting the tabular case:

  • Question: why doesn't tabular Q-learning require experience replay?

    • Answer: there is no uniform-distribution requirement.
  • Question: why does deep Q-learning involve a distribution?

    • Answer: the objective function in the deep case is a scalar average over all $(S, A)$.

      The tabular case does not involve any distribution of $S$ or $A$.

      The algorithm in the tabular case aims to solve a set of equations, one for each $(s,a)$ (the Bellman optimality equation).

  • Question: can we use experience replay in tabular Q-learning?

    • Answer: yes, we can, and it is more sample-efficient (why?).

Why tabular Q-learning does not need experience replay while deep Q-learning does

  1. Characteristics of tabular Q-learning

    • Storage: each state-action pair $(s,a)$ has its own table entry $Q(s,a)$.

    • Update: the update is local and only affects the current $(s,a)$:

      $Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]$

    • No distribution requirement

      • In the tabular method we are essentially solving a system of equations (the Bellman optimality equations). As long as every $(s,a)$ is visited, the iterates converge to the optimal $Q^*$ regardless of whether the sampling distribution is uniform.
      • Hence there is no requirement that the sampling distribution be uniform, and no need for experience replay to break sample correlation.
  2. Characteristics of deep Q-learning

    • Storage: the $Q$-values are not stored in a table but approximated by a neural network:

      $Q(s,a;w) \approx Q^*(s,a)$

      • The parameters $w$ are shared, so a single update affects the estimates for all $(s,a)$, not just one table entry.
    • Objective function

      • The deep Q-learning objective is a mean squared error (MSE):

        $J(w) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s',a';w) - Q(s,a;w)\big)^2\Big]$

      • Note: the expectation here is taken over the distribution of the state-action pairs $(s,a)$.

    • Distribution issue

      • If the training samples are highly correlated (e.g., sampled consecutively from the same episode), the network overfits local trajectories and the gradient estimates are strongly biased.
      • The objective implicitly assumes that $(S,A)$ is i.i.d., but in an actual RL environment the samples are sequentially correlated.
  3. Why deep Q-learning needs experience replay

    • Experience replay does two things:
      1. Breaks correlation: uniform sampling from the replay buffer removes the sequential correlation, approximately satisfying the i.i.d. assumption.
      2. Improves sample efficiency: each sample can be reused for multiple updates instead of being discarded after one use.
    • Mathematical explanation
      • Without experience replay, the expectation is effectively taken with

        $(S,A) \sim d_\pi$

      • and this distribution $d_\pi$ depends strongly on the current policy and trajectory, making the gradient estimates unstable.

      • With a replay buffer and uniform sampling, the sampling distribution becomes approximately uniform, which stabilizes training.

  4. Comparison and summary

    • Tabular Q-learning: updates are local; no uniformity of the sampling distribution is needed; covering all $(s,a)$ suffices for convergence.
    • Deep Q-learning: updates are global and depend on the expectation in the objective, so experience replay is needed to keep the sample distribution approximately uniform and avoid biased gradients.

Pseudocode: Deep Q-learning (off-policy version)

  • Aim: learn an optimal target network to approximate the optimal action values from experience samples generated by a behavior policy $\pi_b$.
  • Store the experience samples generated by $\pi_b$ in a replay buffer $\mathcal{B} = \{(s,a,r,s')\}$
    • For each iteration, do

      • Uniformly draw a mini-batch of samples from $\mathcal{B}$

      • For each sample $(s,a,r,s')$, calculate the target value as

        $y_T = r + \gamma \max_{a \in \mathcal{A}(s')} \hat q(s',a,w_T),$

        • where $w_T$ is the parameter of the target network
    • Update the main network to minimize

      $(y_T - \hat q(s,a,w))^2$

      • using the mini-batch $\{(s,a,y_T)\}$
    • Set $w_T = w$ every $C$ iterations
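Putting the target network and the replay buffer together, here is a compact PyTorch sketch of one training iteration. It is my own illustration under assumptions not stated in the course (discrete actions, a small MLP `QNet`, batches already converted to tensors, and the `ReplayBuffer` sketched earlier), not the course's reference code.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP: input is the state vector, output is one Q-value per action."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s):
        return self.net(s)

def dqn_update(main_net, target_net, optimizer, batch, gamma=0.99):
    """One deep Q-learning step on a uniformly drawn mini-batch.

    batch = (s, a, r, s_next, done) as tensors:
      s, s_next: float [B, state_dim]; a: int64 [B]; r, done: float [B].
    """
    s, a, r, s_next, done = batch
    # TD target y_T = r + gamma * max_a q_hat(s', a, w_T), with the target net frozen.
    with torch.no_grad():
        y_T = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    # Current estimate q_hat(s, a, w) from the main network.
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y_T)      # (y_T - q_hat(s, a, w))^2 averaged
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C iterations, synchronize the target network: copy w into w_T.
# target_net.load_state_dict(main_net.state_dict())
```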


Summary

From tabular methods (tabular Q-learning) to function approximation (Sarsa / Q-learning with function approximation) and then to deep reinforcement learning (DQN), the core idea is always to minimize a Bellman error. The differences are:

  • tabular methods solve the equations directly and do not depend on the sample distribution;
  • function approximation introduces gradient descent and projection;
  • deep Q-learning additionally uses a target network and experience replay to stabilize the training of the neural network approximator.