(其实是专业课作业🤣 感觉算法岗面试可能会问,来存一下档)
目录
- [问题:证明 Value Iteration 收敛性](#问题:证明 Value Iteration 收敛性)
- [0 Definitions - 定义](#0 Definitions - 定义)
- [1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射](#1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射)
- [2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数](#2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数)
- [Reference - 参考资料](#Reference - 参考资料)
问题:证明 Value Iteration 收敛性
Please prove the convergence of Value Iteration algorithm in discounted reward setting, and analyze the performance between resulting policy and optimal policy. 证明 Value Iteration 的收敛性。
背景:
- Policy Iteration:先得到一个 policy,然后算出它的 value function,基于该 value function 取 max_a Q(s,a) 作为新 policy,再算新 policy 的 value function...
- Value Iteration:不停对 value function 使用贝尔曼算子,新 V = max_a [r(s,a) + γV(s')]。
0 Definitions - 定义
First, we define
- Discrete state space \(S=\{s_1,\cdots,s_n\}\), discrete action space \(A=\{a_1,\cdots,a_n\}\).
- Value function: \(V(s)=E_{a\sim\pi(a|s)}\big[r(s,a) + \gamma E_{s'\sim p(s'|s,a)}V(s')\big]\) .
- Bellman operator: for all \(s\in S\) ,
\[\begin{aligned} B_\pi V(s) &\triangleq E_{a\sim\pi(a|s)}\big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V(s') \big] \\ B_* V(s) &\triangleq \max_a \big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}V(s') \big] \end{aligned} \]
The Value Iteration algorithm is to conduct Bellman operator \(B_*\) on the value function \(V(s)\). We want to prove that the sequence will converge to the optimal value function \(V^*\), which should be the of the Bellman equation \(B_*V=V\) .
我们定义了状态空间、动作空间、值函数(value function)和贝尔曼算子(Bellman operator)。VI 算法,就是对 value function 一直使用 Bellman operator,希望最后能收敛到最优的 value function。这样,该最优 value function 必然是 Bellman equation \(B_*V=V\) 的不动点。
1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射
Then, we need to prove that Bellman Operator is a Contraction Mapping. The definition of Contraction Mapping is as follows:
\[d(f(x_1),f(x_2))\le\gamma d(x_1,x_2), \]
where \(d\) is a measurement over a metric space, \(\gamma\in(0,1)\), and \(f\) is the Contraction Mapping. In our case, the definition of Contraction Mapping is as follows:
\[|f(x_1)-f(x_2)|\infty \le \gamma|x_1-x_2|\infty, \]
where \(||_\infty\) is the infinite norm (return the max value over all dimensions of a vector).
接下来,证明贝尔曼算子是压缩映射(Contraction Mapping)。在这里,压缩映射的定义是,两个变量的无穷范数(向量各维之差的最大绝对值)会被 \(f\) 映射压缩 \(\gamma\) 倍。
Let's prove that Bellman Operator \(B_*\) is a Contraction Mapping. The proof is as follows.
\[\begin{aligned} &|B_*V_1(s)-B_*V_2(s)| \\ &=\bigg|\max_a\big[r(s,a)+\gamma E_{s'~p(s'|s,a)}V_1(s')\big] - \max_a\big[r(s,a)+\gamma E_{s}V_2(s')\big] \bigg| \\ &\le \bigg| \max_a[r(s,a) + \gamma E_{s'}V_1(s') - r(s,a) - \gamma E_{s'}V_2(s')]\bigg| \\ &= \bigg|\gamma\max_a E_{s'}(V_1(s')-V_2(s'))\bigg| \\ &\le \gamma\max_a\bigg|E_{s'}(V_1(s')-V_2(s'))\bigg| \\ &= \gamma\max_a \bigg|\sum_{s'}p(s'|s,a)\big[V_1(s')-V_2(s')\big]\bigg| \\ &\le \gamma\max_a \sum_{s'}p(s'|s,a)\bigg|V_1(s')-V_2(s')\bigg| \\ &= \gamma \sum_{s'}p(s'|s,a_*(s))\bigg|V_1(s')-V_2(s')\bigg| \\ &\le \gamma \sum_{s'}p(s'|s,a_*(s))\max_s\bigg|V_1(s)-V_2(s)\bigg| \\ &= \gamma|V_1-V_2|_\infty \end{aligned} \]
贝尔曼算子是压缩映射的证明:一连串放缩,如上。
2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数
If Bellman Operator is a Contraction Mapping, then we can use the Banach fixed point theorem to prove that, the optimal value function \(V^*\) is the unique fixed point of the Bellman Operator \(B_*\), and the sequence \(B_*(B_*(\cdots B_*V))\) converges to \(V^*\).
如果贝尔曼算子是压缩映射,则可以用巴拿赫不动点定理(Banach fixed point theorem),证明 \(V^*\) 是 Bellman Equation \(B_*V=V\) 的唯一不动点,并且一定会收敛到该不动点。
The first step is to prove that the fixed point \(V^*\) is unique. Use proof by contradiction. Suppose there are two converged value \(V_1^*,V_2^*\), then we have \(|B_*V_1^*-B_*V_2^*|\infty=|V_1^*-V_2^*|\infty\). However, Bellman operator \(B_*\) is a Contraction Mapping, so there is \(|B_*V_1^*-B_*V_2^*|\infty\le\gamma|V_1^*-V_2^*|\infty\), which contradicts the previous one. Thus, there can be only one converged value, and it is obviously the fixed point \(V^*\).
第一步,证明收敛到的不动点是唯一的。使用反证法,假设有两个不动点,那么它们俩的压缩映射都映射到原地(不动点嘛),距离没有 γ 倍压缩,矛盾,从而得证。
The second step is to prove that the fixed point \(V^*\) really exists. If we can prove that \(\{V,B_*V,B_*^2V,\cdots\}\) is a Cauchy sequence, the fixed point \(V^*\) exists. Take two value \(B_*^a,B_*^b\) from the sequence, where \(a\gg b\). Use the triangle inequality of \(||_\infty\) as a measurement, we can get
\[\begin{aligned} |B_*^a-B_*^b|\infty &\le |B*^a-B_*^{a-1}|\infty + |B*^{a-1}-B_*^b|\infty \\ &\le |B*^a-B_*^{a-1}|\infty + |B*^{a-1}-B_*^{a-2}|\infty + |B*^{a-2}-B_*^b|\infty \\ &\le |B*^a-B_*^{a-1}|\infty + \cdots + |B*^{b+1}-B_*^b|\infty \\ &\le \gamma^b|B*V-V|\sum_{i=0}^{a-b-1}\gamma^i \le \frac{\gamma^b}{1-\gamma}|B_*V-V| \end{aligned}. \]
So, if we can choose a large enough \(b\), the value of \(|B_*^a-B_*^b|\infty\) can be less than any \(\epsilon\gt0\). Thus, there must exists an \(N\in\mathbb N\), such that \(|B*^a-B_*^b|_\infty\lt\epsilon\), for all \(a,b\gt N\) . The sequence is a Cauchy sequence, and the fixed point \(V^*\) exists.
第二步,证明不动点 \(V^*\) 存在。去证 \(\{V,B_*V,B_*^2V,\cdots\}\) 是柯西序列(Cauchy sequence),如果是柯西序列,那么就存在收敛的不动点。柯西序列的定义是,对于序列里的 \(x_a,x_b\),存在正整数 N 使得对于任意 \(a,b\gt N\),度量 \(d(x_a,x_b)\lt\epsilon\),其中 \(\epsilon\) 是任意小的正实数。先使用三角不等式,将 \(x_a-x_b\) 的形式拆成 \((x_a-x_{a-1})+(x_{a-1}-x_{a-2})+\cdots+(x_{b+1}-x_b)\),然后使用压缩映射的性质,放缩成 γ 的等比序列求和,即可得到,它小于任意的小实数。
Reference - 参考资料
- A blog from Zhihu: https://zhuanlan.zhihu.com/p/419208786
- 非常感谢!很清晰的博客 😊🌹