[Book] Reinforcement Learning, Second Edition (English e-book download, GitHub source code), with the table of contents...

Python code: https://github.com/ShangtongZhang/reinforcement-learning-an-introduction

Download of the original English edition: http://incompleteideas.net/book/the-book-2nd.html

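Authors:

Richard S. Sutton is Professor of Computing Science and AITF Chair in Reinforcement Learning and Artificial Intelligence at the University of Alberta, and also Distinguished Research Scientist at DeepMind.

Andrew G. Barto is Professor Emeritus in the College of Computer and Information Sciences at the University of Massachusetts Amherst.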

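Description:

The significantly expanded and updated new edition of a widely used text on reinforcement learning, one of the most active research areas in artificial intelligence.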

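Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning in which an agent tries to maximize the total reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto give a clear and simple account of the field's key ideas and algorithms. This second edition has been significantly expanded and updated, presenting new topics and updating coverage of existing ones.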
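To make the idea of an agent maximizing total reward through interaction concrete, here is a minimal sketch in Python. It is not taken from the book or the linked repository: the function name `run_bandit`, the Gaussian reward noise, and the parameter values are all assumptions chosen for illustration. The agent faces a k-armed bandit, picks actions epsilon-greedily, and updates sample-average estimates incrementally.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a k-armed bandit (illustrative sketch).

    true_means: hypothetical mean reward of each arm, unknown to the agent.
    Returns the agent's value estimates, per-arm pull counts, and total reward.
    """
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k   # the agent's current estimate of each arm's value
    counts = [0] * k        # how many times each arm has been pulled
    total_reward = 0.0

    for _ in range(steps):
        # Agent: explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            action = rng.randrange(k)
        else:
            action = max(range(k), key=lambda a: estimates[a])

        # Environment: return a noisy reward (unit-variance Gaussian, assumed).
        reward = rng.gauss(true_means[action], 1.0)

        # Agent: incremental sample-average update of the chosen arm.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return estimates, counts, total_reward


if __name__ == "__main__":
    estimates, counts, total = run_bandit([0.0, 5.0, 1.0])
    print("estimates:", [round(q, 2) for q in estimates])
    print("pull counts:", counts)
    print("total reward:", round(total, 1))
```

With well-separated arm means, the pull counts should concentrate on the best arm, while the remaining pulls come from exploration.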

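Like the first edition, this second edition focuses on core online learning algorithms, with the more mathematical material set off in shaded boxes. Part I covers as much of reinforcement learning as possible without going beyond the tabular case, for which exact solutions can be found. Many of the algorithms presented in this part are new to the second edition, including UCB, Expected Sarsa, and Double Learning. Part II extends these ideas to function approximation, with new sections on topics such as artificial neural networks and the Fourier basis, and offers a deeper treatment of off-policy learning and policy-gradient methods. Part III adds new chapters on reinforcement learning's relationships to psychology and neuroscience, as well as updated case studies including AlphaGo and AlphaGo Zero, Atari game playing, and IBM Watson's wagering strategy. The final chapter discusses the future societal impacts of reinforcement learning.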
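As a small illustration of one of the algorithms named above, the following sketch shows upper-confidence-bound (UCB) action selection for a bandit. It follows the commonly stated rule of choosing the action that maximizes Q(a) + c * sqrt(ln t / N(a)), treating untried actions as maximizing. The function name `ucb_action` and the default `c=2.0` are assumptions for illustration, not code from the book or the linked repository.

```python
import math


def ucb_action(estimates, counts, t, c=2.0):
    """Pick an action by the upper-confidence-bound rule (illustrative sketch).

    estimates: current value estimate Q(a) for each action.
    counts:    number of times N(a) each action has been selected.
    t:         current time step, starting from 1.
    c:         exploration strength (assumed default; larger explores more).
    """
    # An action that has never been tried is treated as maximizing.
    for action, n in enumerate(counts):
        if n == 0:
            return action

    # Otherwise add an uncertainty bonus that shrinks as N(a) grows.
    def upper_bound(a):
        return estimates[a] + c * math.sqrt(math.log(t) / counts[a])

    return max(range(len(estimates)), key=upper_bound)


if __name__ == "__main__":
    # Arm 1 has a slightly lower estimate but has been tried far less often,
    # so its confidence bonus makes it the UCB choice here.
    print(ucb_action([1.0, 0.9], [100, 1], t=101))
```

Setting `c=0` removes the bonus and reduces the rule to purely greedy selection, which is one way to see what the confidence term contributes.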

Contents

Preface to the Second Edition

Preface to the First Edition

Summary of Notation

Chapter 1 Introduction

1.1 Reinforcement Learning

1.2 Examples

1.3 Elements of Reinforcement Learning

1.4 Limitations and Scope

1.5 An Extended Example: Tic-Tac-Toe

1.6 Summary

1.7 Early History of Reinforcement Learning

Part I Tabular Solution Methods

Chapter 2 Multi-armed Bandits

2.1 A k-armed Bandit Problem

2.2 Action-value Methods

2.3 The 10-armed Testbed

2.4 Incremental Implementation

2.5 Tracking a Nonstationary Problem

2.6 Optimistic Initial Values

2.7 Upper-Confidence-Bound Action Selection

2.8 Gradient Bandit Algorithms

2.9 Associative Search (Contextual Bandits)

2.10 Summary

Chapter 3 Finite Markov Decision Processes

3.1 The Agent–Environment Interface

3.2 Goals and Rewards

3.3 Returns and Episodes

3.4 Unified Notation for Episodic and Continuing Tasks

3.5 Policies and Value Functions

3.6 Optimal Policies and Optimal Value Functions

3.7 Optimality and Approximation

3.8 Summary

Chapter 4 Dynamic Programming

4.1 Policy Evaluation (Prediction)

4.2 Policy Improvement

4.3 Policy Iteration

4.4 Value Iteration

4.5 Asynchronous Dynamic Programming

4.6 Generalized Policy Iteration

4.7 Efficiency of Dynamic Programming

4.8 Summary

Chapter 5 Monte Carlo Methods

5.1 Monte Carlo Prediction

5.2 Monte Carlo Estimation of Action Values

5.3 Monte Carlo Control

5.4 Monte Carlo Control without Exploring Starts

5.5 Off-policy Prediction via Importance Sampling

5.6 Incremental Implementation

5.7 Off-policy Monte Carlo Control

5.8 Discounting-aware Importance Sampling

5.9 Per-decision Importance Sampling

5.10 Summary

Chapter 6 Temporal-Difference Learning

6.1 TD Prediction

6.2 Advantages of TD Prediction Methods

6.3 Optimality of TD(0)

6.4 Sarsa: On-policy TD Control

6.5 Q-learning: Off-policy TD Control

6.6 Expected Sarsa

6.7 Maximization Bias and Double Learning

6.8 Games, Afterstates, and Other Special Cases

6.9 Summary

Chapter 7 n-step Bootstrapping

7.1 n-step TD Prediction

7.2 n-step Sarsa

7.3 n-step Off-policy Learning

7.4 Per-decision Methods with Control Variates

7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

7.6 A Unifying Algorithm: n-step Q(σ)

7.7 Summary

Chapter 8 Planning and Learning with Tabular Methods

8.1 Models and Planning

8.2 Dyna: Integrated Planning, Acting, and Learning

8.3 When the Model Is Wrong

8.4 Prioritized Sweeping

8.5 Expected vs. Sample Updates

8.6 Trajectory Sampling

8.7 Real-time Dynamic Programming

8.8 Planning at Decision Time

8.9 Heuristic Search

8.10 Rollout Algorithms

8.11 Monte Carlo Tree Search

8.12 Summary of the Chapter

8.13 Summary of Part I: Dimensions

Part II Approximate Solution Methods

Chapter 9 On-policy Prediction with Approximation

9.1 Value-function Approximation

9.2 The Prediction Objective (VE)

9.3 Stochastic-gradient and Semi-gradient Methods

9.4 Linear Methods

9.5 Feature Construction for Linear Methods

9.5.1 Polynomials

9.5.2 Fourier Basis

9.5.3 Coarse Coding

9.5.4 Tile Coding

9.5.5 Radial Basis Functions

9.6 Selecting Step-Size Parameters Manually

9.7 Nonlinear Function Approximation: Artificial Neural Networks

9.8 Least-Squares TD

9.9 Memory-based Function Approximation

9.10 Kernel-based Function Approximation

9.11 Looking Deeper at On-policy Learning: Interest and Emphasis

9.12 Summary

Chapter 10 On-policy Control with Approximation

10.1 Episodic Semi-gradient Control

10.2 Semi-gradient n-step Sarsa

10.3 Average Reward: A New Problem Setting for Continuing Tasks

10.4 Deprecating the Discounted Setting

10.5 Differential Semi-gradient n-step Sarsa

10.6 Summary

Chapter 11 Off-policy Methods with Approximation

11.1 Semi-gradient Methods

11.2 Examples of Off-policy Divergence

11.3 The Deadly Triad

11.4 Linear Value-function Geometry

11.5 Gradient Descent in the Bellman Error

11.6 The Bellman Error is Not Learnable

11.7 Gradient-TD Methods

11.8 Emphatic-TD Methods

11.9 Reducing Variance

11.10 Summary

Chapter 12 Eligibility Traces

12.1 The λ-return

12.2 TD(λ)

12.3 n-step Truncated λ-return Methods

12.4 Redoing Updates: Online λ-return Algorithm

12.5 True Online TD(λ)

12.6 Dutch Traces in Monte Carlo Learning

12.7 Sarsa(λ)

12.8 Variable λ and γ

12.9 Off-policy Traces with Control Variates

12.10 Watkins's Q(λ) to Tree-Backup(λ)

12.11 Stable Off-policy Methods with Traces

12.12 Implementation Issues

12.13 Conclusions

Chapter 13 Policy Gradient Methods

13.1 Policy Approximation and its Advantages

13.2 The Policy Gradient Theorem

13.3 REINFORCE: Monte Carlo Policy Gradient

13.4 REINFORCE with Baseline

13.5 Actor–Critic Methods

13.6 Policy Gradient for Continuing Problems

13.7 Policy Parameterization for Continuous Actions

13.8 Summary

Part III Looking Deeper

Chapter 14 Psychology

14.1 Prediction and Control

14.2 Classical Conditioning

14.2.1 Blocking and Higher-order Conditioning

14.2.2 The Rescorla–Wagner Model

14.2.3 The TD Model

14.2.4 TD Model Simulations

14.3 Instrumental Conditioning

14.4 Delayed Reinforcement

14.5 Cognitive Maps

14.6 Habitual and Goal-directed Behavior

14.7 Summary

Chapter 15 Neuroscience

15.1 Neuroscience Basics

15.2 Reward Signals, Reinforcement Signals, Values, and Prediction Errors

15.3 The Reward Prediction Error Hypothesis

15.4 Dopamine

15.5 Experimental Support for the Reward Prediction Error Hypothesis

15.6 TD Error/Dopamine Correspondence

15.7 Neural Actor–Critic

15.8 Actor and Critic Learning Rules

15.9 Hedonistic Neurons

15.10 Collective Reinforcement Learning

15.11 Model-based Methods in the Brain

15.12 Addiction

15.13 Summary

Chapter 16 Applications and Case Studies

16.1 TD-Gammon

16.2 Samuel's Checkers Player

16.3 Watson's Daily-Double Wagering

16.4 Optimizing Memory Control

16.5 Human-level Video Game Play

16.6 Mastering the Game of Go

16.6.1 AlphaGo

16.6.2 AlphaGo Zero

16.7 Personalized Web Services

16.8 Thermal Soaring

Chapter 17 Frontiers

17.1 General Value Functions and Auxiliary Tasks

17.2 Temporal Abstraction via Options

17.3 Observations and State

17.4 Designing Reward Signals

17.5 Remaining Issues

17.6 Reinforcement Learning and the Future of Artificial Intelligence

References

Index