Introduction to Deep Learning (9) - Reinforcement Learning

Reinforcement Learning

An agent performs actions in an environment and receives rewards.

Goal: learn how to take actions that maximize reward.

Stochasticity: Rewards and state transitions may be random

Credit assignment: Reward $r_t$ may not directly depend on action $a_t$

Nondifferentiable: Can't backprop through the world

Nonstationary: What the agent experiences depends on how it acts

Markov Decision Process (MDP)

Mathematical formalization of the RL problem: a tuple $(S, A, R, P, \gamma)$

$S$: Set of possible states

$A$: Set of possible actions

$R$: Distribution of reward given (state, action) pair

$P$: Transition probability: distribution over next state given (state, action) pair

$\gamma$: Discount factor (trade-off between future and present rewards)

Markov Property: The current state completely characterizes the state of the world. Rewards and next states depend only on current state, not history.

The agent executes a policy $\pi$, which gives a distribution over actions conditioned on the state.

Goal: Find the best policy, the one that maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$
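As a small illustration of the discounted sum, here is a minimal sketch in Python. The function name `discounted_return` and the example reward list are made up for this note, and the sum is assumed to start at $t = 0$.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], with t starting at 0."""
    total = 0.0
    # Walk backwards so each step is one multiply-add:
    # G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode with three rewards.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

A smaller $\gamma$ shrinks the contribution of later rewards, so the agent is pushed toward rewards that arrive sooner.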

Because rewards and state transitions are random, the sum itself is a random quantity, so we maximize its expected value instead.

Value function $V^{\pi}(s)$: expected cumulative reward from following policy $\pi$ starting from state $s$

Q function $Q^{\pi}(s,a)$: expected cumulative reward from taking action $a$ in state $s$ and then following policy $\pi$
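Written out, the two definitions might look as follows. This is a sketch that assumes time starts at $t = 0$ and that the expectation is taken over the randomness in the policy, the rewards and the transitions; the subscripted symbols $s_0$ and $a_0$ are notation introduced here for the starting state and first action.

$$
V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]
$$

$$
Q^{\pi}(s,a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]
$$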

Bellman Equation

After taking action $a$ in state $s$, we get reward $r$ and move to a new state $s'$. After that, the maximum possible reward we can get is $\max_{a'} Q^*(s',a')$
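This reasoning is usually summarized in one line. In the sketch below, $Q^*$ is assumed to denote the optimal Q function, $Q^*(s,a) = \max_{\pi} Q^{\pi}(s,a)$, and the expectation is over the reward $r \sim R(s,a)$ and the next state $s' \sim P(s,a)$ from the MDP definition:

$$
Q^*(s,a) = \mathbb{E}_{r,\, s'}\left[\, r + \gamma \max_{a'} Q^*(s',a') \,\right]
$$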

Idea: if we find a function that satisfies the Bellman equation, then it must be optimal.

Start with a random $Q$ and use the Bellman equation as an update rule.
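A minimal sketch of this update rule on a tiny MDP is shown below. The three-state chain, its rewards, and the names `step` and `q_value_iteration` are all made up for illustration; the dynamics are deterministic, so the expectation in the Bellman equation reduces to a single term.

```python
import random

GAMMA = 0.9
STATES = [0, 1, 2]   # hypothetical chain; state 2 is terminal
ACTIONS = [0, 1]     # 0 = stay, 1 = move right


def step(s, a):
    """Toy deterministic dynamics: returns (reward, next_state)."""
    if s == 2:
        return 0.0, 2                      # terminal state absorbs, no reward
    if a == 1:
        next_s = s + 1
        return (1.0 if next_s == 2 else 0.0), next_s
    return 0.0, s                          # staying gives no reward


def q_value_iteration(num_iters=300, seed=0):
    rng = random.Random(seed)
    # Start from a random Q table.
    q = {(s, a): rng.uniform(-1.0, 1.0) for s in STATES for a in ACTIONS}
    for _ in range(num_iters):
        new_q = {}
        for s in STATES:
            for a in ACTIONS:
                r, s2 = step(s, a)
                # Bellman equation used as an update rule.
                new_q[(s, a)] = r + GAMMA * max(q[(s2, a2)] for a2 in ACTIONS)
        q = new_q
    return q


q = q_value_iteration()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
print(greedy[0], greedy[1])  # the greedy action in states 0 and 1
```

Because the update is a contraction for $\gamma < 1$, the random starting values wash out and the table settles on the same fixed point regardless of the seed.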

But if the state space is large or infinite, we can't iterate over every state.

Approximate $Q(s, a)$ with a neural network, and use the Bellman equation to define the loss function.

-> Deep Q-learning
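The loss can be sketched as follows. To keep the example dependency-free, a linear function of a feature vector stands in for the neural network, so this is Q-learning with function approximation rather than a real deep network; the feature vectors, the learning rate and all function names are assumptions made for illustration. The target $y = r + \gamma \max_{a'} Q(s',a')$ is treated as a constant when taking the gradient.

```python
def q_value(w, phi, a):
    """Linear stand-in for the network: Q(s, a) = w[a] . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w[a], phi))


def td_target(w, r, phi_next, done, gamma):
    """Bellman target y = r + gamma * max_a' Q(s', a'); just r at episode end."""
    if done:
        return r
    return r + gamma * max(q_value(w, phi_next, a) for a in range(len(w)))


def td_loss(w, phi, a, target):
    """Squared error between the current estimate and the Bellman target."""
    return (q_value(w, phi, a) - target) ** 2


def sgd_step(w, phi, a, target, lr):
    """One gradient step on td_loss, holding the target fixed."""
    err = q_value(w, phi, a) - target
    w[a] = [wi - lr * 2.0 * err * xi for wi, xi in zip(w[a], phi)]


# Hypothetical transition: features of s and s', action 1, reward 1.
w = [[0.0, 0.0], [0.0, 0.0]]          # one weight vector per action
phi, phi_next = [1.0, 0.0], [0.0, 1.0]
y = td_target(w, r=1.0, phi_next=phi_next, done=False, gamma=0.9)
loss_before = td_loss(w, phi, 1, y)
sgd_step(w, phi, 1, y, lr=0.1)
loss_after = td_loss(w, phi, 1, y)
print(loss_before, loss_after)        # the loss on this transition shrinks
```

In a real deep Q-learning setup the same target and squared-error loss would be applied to a neural network, with the gradient step taken by an optimizer over batches of transitions.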

Policy Gradients

Train a network $\pi_{\theta}(a \mid s)$ that takes the state as input and gives a distribution over which action to take

Objective function: Expected future rewards when following policy $\pi_{\theta}$

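Use gradient ascent -> we can't backprop through the environment, so a trick is needed to get a gradient of the expected reward with respect to $\theta$ (commonly the log-derivative trick, as in REINFORCE)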
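One common form of the trick is the score-function (REINFORCE) estimator, which uses the identity $\nabla_{\theta} \mathbb{E}[r] = \mathbb{E}[\, r \, \nabla_{\theta} \log \pi_{\theta}(a) \,]$ so that only the policy, not the environment, needs to be differentiated. The sketch below applies it to a made-up two-armed bandit with a single state and a softmax policy over a parameter vector `theta`; the setup and the name `reinforce` are assumptions for illustration, not the method from the notes.

```python
import math
import random


def softmax(theta):
    m = max(theta)                              # subtract max for stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce(rewards, steps=2000, lr=0.1, seed=0):
    """Gradient ascent on expected reward using sampled actions only."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current policy.
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        r = rewards[a]
        for k in range(len(theta)):
            # For a softmax policy: d/d theta_k log pi(a) = 1[k == a] - pi_k
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * r * grad_log       # ascent step
    return softmax(theta)


# Hypothetical bandit: arm 0 pays 0, arm 1 pays 1.
probs = reinforce([0.0, 1.0])
print(probs)  # probability mass should move toward arm 1
```

The environment is only ever sampled, never differentiated, which is what lets this approach work when the reward is not a differentiable function of the action.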

Other approaches:

Actor-Critic

Model-Based

Imitation Learning

Inverse Reinforcement Learning

Adversarial Learning

...

Stochastic computation graphs
