Introduction to Deep Learning (9) - Reinforcement Learning

Reinforcement Learning

An agent performs actions in an environment and receives rewards.

Goal: Learn how to take actions that maximize reward.

Stochasticity: Rewards and state transitions may be random

Credit assignment: Reward $r_t$ may not directly depend on action $a_t$

Nondifferentiable: Can't backprop through the world

Nonstationary: What the agent experiences depends on how it acts

Markov Decision Process (MDP)

Mathematical formalization of the RL problem: a tuple $(S, A, R, P, \gamma)$

$S$: Set of possible states

$A$: Set of possible actions

$R$: Distribution of reward given a (state, action) pair

$P$: Transition probability: distribution over the next state given a (state, action) pair

$\gamma$: Discount factor (trade-off between future and present rewards)

Markov Property: The current state completely characterizes the state of the world. Rewards and next states depend only on current state, not history.
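
To make the tuple concrete, here is a tiny hand-made example MDP (all states, actions, probabilities, and rewards below are invented for illustration, not from the notes):

```python
# A toy 2-state MDP written out explicitly (all numbers are made up for illustration).
S = ["s0", "s1"]            # states
A = ["stay", "move"]        # actions
gamma = 0.9                 # discount factor

# P[(s, a)] is a distribution over next states; R[(s, a)] is the expected reward.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}
```

The same dictionaries are reused in the value-iteration sketch further below.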

The agent executes a policy $\pi$ giving a distribution over actions conditioned on states.

Goal: Find the best policy that maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$

Because rewards and state transitions are random, we maximize the expected sum of rewards rather than the return of any single rollout.
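
Written out in the standard form, the goal is the policy that maximizes the expected discounted return, where the expectation is over the randomness of both the policy and the environment:

$$
\pi^* = \arg\max_{\pi} \; \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, \pi\right],
\quad \text{with } a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t),\; r_t \sim R(\cdot \mid s_t, a_t)
$$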

Value function $V^{\pi}(s)$: expected cumulative reward from following policy $\pi$ starting from state $s$

Q function $Q^{\pi}(s,a)$: expected cumulative reward from taking action $a$ in state $s$ and then following policy $\pi$
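
Formally, both are conditional expectations of the discounted return:

$$
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]
$$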

Bellman Equation

After taking action $a$ in state $s$, we get reward $r$ and move to a new state $s'$. After that, the maximum possible reward we can get is $\max_{a'} Q^*(s',a')$.
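
This recursive relationship between $Q^*$ at the current step and $Q^*$ at the next step is the Bellman optimality equation:

$$
Q^*(s,a) = \mathbb{E}_{r,\,s'}\!\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\right]
$$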

Idea: if we find a function that satisfies the Bellman equation, then it must be the optimal $Q^*$.

Start with a random $Q$, and use the Bellman equation as an update rule (a sketch follows below).
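
A minimal Python sketch of this tabular Q-value iteration, reusing the toy MDP dictionaries (`S`, `A`, `P`, `R`, `gamma`) from the example above; the iteration count is arbitrary:

```python
# Tabular Q-value iteration: repeatedly apply the Bellman update to every (s, a) pair.
Q = {(s, a): 0.0 for s in S for a in A}    # start from an arbitrary (here: zero) Q

for _ in range(100):                        # iterate until (approximately) converged
    Q_new = {}
    for (s, a) in Q:
        # Bellman update: expected reward plus discounted value of the best next action.
        expected_next = sum(p * max(Q[(s2, a2)] for a2 in A)
                            for s2, p in P[(s, a)].items())
        Q_new[(s, a)] = R[(s, a)] + gamma * expected_next
    Q = Q_new

print(Q)  # the greedy policy picks argmax_a Q[(s, a)] in each state
```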

But if the state space is large or infinite, we cannot iterate over all (state, action) pairs.

Instead, approximate $Q(s, a)$ with a neural network and use the Bellman equation error as the loss function.

-> Deep Q-Learning
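
A minimal PyTorch sketch of the deep Q-learning idea, with the squared Bellman error as the loss. The network size, the batch of random transitions, and the hyperparameters are placeholders invented for the example; a real implementation would also use a replay buffer and a separate target network.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: maps a state vector to one Q-value per action.
state_dim, num_actions = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_loss(states, actions, rewards, next_states, dones):
    """Squared Bellman error: (r + gamma * max_a' Q(s', a') - Q(s, a))^2."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():                        # the target is treated as a constant
        q_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * (1 - dones) * q_next
    return ((q_sa - target) ** 2).mean()

# One gradient step on a fake batch of transitions (random data, for shape only).
batch = 32
states      = torch.randn(batch, state_dim)
actions     = torch.randint(0, num_actions, (batch,))
rewards     = torch.randn(batch)
next_states = torch.randn(batch, state_dim)
dones       = torch.zeros(batch)

loss = dqn_loss(states, actions, rewards, next_states, dones)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```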

Policy Gradients

Train a network $\pi_{\theta}(a \mid s)$ that takes the state as input and gives a distribution over which action to take.

Objective function: expected future rewards when following policy $\pi_{\theta}$

Use gradient ascent -> since we cannot backprop through the environment, we need a trick to estimate the gradient.
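
The standard trick is the log-derivative (score-function / REINFORCE) identity, which rewrites the gradient of the expected return as an expectation we can estimate by sampling trajectories $\tau$ from the current policy:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\left(\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right],
\qquad R(\tau) = \sum_{t} \gamma^t r_t
$$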

Other approaches:

Actor-Critic

Model-Based

Imitation Learning

Inverse Reinforcement Learning

Adversarial Learning

...

Stochastic computation graphs
