Why can static 3DGS plus trajectory replay be used to train end-to-end autonomous driving with reinforcement learning?

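We usually think of static 3DGS as the background only, and during trajectory replay the obstacles cannot interact with the ego vehicle. Yet the following two papers still train with reinforcement learning (RL).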

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

ParkingWorld: End-to-End Autonomous Parking Reinforcement Learning from Corrective Experience in 3DG

I use RAD's reward model for the analysis:

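3.4 Reward Modeling

The reward is the source of the training signal and determines the optimization direction of reinforcement learning (RL). The reward function is designed to guide the ego vehicle's behavior by penalizing unsafe actions and encouraging alignment with the expert trajectory. It consists of four reward components: (1) collision with dynamic obstacles, (2) collision with static obstacles, (3) positional deviation from the expert trajectory, and (4) heading deviation from the expert trajectory: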

$$R = \{r_{dc}, r_{sc}, r_{pd}, r_{hd}\}. \tag{4}$$

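As shown in Figure 4, these reward components are triggered under specific conditions. In the 3DGS environment, a dynamic collision is detected if the ego vehicle's bounding box overlaps with the annotated bounding box of a dynamic obstacle, which triggers a negative reward $r_{dc}$. Similarly, a static collision is identified when the ego vehicle's bounding box overlaps with the Gaussians of a static obstacle, resulting in a negative reward $r_{sc}$. Positional deviation is measured as the Euclidean distance between the ego vehicle's current position and the closest point on the expert trajectory. A deviation beyond a predefined threshold $d_{max}$ incurs a negative reward $r_{pd}$. Heading deviation is computed as the angular difference between the ego vehicle's current heading angle $\psi_t$ and the matched heading angle of the expert trajectory $\psi_{expert}$. A deviation beyond a threshold $\psi_{max}$ results in a negative reward $r_{hd}$.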

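Any of these events (dynamic collision, static collision, excessive positional deviation, or excessive heading deviation) immediately terminates the episode, because after such an event the 3DGS environment typically produces noisy sensor data, which is detrimental to RL training.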
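The four trigger conditions and the termination rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not RAD's implementation: boxes are axis-aligned rather than oriented, static obstacles are approximated by boxes instead of Gaussians, the "matched" expert heading is taken from the nearest expert point, and the threshold and penalty values (`d_max`, `psi_max`, `penalty`) are placeholders that the quoted text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned 2D box (a simplification; real vehicle boxes are oriented)."""
    cx: float
    cy: float
    half_len: float
    half_wid: float


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes intersect."""
    return (abs(a.cx - b.cx) <= a.half_len + b.half_len
            and abs(a.cy - b.cy) <= a.half_wid + b.half_wid)


def wrap_angle(x: float) -> float:
    """Wrap an angle in radians to [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi


def step_reward(ego: Box, ego_heading: float, dynamic_boxes, static_boxes,
                expert_path, d_max=2.0, psi_max=math.radians(40.0),
                penalty=-1.0):
    """Return (rewards, done). expert_path is a list of (x, y, heading).

    d_max, psi_max and penalty are assumed placeholder values.
    """
    # Closest expert point; its heading stands in for the "matched" heading.
    nearest = min(expert_path,
                  key=lambda p: math.hypot(p[0] - ego.cx, p[1] - ego.cy))
    pos_dev = math.hypot(nearest[0] - ego.cx, nearest[1] - ego.cy)
    head_dev = abs(wrap_angle(ego_heading - nearest[2]))

    rewards = {
        "r_dc": penalty if any(boxes_overlap(ego, b) for b in dynamic_boxes) else 0.0,
        "r_sc": penalty if any(boxes_overlap(ego, b) for b in static_boxes) else 0.0,
        "r_pd": penalty if pos_dev > d_max else 0.0,
        "r_hd": penalty if head_dev > psi_max else 0.0,
    }
    # Any triggered component ends the episode immediately.
    done = any(v < 0 for v in rewards.values())
    return rewards, done
```

Because every component is a penalty with a hard threshold, the episode simply stops as soon as the ego leaves the region where the rendered views are still trustworthy.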

With this reward function, it is entirely possible to run RL training in the following framework:

Static 3DGS Scene
       +
Trajectory Replay
       +
RL Agent

However, it is important to understand what kind of RL problem you are actually solving.


What your environment really is

Your environment dynamics are:

Dynamic vehicles:
      fixed replay

Pedestrians:
      fixed replay

Traffic:
      fixed replay

Ego:
      controlled by RL

So:

$$s_{t+1} = f(s_t, a_t)$$

still exists.

The ego vehicle's future state depends on its actions.

The difference is that other agents do not react to the ego.

They follow prerecorded trajectories.
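One way to write this down, using notation that is introduced here for illustration only ($e_t$ for the ego state, $o_t$ for the other agents' states, $\tau_{\text{log}}$ for the recorded log), is to split the transition into an ego part driven by the action and a replay part driven by time alone:

```latex
\begin{aligned}
s_t     &= (e_t,\; t), \\
e_{t+1} &= f_{\text{ego}}(e_t, a_t), \\
o_t     &= \tau_{\text{log}}(t), \\
s_{t+1} &= f(s_t, a_t) = \bigl(f_{\text{ego}}(e_t, a_t),\; t+1\bigr), \\
r_t     &= R(e_t, o_t).
\end{aligned}
```

Since $o_t$ is a fixed function of $t$, the pair $(e_t, t)$ is enough to determine both the next state and the reward, so the Markov property holds even though the other agents never respond to $a_t$.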


This is a valid MDP

Many papers call this:

  • Open-loop traffic replay
  • Log replay simulation
  • Reactive ego / non-reactive world

The RL agent can still learn:

Steer
Brake
Accelerate

because:

action
   ↓
ego trajectory changes
   ↓
reward changes

For example:

Replay vehicle:
    -------------------->

Ego:
    accelerate

Result:

collision

Reward:

rdc < 0

The policy receives a training signal.
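The accelerate-then-collide example can be run as a toy rollout. The sketch below is hypothetical throughout: the recorded log `REPLAY_LOG`, the 1D ego dynamics, the collision radius, and the penalty value are made up for illustration. It shows only that two different ego actions against the same replayed car lead to different returns.

```python
# Hypothetical recorded log: (x, y) of one replayed car per time step.
# It crosses the ego lane (y = 0) at x = 20 around step 5.
REPLAY_LOG = [(20.0, -10.0 + 2.0 * t) for t in range(12)]

R_DC = -1.0        # assumed dynamic-collision penalty
HIT_RADIUS = 2.5   # assumed collision distance per axis, in metres


class ReplayEnv:
    """Ego moves along y = 0; the other car is read from the log by time only."""

    def __init__(self, log, ego_speed=3.0):
        self.log = log
        self.t = 0
        self.x = 0.0
        self.v = ego_speed

    def step(self, action):
        # Ego: the only part of the world that depends on the action.
        delta = 1.0 if action == "accelerate" else -1.0
        self.v = max(0.0, self.v + delta)
        self.x += self.v
        # Replayed car: indexed by time, independent of the ego.
        self.t += 1
        car_x, car_y = self.log[self.t]
        hit = abs(self.x - car_x) < HIT_RADIUS and abs(car_y) < HIT_RADIUS
        reward = R_DC if hit else 0.0
        done = hit or self.t == len(self.log) - 1
        return reward, done


def rollout(action):
    """Apply the same action every step and return the episode return."""
    env = ReplayEnv(REPLAY_LOG)
    total, done = 0.0, False
    while not done:
        reward, done = env.step(action)
        total += reward
    return total


print(rollout("accelerate"))  # -1.0 (ego reaches x = 22 as the car crosses)
print(rollout("brake"))       # 0.0  (ego stops at x = 3, far from the crossing)
```

The difference between the two returns is exactly the signal a policy-gradient method would use, even though the replayed car never changes its path.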


Why your reward design works

Your reward contains:

rdc  Dynamic collision
rsc  Static collision
rpd  Position deviation
rhd  Heading deviation

These define a constrained corridor around the expert trajectory.

Effectively the RL agent learns:

Stay safe
Stay near expert
Maintain correct heading

This is very similar to:

  • imitation RL
  • offline RL fine-tuning
  • autonomous-driving RL benchmarks

What the RL agent can learn

It can learn:

✓ smoother steering

✓ throttle control

✓ brake timing

✓ lane keeping

✓ trajectory tracking

✓ collision avoidance against replayed traffic


What it cannot learn (the limitation of 3DGS + replay RL training: physically implausible traffic)

Suppose:

Replay car:
    turns left

Your ego decides:

stop completely

In reality:

other vehicles react
pedestrians react
traffic evolves differently

But in replay:

all agents continue exactly
as recorded

Therefore the RL agent never sees:

counterfactual futures

Example

Dataset:

Car A:
x=10
x=11
x=12
x=13

Recorded assuming:

human driver accelerates

During RL:

ego brakes hard

In the real world:

Car A might also brake

In replay:

Car A still:
10→11→12→13

This is physically inconsistent.

Yet RL can still train.
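The Car A example can be made concrete by comparing log replay with a reactive agent. In this sketch the log values come from the example above; everything else is an assumption made for illustration: the ego is placed ahead of Car A, the ego positions are invented, and `reactive_car_a` is a deliberately crude following rule, not a real traffic model.

```python
CAR_A_LOG = [10.0, 11.0, 12.0, 13.0]  # recorded x positions from the example


def replay_car_a(t, ego_x):
    """Log replay: the position depends on time only; ego_x is ignored."""
    return CAR_A_LOG[t]


def reactive_car_a(prev_x, ego_x, gap=5.0, speed=1.0):
    """Hypothetical reactive rule: advance only while the ego is >= gap ahead."""
    return prev_x + speed if ego_x - prev_x >= gap else prev_x


def simulate(ego_xs):
    """Return (replayed, reactive) Car A positions for a given ego trajectory."""
    replayed = [replay_car_a(t, ego_xs[t]) for t in range(len(ego_xs))]
    reactive = [CAR_A_LOG[0]]
    for t in range(1, len(ego_xs)):
        reactive.append(reactive_car_a(reactive[-1], ego_xs[t]))
    return replayed, reactive


human_ego = [15.0, 16.0, 17.0, 18.0]    # ego accelerates, as when the log was recorded
braking_ego = [15.0, 15.0, 15.0, 15.0]  # RL ego brakes hard and stays put

print(simulate(human_ego))    # ([10.0, 11.0, 12.0, 13.0], [10.0, 11.0, 12.0, 13.0])
print(simulate(braking_ego))  # ([10.0, 11.0, 12.0, 13.0], [10.0, 11.0, 11.0, 11.0])
```

When the ego behaves as it did in the recording, both versions agree. Once the ego brakes, only the reactive version slows down, while the replayed Car A keeps closing the gap as if nothing had changed.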


Why many autonomous-driving papers accept this

Because the objective is often:

Learn a driving policy

not

Build a perfect simulator

Replay-based environments provide:

  • real sensor data
  • realistic traffic
  • easy reward computation
  • stable training

without needing:

  • behavior prediction
  • traffic simulation
  • world models

The limitation

The strongest limitation is:

No agent interaction

Your environment is approximately:

World(t)

instead of:

World(t, EgoAction)

A true world model learns:

$$\text{World}_{t+1} = F(\text{World}_t, \text{EgoAction}_t)$$

which enables:

  • negotiation
  • yielding
  • merging
  • interactive driving

In fact, your quoted text already reveals this

Notice:

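"Any of these events (dynamic collision, static collision, excessive positional deviation, or excessive heading deviation) immediately terminates the episode, because after such an event the 3DGS environment typically produces noisy sensor data, which is detrimental to RL training."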

This is a strong clue that the environment is fundamentally a rendering/replay system, not a generative world model.

When the ego leaves the distribution of the recorded trajectory:

large deviation
collision
off-road

the rendered views become unreliable.

Therefore the episode terminates.

World-model simulators usually try to continue generating plausible futures after such deviations.


Conclusion

With the reward function you showed, RL training in a Static 3DGS + Trajectory Replay environment is completely feasible and is already an active research direction.

What you are learning is essentially:

Policy Learning
inside
a replay-based simulator

rather than:

Policy Learning
inside
an interactive world model

The main trade-off is:

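| Feature                 | 3DGS + Replay | World Model |
|-------------------------|---------------|-------------|
| Photorealism            | Excellent     | Often lower |
| Easy to build           | Yes           | No          |
| RL training             | Yes           | Yes         |
| Interactive traffic     | No            | Yes         |
| Counterfactual futures  | No            | Yes         |
| Closed-loop ego control | Yes           | Yes         |
| Multi-agent reaction    | No            | Yes         |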

So the answer is yes, RL can absolutely be trained successfully in your proposed setup, and a world model is not strictly necessary. The world model becomes valuable when you want other agents and the environment to react realistically to the ego vehicle's actions rather than simply replaying recorded trajectories.
