[Robotics] Reproducing the Aether World Model | Geometry-Aware Unified World Modeling, ICCV 2025

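Aether is a geometry-aware unified world model from ICCV 2025. The framework has three core capabilities:

(1) 4D dynamic reconstruction, (2) action-conditioned video prediction, (3) goal-conditioned visual planning.

Key feature: it is trained entirely on synthetic data and still achieves strong zero-shot generalization to real-world scenes.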

Code: https://github.com/OpenRobotLab/Aether

Paper: Aether: Geometric-Aware Unified World Modeling

This post walks through reproducing Aether and running model inference.

Below is a 4D reconstruction example:

Below is an action-conditioned prediction example:

Below is a visual planning example:

1. Create the Conda environment

First, clone the Aether code and enter the project directory:

git clone https://github.com/OpenRobotLab/Aether.git
cd Aether

Create a Conda environment named aether with Python 3.10:

conda create -n aether python=3.10
conda activate aether

2. Install the Aether dependencies

Run the following command to install them:

pip install -r requirements.txt

requirements.txt lists the following dependencies:

accelerate>=1.2.1
coloredlogs>=15.0.1
colorlog>=6.9.0
diffusers>=0.32.2
easydict>=1.13
einops>=0.8.0
hf_transfer>=0.1.8
huggingface-hub>=0.27.1
imageio>=2.33.1
imageio-ffmpeg>=0.5.1
iopath>=0.1.10
matplotlib>=3.10.0
numpy>=1.26.4,<2.0.0
omegaconf>=2.3.0
opencv-python-headless>=4.10.0.84
pillow>=11.1.0
plotly>=5.24.1
plyfile>=1.1
pre_commit>=4.0.1
python-dotenv>=1.0.1
PyYAML>=6.0.2
rich>=13.9.4
rootutils>=1.0.7
safetensors>=0.5.2
scikit-image>=0.25.0
scipy>=1.15.0
sentencepiece>=0.2.0
six>=1.17.0
tokenizers>=0.21.0
torch>=2.5.1
torchaudio>=2.5.1
torchmetrics>=1.6.1
torchvision>=0.20.1
tqdm>=4.67.1
transformers>=4.48.0
triton>=3.1.0
typer>=0.15.1
typing_extensions>=4.12.2
viser>=0.2.23
filterpy
trimesh
gradio
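After installation it can be useful to confirm that the versions pip resolved actually satisfy these constraints. The following is a minimal sketch using only the standard library; the helper names (parse_requirement, satisfies, check) are made up for this post, and the version comparison is a deliberate simplification that only looks at the numeric parts (it ignores pre-release and local-version tags).

```python
import re
from importlib import metadata

# Matches one constraint such as ">=1.26.4" or "<2.0.0".
SPEC_RE = re.compile(r"(>=|<=|==|<|>)\s*([0-9][0-9A-Za-z.]*)")


def parse_requirement(line):
    """Split 'numpy>=1.26.4,<2.0.0' into ('numpy', [('>=', '1.26.4'), ('<', '2.0.0')])."""
    line = line.strip()
    name = re.match(r"[A-Za-z0-9_.\-]+", line).group(0)
    return name, SPEC_RE.findall(line[len(name):])


def version_key(version):
    """Turn '4.10.0.84' into (4, 10, 0, 84); non-numeric parts are ignored (simplification)."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def satisfies(installed, op, required):
    """Compare two versions after padding them to the same length with zeros."""
    a, b = version_key(installed), version_key(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return {">=": a >= b, "<=": a <= b, "==": a == b, "<": a < b, ">": a > b}[op]


def check(lines):
    """Yield (name, installed version or None, ok) for each non-empty requirement line."""
    for line in lines:
        if not line.strip():
            continue
        name, specs = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            yield name, None, False
            continue
        yield name, installed, all(satisfies(installed, op, req) for op, req in specs)
```

A possible use, run from the repository root inside the aether environment: iterate over `check(open("requirements.txt"))` and print every entry whose last field is False. A package reported as missing may simply be registered under a differently spelled distribution name, so treat the result as a hint rather than a verdict.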

Output after a successful install:

(base) lgp@lgp-MS-7E07:~/2025_project/Aether$ conda activate aether

(aether) lgp@lgp-MS-7E07:~/2025_project/Aether$

(aether) lgp@lgp-MS-7E07:~/2025_project/Aether$ pip install -r requirements.txt

Successfully built iopath antlr4-python3-runtime filterpy

Installing collected packages: sentencepiece, pytz, pydub, nvidia-cusparselt-cu12, mpmath, easydict, distlib, antlr4-python3-runtime, zipp, xxhash, websockets, urllib3, tzdata, typing_extensions, triton, tqdm, tomlkit, sympy, svg.path, sniffio, six, shtab, shellingham, semantic-version, safetensors, ruff, rtree, rpds-py, regex, PyYAML, python-multipart, python-dotenv, pyparsing, pygments, psutil, portalocker, platformdirs, pillow, packaging, orjson, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu.......

.......

torch-2.7.1 torchaudio-2.7.1 torchmetrics-1.7.4 torchvision-0.22.1 tqdm-4.67.1 transformers-4.53.1 trimesh-4.6.13 triton-3.3.1 typeguard-4.4.4 typer-0.16.0 typing-inspection-0.4.1 typing_extensions-4.14.1 tyro-0.9.26 tzdata-2025.2 urllib3-2.5.0 uvicorn-0.35.0 vhacdx-0.0.8.post2 virtualenv-20.31.2 viser-1.0.0 websockets-15.0.1 xxhash-3.5.0 yourdfpy-0.0.58 zipp-3.23.0

(aether) lgp@lgp-MS-7E07:~/2025_project/Aether

Then install protobuf with the following command:

pip install protobuf==3.20.3

Wait for the installation to finish.

3. Model weights

Two sets of model weights are required: AetherWorldModel/AetherV1 and THUDM/CogVideoX-5b-I2V.

If you run inference directly, they are downloaded automatically, by default into ~/.cache/huggingface/hub/:

11G ~/.cache/huggingface/hub/models--AetherWorldModel--AetherV1

9.7G ~/.cache/huggingface/hub/models--THUDM--CogVideoX-5b-I2V

Weight repositories:

https://huggingface.co/AetherWorldModel/AetherV1/tree/main

https://huggingface.co/THUDM/CogVideoX-5b-I2V/tree/main
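To see where the weights end up, or to fetch them before the first inference run, something like the sketch below may help. The helper names are invented for this post. The cache folder name follows the hub convention of `models--` plus the repo id with `/` replaced by `--`, which matches the two directories listed above. The `prefetch` function relies on `snapshot_download` from huggingface-hub (already in requirements.txt) and downloads each repository in full, which may be more than the demo script fetches on its own.

```python
import os
from pathlib import Path

# Repo ids taken from the cache folder names shown above.
REPOS = ["AetherWorldModel/AetherV1", "THUDM/CogVideoX-5b-I2V"]


def cache_dir_for(repo_id, hub_cache=None):
    """Return the hub cache folder for a model repo, e.g. models--THUDM--CogVideoX-5b-I2V."""
    root = Path(hub_cache) if hub_cache else Path.home() / ".cache" / "huggingface" / "hub"
    return root / ("models--" + repo_id.replace("/", "--"))


def dir_size_gib(path):
    """Sum the sizes of regular files under path in GiB; symlinks are skipped
    so that snapshot links to blobs are not counted twice."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total / 2**30


def prefetch(repo_ids=REPOS):
    """Download each repo into the default cache ahead of time (needs network access)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    for repo_id in repo_ids:
        snapshot_download(repo_id=repo_id)
```

For example, `dir_size_gib(cache_dir_for("AetherWorldModel/AetherV1"))` gives a rough size figure for the first folder; it returns 0.0 if nothing has been downloaded yet.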

4. Run inference locally

4D reconstruction example. Run the following command:

python scripts/demo.py --task reconstruction --video ./assets/example_videos/moviegen.mp4

The first run needs network access to download the weights. They are large, so expect a long wait.

(aether) lgp@lgp-MS-7E07:~/2025_project/Aether$

(aether) lgp@lgp-MS-7E07:~/2025_project/Aether$ python scripts/demo.py --task reconstruction --video ./assets/example_videos/moviegen.mp4

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers

config.json: 782B [00:00, 2.79MB/s]

model.safetensors.index.json: 19.9kB [00:00, 3.38MB/s]

model-00002-of-00002.safetensors: 100%|█████████████████████████████████████████████████████| 4.53G/4.53G [04:08<00:00, 18.2MB/s]

model-00001-of-00002.safetensors: 100%|█████████████████████████████████████████████████████| 4.99G/4.99G [04:09<00:00, 20.0MB/s]

Fetching 2 files: 100%|███████████████████████████████████████████████████████████████████████████| 2/2 [04:10<00:00, 125.16s/it]

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.32it/s]

config.json: 872B [00:00, 3.16MB/s]

diffusion_pytorch_model.safetensors: 100%|████████████████████████████████████████████████████| 862M/862M [02:17<00:00, 6.26MB/s]

scheduler_config.json: 100%|████████████████████████████████████████████████████████████████████| 482/482 [00:00<00:00, 1.99MB/s]

config.json: 914B [00:00, 3.12MB/s]

........................

Building GLB scene

GLB Scene built

(aether) lgp@lgp-MS-7E07:~/2025_project/Aether

Results are saved in ./outputs/.

Generated file:

reconstruction_moviegen_disparity.mp4
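To check what a run produced, a small listing helper can be handy. This is only a sketch: the function name is made up for this post, and filtering by a task-name prefix is an assumption based on the single file name shown above.

```python
from pathlib import Path


def list_outputs(output_dir="./outputs", prefix=""):
    """Return (file name, size in MiB) pairs for files in output_dir, sorted by name.

    prefix filters by the start of the file name, e.g. "reconstruction".
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return [
        (path.name, round(path.stat().st_size / 2**20, 2))
        for path in sorted(root.iterdir())
        if path.is_file() and path.name.startswith(prefix)
    ]
```

Calling `list_outputs(prefix="reconstruction")` from the repository root after the command above should include reconstruction_moviegen_disparity.mp4 if the run completed.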

Action-conditioned video prediction. Run the following command:

python scripts/demo.py --task prediction --image ./assets/example_obs/car.png --raymap_action assets/example_raymaps/raymap_forward_right.npy

Log output:

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.78it/s]

Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 18450.02it/s]

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 51.99it/s]

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [04:08<00:00, 4.97s/it]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00, 2.49s/it]

Building GLB scene

GLB Scene built

Generated files:

Goal-conditioned visual planning. Run the following command:

python scripts/demo.py --task planning --image ./assets/example_obs_goal/01_obs.png --goal ./assets/example_obs_goal/01_goal.png

Log output:

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.80it/s]

Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 18978.75it/s]

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 52.97it/s]

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [04:07<00:00, 4.96s/it]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00, 2.49s/it]

Building GLB scene

GLB Scene built

Generated files:
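The three demo invocations in this section differ only in the --task value and the input flags, so they can be collected into one small launcher. The sketch below reuses exactly the flags and example paths from the commands above; the function names and the DEMO_ARGS table are made up for this post, and it assumes the script is started from the repository root inside the aether environment.

```python
import subprocess
import sys

# Input flags per task, copied from the three demo commands in this post.
DEMO_ARGS = {
    "reconstruction": ["--video", "./assets/example_videos/moviegen.mp4"],
    "prediction": [
        "--image", "./assets/example_obs/car.png",
        "--raymap_action", "assets/example_raymaps/raymap_forward_right.npy",
    ],
    "planning": [
        "--image", "./assets/example_obs_goal/01_obs.png",
        "--goal", "./assets/example_obs_goal/01_goal.png",
    ],
}


def build_command(task):
    """Return the argv list for one demo task, using the current Python interpreter."""
    return [sys.executable, "scripts/demo.py", "--task", task, *DEMO_ARGS[task]]


def run_all(tasks=("reconstruction", "prediction", "planning")):
    """Run the demos one after another; check=True stops at the first failing task."""
    for task in tasks:
        subprocess.run(build_command(task), check=True)
```

Judging by the logs above, each of the prediction and planning runs spends about four minutes in the 50-step sampling loop on the author's machine, so running all three tasks back to back takes a while.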