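GR00T N1.7 Source Code Study (1): Project Entry Points, Model Structure, and Action Generation Flow
The previous post walked through the GR00T N1.7 main model, the action head, the Flow Matching training objective, and how actions are generated at inference time. This post follows a raw robot trajectory from the moment it is read out of a LeRobot dataset directory until it becomes the input dictionary that Gr00tN1d7.forward() can consume directly. The relevant source files are: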
gr00t/data/dataset/lerobot_episode_loader.py
gr00t/data/dataset/sharded_single_step_dataset.py
gr00t/data/types.py
gr00t/data/state_action/state_action_processor.py
gr00t/model/gr00t_n1d7/processing_gr00t_n1d7.py
gr00t/configs/data/embodiment_configs.py
examples/SO100/modality.json
examples/LIBERO/modality.json
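1. The LeRobot directory, modality.json, and ModalityConfig jointly define the data semantics
GR00T N1.7 does not define a fully independent data format of its own; it reuses the LeRobot-style dataset directory. A typical dataset directory looks roughly like this: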
dataset_root/
├── data/
├── videos/
└── meta/
    ├── info.json
    ├── episodes.jsonl
    ├── tasks.jsonl
    ├── modality.json
    ├── stats.json
    └── relative_stats.json
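data/ usually holds parquet files that store low-dimensional states, actions, timestamps, task indices, and similar fields; videos/ holds the camera videos; meta/ holds the files that describe the dataset as a whole. GR00T's own loading logic is concentrated in LeRobotEpisodeLoader, whose source begins by defining the standard file names as constants: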
LEROBOT_META_DIR_NAME = "meta"
LEROBOT_INFO_FILENAME = "info.json"
LEROBOT_EPISODES_FILENAME = "episodes.jsonl"
LEROBOT_TASKS_FILENAME = "tasks.jsonl"
LEROBOT_MODALITY_FILENAME = "modality.json"
LEROBOT_STATS_FILE_NAME = "stats.json"
LEROBOT_RELATIVE_STATS_FILE_NAME = "relative_stats.json"
ALLOWED_MODALITIES = ["video", "state", "action", "language", "mask"]
DEFAULT_COLUMN_NAMES = {
    "state": "observation.state",
    "action": "action",
}
LANG_KEYS = ["task", "sub_task"]
Among these, stats.json is a hard dependency; if it is missing, loading fails immediately:
stats_path = meta_dir / LEROBOT_STATS_FILE_NAME
assert stats_path.exists(), (
    f"{stats_path} does not exist for {self.dataset_path}, please use gr00t/data/stats.py to generate it"
)
with open(stats_path, "r") as f:
    self.stats = json.load(f)
In other words, state and action statistics must be prepared before GR00T training. States and actions are normalized before entering the model, and the model's output actions are denormalized afterwards, so without the statistics the downstream Processor cannot work correctly.
If relative_stats.json exists in the dataset, the loader also reads it and stores it under self.stats["relative_action"]:
relative_stats_path = meta_dir / LEROBOT_RELATIVE_STATS_FILE_NAME
if relative_stats_path.exists():
    with open(relative_stats_path, "r") as f:
        relative_stats = json.load(f)
    relative_stats.pop("__fingerprints__", None)
    self.stats["relative_action"] = relative_stats
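The required/optional split between the two statistics files can be reproduced outside the loader. The following is a minimal sketch, not the GR00T implementation: load_dataset_stats and demo_load are hypothetical helper names, and the toy contents written to the temporary meta/ directory are made up for illustration only.

```python
import json
import tempfile
from pathlib import Path


def load_dataset_stats(meta_dir: Path) -> dict:
    """Load stats.json (required) and merge relative_stats.json (optional)."""
    stats_path = meta_dir / "stats.json"
    if not stats_path.exists():
        raise FileNotFoundError(f"{stats_path} is required but missing")
    stats = json.loads(stats_path.read_text())

    relative_path = meta_dir / "relative_stats.json"
    if relative_path.exists():
        relative_stats = json.loads(relative_path.read_text())
        # Drop the bookkeeping entry, which is not a statistic.
        relative_stats.pop("__fingerprints__", None)
        stats["relative_action"] = relative_stats
    return stats


def demo_load() -> dict:
    """Write a toy meta/ directory to a temp folder and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        meta_dir = Path(tmp) / "meta"
        meta_dir.mkdir()
        (meta_dir / "stats.json").write_text(json.dumps({"action": {"min": [0.0]}}))
        (meta_dir / "relative_stats.json").write_text(
            json.dumps({"__fingerprints__": "abc", "joint_position": {"min": [-1.0]}})
        )
        return load_dataset_stats(meta_dir)
```

Calling demo_load() returns a dictionary whose "relative_action" entry holds the relative statistics without the "__fingerprints__" key.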
modality.json maps raw data fields to the modalities the model uses. Taking examples/SO100/modality.json as an example:
{
"state": {
"single_arm": {
"start": 0,
"end": 5
},
"gripper": {
"start": 5,
"end": 6
}
},
"action": {
"single_arm": {
"start": 0,
"end": 5
},
"gripper": {
"start": 5,
"end": 6
}
},
"video": {
"front": {
"original_key": "observation.images.front"
},
"wrist": {
"original_key": "observation.images.wrist"
}
},
"annotation": {
"human.task_description": {
"original_key": "task_index"
}
}
}
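This configuration splits the raw state vector into two groups, single_arm and gripper, and splits the action vector the same way; images are read from observation.images.front and observation.images.wrist and renamed to front and wrist. The names state.single_arm, state.gripper, video.front, and video.wrist that appear in later code all come from this file.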
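To make the start/end fields concrete, here is a small sketch of how a flat 6-dimensional state could be cut into the two SO100 groups. It assumes the ranges are half-open, [start, end), which matches the 5 + 1 = 6 dimensions in the file; split_by_modality is a hypothetical helper, not a function from the GR00T code base, and the state values are made up.

```python
def split_by_modality(vector, layout):
    """Slice a flat vector into named groups using half-open [start, end) ranges."""
    return {name: vector[span["start"]:span["end"]] for name, span in layout.items()}


# The "state" section of examples/SO100/modality.json.
state_layout = {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6},
}

raw_state = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]  # made-up observation.state vector
groups = split_by_modality(raw_state, state_layout)
```

Here groups["single_arm"] holds the first five values and groups["gripper"] holds the last one.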
modality.json describes the dataset side, whereas training actually consumes ModalityConfig, which is defined in gr00t/data/types.py:
@dataclass
class ModalityConfig:
    """Configuration for a modality defining how data should be sampled and loaded."""

    delta_indices: list[int]
    """Delta indices to sample relative to the current index."""
    modality_keys: list[str]
    """The keys to load for the modality in the dataset."""
    sin_cos_embedding_keys: list[str] | None = None
    mean_std_embedding_keys: list[str] | None = None
    action_configs: list[ActionConfig] | None = None
delta_indices describes which time positions to sample relative to the current step. For example, a video config of [-15, 0] takes one past frame (15 steps back) and the current frame; a state config of [0] takes only the current state; an action config of list(range(16)) takes 16 actions starting from the current step, forming an action chunk.
The built-in robot configurations live in gr00t/configs/data/embodiment_configs.py. Taking oxe_droid_relative_eef_relative_joint as an example:
"oxe_droid_relative_eef_relative_joint": {
    "video": ModalityConfig(
        delta_indices=[-15, 0],
        modality_keys=["exterior_image_1_left", "wrist_image_left"],
    ),
    "state": ModalityConfig(
        delta_indices=[0],
        modality_keys=["eef_9d", "gripper_position", "joint_position"],
    ),
    "action": ModalityConfig(
        delta_indices=list(range(40)),
        modality_keys=["eef_9d", "gripper_position", "joint_position"],
        action_configs=[
            ActionConfig(
                rep=ActionRepresentation.RELATIVE,
                type=ActionType.EEF,
                format=ActionFormat.XYZ_ROT6D,
                state_key="eef_9d",
            ),
            ActionConfig(
                rep=ActionRepresentation.ABSOLUTE,
                type=ActionType.NON_EEF,
                format=ActionFormat.DEFAULT,
                state_key="gripper_position",
            ),
            ActionConfig(
                rep=ActionRepresentation.RELATIVE,
                type=ActionType.NON_EEF,
                format=ActionFormat.DEFAULT,
                state_key="joint_position",
            ),
        ],
    ),
    "language": ModalityConfig(
        delta_indices=[0],
        modality_keys=["annotation.language.language_instruction"],
    ),
}
Setting use_relative_action=True does not mean that every action is converted to a relative representation. Whether a group uses relative actions also depends on that group's own ActionConfig. Here eef_9d is a relative end-effector action, joint_position is a relative joint action, and gripper_position stays absolute.
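The interaction between the global switch and the per-group setting can be sketched in a few lines. Rep and GroupConfig below are hypothetical stand-ins for ActionRepresentation and ActionConfig, reduced to the one field that matters here; the selection rule mirrors the condition "rep is RELATIVE and use_relative_action is set".

```python
from dataclasses import dataclass
from enum import Enum


class Rep(Enum):  # hypothetical stand-in for ActionRepresentation
    RELATIVE = "relative"
    ABSOLUTE = "absolute"


@dataclass
class GroupConfig:  # hypothetical stand-in for ActionConfig
    rep: Rep


def groups_converted_to_relative(keys, configs, use_relative_action):
    """Return the action groups that would be converted to relative form."""
    return [
        key
        for key, cfg in zip(keys, configs)
        if use_relative_action and cfg.rep is Rep.RELATIVE
    ]


keys = ["eef_9d", "gripper_position", "joint_position"]
configs = [GroupConfig(Rep.RELATIVE), GroupConfig(Rep.ABSOLUTE), GroupConfig(Rep.RELATIVE)]
```

With the switch on, only eef_9d and joint_position are selected; with it off, no group is converted regardless of its own setting.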
2. The Dataset splits full episodes into training samples
LeRobotEpisodeLoader reads complete trajectories episode by episode. Its __getitem__ returns not a single training sample but a DataFrame for one whole episode:
def __getitem__(self, idx: int) -> pd.DataFrame:
    if idx < 0 or idx >= len(self):
        raise IndexError(f"Episode index {idx} out of bounds")
    episode_meta = self.episodes_metadata[idx]
    episode_id = episode_meta["episode_index"]
    nominal_length = episode_meta["length"]
    # Load and parse the parquet data
    df = self._load_parquet_data(episode_id)
    if "language" in self.modality_configs:
        lang_key = self.modality_configs["language"].modality_keys[0]
        if lang_key in LANG_KEYS:
            new_languages = self.create_language_from_meta(episode_meta, len(df), lang_key)
            df["language." + lang_key] = new_languages
    actual_length = min(len(df), nominal_length)
    df = df.iloc[:actual_length]
    # Load synchronized video data
    video_data = self._load_video_data(episode_id, np.arange(actual_length))
    for key in video_data.keys():
        assert len(video_data[key]) == len(df), (
            f"Video data for {key} has length {len(video_data[key])} but dataframe has length {len(df)}"
        )
        df[f"video.{key}"] = [frame for frame in video_data[key]]
    return df
The parquet file usually already contains states and actions, while images are decoded from the video files by _load_video_data and written into the DataFrame. Language fields are also unified here: if the configured key is task or sub_task, the code generates the per-frame language text from the metadata. The resulting DataFrame columns look roughly like this:
state.xxx
action.xxx
video.front
video.wrist
language.task
mask.xxx
A full episode cannot be fed to the model directly; ShardedSingleStepDataset later splits it into training samples, each centered on one step. The class is defined in:
gr00t/data/dataset/sharded_single_step_dataset.py
At initialization it computes the action horizon from the action config:
action_delta_indices = modality_configs["action"].delta_indices
self.action_horizon = max(action_delta_indices) - min(action_delta_indices) + 1
The effective episode length excludes trailing steps for which a full future action chunk cannot be assembled:
def get_effective_episode_length(self, episode_index: int) -> int:
    original_length = self.episode_loader.get_episode_length(episode_index)
    return max(0, original_length - self.action_horizon + 1)
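The two formulas above are easy to check in isolation. The sketch below restates them as free functions (hypothetical names, no dataset involved) and evaluates them for the 40-step chunk used by the DROID configuration and for an episode that is shorter than one chunk.

```python
def action_horizon(delta_indices):
    """Span covered by the action delta indices, inclusive on both ends."""
    return max(delta_indices) - min(delta_indices) + 1


def effective_episode_length(episode_length, horizon):
    """Number of steps that can start a full action chunk, never negative."""
    return max(0, episode_length - horizon + 1)


horizon_40 = action_horizon(list(range(40)))           # 40-step chunk
usable_steps = effective_episode_length(120, horizon_40)
too_short = effective_episode_length(30, horizon_40)   # shorter than one chunk
```

An episode shorter than the horizon contributes no training samples at all, which is worth keeping in mind when choosing the chunk length for datasets with many short episodes.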
Suppose an episode has length 100 and the action chunk has length 16. Then at most 85 steps can serve as a training start point, because starting from any later step there are no longer 16 future actions available. To fetch a single training sample, the dataset calls extract_step_data:
def get_datapoint(self, episode_data: pd.DataFrame, step_index: int) -> dict:
    assert self.processor is not None, "Processor must be set before getting datapoints"
    vla_step_data = extract_step_data(
        episode_data,
        step_index,
        self.modality_configs,
        self.embodiment_tag,
        self.allow_padding,
    )
    messages = [{"type": MessageType.EPISODE_STEP.value, "content": vla_step_data}]
    return self.processor(messages)
extract_step_data gathers the data of each modality according to its delta_indices:
def extract_step_data(
    episode_data: pd.DataFrame,
    step_index: int,
    modality_configs: dict[str, ModalityConfig],
    embodiment_tag: EmbodimentTag,
    allow_padding: bool = False,
) -> VLAStepData:
    step_data = {}
    for modality, config in modality_configs.items():
        step_data[modality] = {}
        indices_to_load = [step_index + delta_index for delta_index in config.delta_indices]
        if allow_padding:
            indices_to_load = [max(0, min(idx, len(episode_data) - 1)) for idx in indices_to_load]
        for key in config.modality_keys:
            if f"{modality}.{key}" in episode_data.columns:
                modality_data = episode_data[f"{modality}.{key}"].iloc[indices_to_load]
            else:
                raise KeyError(
                    f"{modality}.{key} not found in episode data, available keys: {episode_data.columns}"
                )
            if modality in ["state", "action"]:
                step_data[modality][key] = np.vstack(
                    [
                        np.array(modality_data.iloc[i]).astype(np.float32)
                        for i in range(len(modality_data))
                    ]
                )
            else:
                step_data[modality][key] = modality_data.tolist()
For example, with the current step at 100 and action delta_indices=list(range(16)), actions 100 through 115 are read; with video delta_indices=[-15, 0], frames 85 and 100 are read; with state delta_indices=[0], only the current state is taken.
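The index arithmetic, including the clamping applied when allow_padding is set, can be sketched without pandas. indices_to_load below is a hypothetical free function that mirrors the two list comprehensions in extract_step_data; the episode length of 200 is an arbitrary choice for the example.

```python
def indices_to_load(step_index, delta_indices, episode_length, allow_padding=False):
    """Absolute frame indices for one modality at one step."""
    indices = [step_index + delta for delta in delta_indices]
    if allow_padding:
        # Clamp into [0, episode_length - 1]; out-of-range positions repeat the edge frame.
        indices = [max(0, min(idx, episode_length - 1)) for idx in indices]
    return indices


video_indices = indices_to_load(100, [-15, 0], episode_length=200)
early_video = indices_to_load(5, [-15, 0], episode_length=200, allow_padding=True)
late_actions = indices_to_load(195, list(range(16)), episode_length=200, allow_padding=True)
```

With padding enabled, a step near the start reuses frame 0 for the missing past frame, and a step near the end repeats the last action to fill the chunk.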
Finally, the data is wrapped in a VLAStepData:
    vla_step_data = VLAStepData(
        images=video_data,
        masks=mask_data if mask_data else None,
        states=state_data,
        actions=action_data,
        text=text,
        embodiment=embodiment_tag,
    )
    return vla_step_data
VLAStepData is the intermediate structure between the Dataset and the Processor:
@dataclass
class VLAStepData:
    """
    Represents a single step of VLA (Vision-Language-Action) data.
    """

    images: dict[str, list[np.ndarray]]
    states: dict[str, np.ndarray]
    actions: dict[str, np.ndarray]
    masks: dict[str, list[np.ndarray]] | None = None
    text: str | None = None
    embodiment: EmbodimentTag = EmbodimentTag.NEW_EMBODIMENT
    is_demonstration: bool = False
    metadata: dict[str, Any] = field(default_factory=dict)
3. Gr00tN1d7Processor unifies state/action, image/language, and embodiment inputs
Gr00tN1d7Processor is defined in:
gr00t/model/gr00t_n1d7/processing_gr00t_n1d7.py
It turns a VLAStepData into model inputs. At initialization it stores the modality configs, the statistics, the image-processing settings, the VLM Processor settings, and the embodiment_id mapping:
class Gr00tN1d7Processor(BaseProcessor):
    data_collator_class = Gr00tN1d7DataCollator

    def __init__(
        self,
        modality_configs: dict[str, dict[str, ModalityConfig]],
        statistics: dict[str, dict[str, dict[str, dict[str, list[float]]]]] | None = None,
        use_percentiles: bool = False,
        clip_outliers: bool = True,
        model_name: str = "nvidia/Cosmos-Reason2-2B",
        model_type: str = "qwen",
        max_state_dim: int = 29,
        max_action_dim: int = 29,
        max_action_horizon: int = 50,
        use_relative_action: bool = False,
        embodiment_id_mapping: dict[str, int] | None = None,
        exclude_state: bool = False,
        state_dropout_prob: float = 0.0,
        use_mean_std: bool = False,
        ...
    ):
        self.modality_configs = parse_modality_configs(modality_configs)
State and action processing is not written directly in the Processor; it is delegated to StateActionProcessor:
self.state_action_processor = StateActionProcessor(
    modality_configs=modality_configs,
    statistics=statistics,
    use_percentiles=use_percentiles,
    clip_outliers=clip_outliers,
    apply_sincos_state_encoding=apply_sincos_state_encoding,
    use_relative_action=use_relative_action,
)
Gr00tN1d7Processor handles the full multimodal sample, including images, language, states, actions, masks, and VLM inputs; StateActionProcessor handles only the low-dimensional states and actions, covering normalization, relative-action conversion, sin/cos encoding, and denormalization.
The Processor also creates a Qwen3-VL Processor internally:
self.processor = build_processor(model_name, transformers_loading_kwargs)
self.processor.tokenizer.padding_side = "left"
The training entry point __call__() first unpacks the VLAStepData and then hands the states and actions to StateActionProcessor:
def __call__(
    self,
    messages: list[dict[str, Any]],
):
    assert len(messages) == 1
    content = messages[0]["content"]
    embodiment_tag = content.embodiment
    action_data = content.actions
    state_data = content.states
    norm_state_dict, normalized_actions = self.state_action_processor.apply(
        state=state_data,
        action=action_data,
        embodiment_tag=embodiment_tag.value,
    )
StateActionProcessor.apply() processes the state first and then the action:
def apply(
    self,
    state: dict[str, np.ndarray],
    action: dict[str, np.ndarray],
    embodiment_tag: str,
) -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]]:
    """
    Apply both state and action processing together.
    """
    processed_state = self.apply_state(state, embodiment_tag)
    if action:
        processed_action = self.apply_action(action, embodiment_tag, state=state)
    else:
        assert not self.training, "Action is required in training mode"
        processed_action = {}
    return processed_state, processed_action
Relative-action conversion needs the raw state as its reference. If the state were normalized first and then used to compute relative actions, the physical meaning would change. The action-processing function therefore receives the raw state, not the normalized one.
Images and language are also arranged into VLM inputs inside the Processor. The image transform is selected according to self.training:
if self.training:
    image_transform = self.train_image_transform
else:
    image_transform = self.eval_image_transform
image_keys = self.modality_configs[embodiment_tag.value]["video"].modality_keys
By default the language instruction is lowercased and stripped of punctuation:
if self.formalize_language:
    language = content.text.lower()
    language = re.sub(r"[^\w\s]", "", language)
else:
    language = content.text
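The effect of the two cleaning steps is easiest to see on a concrete instruction. The sketch below wraps them in a hypothetical formalize function; the regular expression is the one from the excerpt above, and the sample sentences are made up.

```python
import re


def formalize(text: str) -> str:
    """Lowercase the text and delete every character that is not a word character or whitespace."""
    lowered = text.lower()
    return re.sub(r"[^\w\s]", "", lowered)


cleaned = formalize("Pick up the RED block, then stop!")
hyphenated = formalize("Pick-and-place")
```

Punctuation is deleted rather than replaced by a space, so hyphenated words are fused, while underscores survive because `\w` matches them.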
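Images are first stacked per view along the time axis:
for view in image_keys:
    assert view in images, f"{view} not in {images}"
    temporal_stacked_images[view] = torch.stack(
        [image_transform(img) for img in images[view]]
    )  # (T, C, H, W)
The time and camera dimensions are then flattened before the images are passed to the VLM Processor:
stacked_images = (
    torch.stack([temporal_stacked_images[view] for view in image_keys], dim=1)
    .flatten(0, 1)
    .numpy()
)  # (T*V, C, H, W)
If a sample has 2 time steps and 2 cameras, 4 images are passed to the VLM Processor. No explicit structural field is added per image; the time and camera information is conveyed mainly by the image order and by how the prompt is organized.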
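Since the image order carries the time and camera information, it helps to spell out what that order is. Stacking the per-view tensors on dim=1 gives shape (T, V, ...), and flattening the first two dimensions puts image (t, v) at position t * V + v, so the result is time-major and view-minor. The sketch below reproduces that ordering with plain lists; flatten_time_view is a hypothetical helper and the string labels stand in for image tensors.

```python
def flatten_time_view(frames_by_view, view_order):
    """Mimic torch.stack(..., dim=1).flatten(0, 1): time-major, view-minor order."""
    num_steps = len(frames_by_view[view_order[0]])
    return [frames_by_view[view][t] for t in range(num_steps) for view in view_order]


frames = {
    "front": ["front@t-15", "front@t"],
    "wrist": ["wrist@t-15", "wrist@t"],
}
ordered = flatten_time_view(frames, ["front", "wrist"])
```

All views of the older time step come first, followed by all views of the current step, with the views in modality_keys order inside each step.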
At the batch level, the component that actually calls the Qwen3-VL Processor is Gr00tN1d7DataCollator. The VLM part returned by the per-sample Processor is vlm_content, and the collator gathers the text and images from all samples:
    if key == "vlm_content":
        text_list = []
        image_inputs = []
        for v in values:
            curr_text_list = [v["text"]]
            text_list += curr_text_list
            curr_image_inputs = v["images"]
            image_inputs += curr_image_inputs
        vlm_inputs = self.processor(
            text=text_list,
            images=image_inputs,
            return_tensors="pt",
            padding=True,
        )
        for k, v in vlm_inputs.items():
            batch[k] = v
The low-dimensional states, actions, and masks have already been padded to fixed shapes in the per-sample stage, so they can be stacked directly:
    else:
        batch[key] = torch.from_numpy(np.stack(values))
return BatchFeature(data={"inputs": batch})
The input the model finally receives looks roughly like this:
{
    "inputs": {
        "input_ids": ...,
        "attention_mask": ...,
        "pixel_values": ...,
        "image_grid_thw": ...,
        "state": ...,
        "action": ...,
        "action_mask": ...,
        "embodiment_id": ...,
    }
}
4. StateActionProcessor handles normalization, relative actions, and masks
StateActionProcessor is the key module for low-dimensional data. Its responsibilities are spelled out in the class docstring:
class StateActionProcessor:
    """
    Unified processor for robot state and action data.

    Handles:
    - State normalization (min/max, mean/std, sin/cos encoding)
    - Action normalization
    - Absolute <-> Relative action representation conversion
    - Action processing with state dependency
    """
States are processed by apply_state(). It iterates over every state group in the current robot's config and chooses the normalization method according to the configuration:
def apply_state(
    self,
    state: dict[str, np.ndarray],
    embodiment_tag: str,
) -> dict[str, np.ndarray]:
    normalized_values = {}
    state_config = self.modality_configs[embodiment_tag]["state"]
    sin_cos_keys = set()
    if self.apply_sincos_state_encoding and hasattr(state_config, "sin_cos_embedding_keys"):
        sin_cos_keys = set(state_config.sin_cos_embedding_keys)
    for joint_group in state_config.modality_keys:
        if joint_group not in state:
            raise KeyError(
                f"Joint group '{joint_group}' not found in state dict for embodiment '{embodiment_tag}'"
            )
        params = self.norm_params[embodiment_tag]["state"][joint_group]
        if sin_cos_keys and joint_group in sin_cos_keys:
            normalized = apply_sin_cos_encoding(state[joint_group], params)
        elif (
            state_config.mean_std_embedding_keys is not None
            and joint_group in state_config.mean_std_embedding_keys
        ):
            normalized = normalize_values_meanstd(state[joint_group], params)
        else:
            normalized = normalize_values_minmax(state[joint_group], params)
        normalized_values[joint_group] = normalized
N1.7 does not apply a single standardization formula to all states; it supports a different treatment per group. Continuous low-dimensional states can use min/max, some embedding-like states can use mean/std, and periodic variables such as joint angles can use sin/cos encoding.
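The three treatments can be illustrated with textbook formulas. The sketch below is an assumption about what each scheme computes, not a copy of the GR00T helpers: min/max is taken to map [min, max] onto [-1, 1] (consistent with the later clip to [-1, 1]), mean/std is the usual z-score, and the sin/cos encoding is taken to concatenate sines and cosines. All function names are hypothetical, and the min/max version does not guard against max equal to min.

```python
import math


def normalize_minmax(values, lo, hi):
    """Map each value from [lo, hi] to [-1, 1] (assumed convention)."""
    return [2.0 * (v - l) / (h - l) - 1.0 for v, l, h in zip(values, lo, hi)]


def normalize_meanstd(values, mean, std):
    """Standard z-score per dimension."""
    return [(v - m) / s for v, m, s in zip(values, mean, std)]


def sin_cos_encode(angles):
    """Encode angles in radians as [sin..., cos...], doubling the dimension."""
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


minmax_out = normalize_minmax([0.0, 5.0, 10.0], [0.0] * 3, [10.0] * 3)
zscore_out = normalize_meanstd([3.0], [1.0], [2.0])
# +pi and -pi are the same joint pose; their encodings agree up to rounding.
wrap_a = sin_cos_encode([math.pi])
wrap_b = sin_cos_encode([-math.pi])
```

The last two lines show why periodic variables benefit from the encoding: the raw angles differ by 2π, yet the encoded vectors are numerically almost identical.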
Actions are processed by apply_action(), which first converts to relative actions and then normalizes:
def apply_action(
    self,
    action: dict[str, np.ndarray],
    embodiment_tag: str,
    state: dict[str, np.ndarray] | None = None,
) -> dict[str, np.ndarray]:
    """
    Apply action processing (absolute->relative conversion, normalization).

    Processing order:
    1. Convert absolute actions to relative (if configured)
    2. Normalize actions
    """
The source decides group by group, based on action_configs, which actions need to be converted to relative form:
modality_keys = self.modality_configs[embodiment_tag]["action"].modality_keys
action_configs = self.modality_configs[embodiment_tag]["action"].action_configs
if action_configs is not None:
    for key, action_config in zip(modality_keys, action_configs):
        if action_config.rep == ActionRepresentation.RELATIVE and self.use_relative_action:
            if state is None:
                raise ValueError(
                    f"State dict required for relative action processing of key '{key}' "
                    f"in embodiment '{embodiment_tag}'"
                )
            state_key = action_config.state_key if action_config.state_key else key
            if state_key not in state:
                raise KeyError(
                    f"Reference state key '{state_key}' not found in state dict "
                    f"for embodiment '{embodiment_tag}'"
                )
            reference_state = state[state_key][-1]
            action[key] = self._convert_to_relative_action(
                action=action[key],
                reference_state=reference_state,
                action_type=action_config.type,
                action_format=action_config.format,
            )
The reference state is state[state_key][-1], that is, the last frame of the state history. The action chunk in a training sample is usually a stretch of future actions starting at the current time, so it is natural to compute relative actions with respect to the current state.
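The underlying conversion is not written as a plain action - reference_state; it distinguishes end-effector actions from non-end-effector actions:
if action_type == ActionType.EEF:
    action_chunking = EndEffectorActionChunk.from_array(action, action_format)
    reference_frame = EndEffectorPose.from_action_format(reference_state, action_format)
elif action_type == ActionType.NON_EEF:
    action_chunking = JointActionChunk([JointPose(m) for m in action])
    reference_frame = JointPose(reference_state)
Joint actions can be treated fairly directly as a vector difference, but an end-effector pose contains a rotation, and with a format such as XYZ_ROT6D you cannot subtract dimension by dimension. The source therefore uses EndEffectorActionChunk and EndEffectorPose for EEF actions, and JointActionChunk and JointPose for non-EEF actions.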
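For the non-EEF case, the vector-difference view can be sketched directly. The two helpers below are hypothetical and cover joint-style groups only: every step of the chunk is expressed relative to the same reference state, and adding the reference back recovers the absolute chunk. End-effector poses are deliberately left out, because their rotation part needs a proper pose composition rather than a subtraction.

```python
def to_relative_joint(action_chunk, reference_state):
    """Subtract one reference state from every step of an action chunk."""
    return [[a - r for a, r in zip(step, reference_state)] for step in action_chunk]


def to_absolute_joint(relative_chunk, reference_state):
    """Inverse of to_relative_joint: add the reference state back."""
    return [[d + r for d, r in zip(step, reference_state)] for step in relative_chunk]


reference = [1.0, 2.0]                # current joint state (made-up values)
chunk = [[1.5, 2.0], [2.0, 1.0]]      # two future absolute joint targets
relative = to_relative_joint(chunk, reference)
restored = to_absolute_joint(relative, reference)
```

Note that every step is relative to the chunk's starting state, not to the previous step, so the deltas grow with the distance from the current time.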
After conversion, the actions are normalized and clipped:
for joint_group in modality_keys:
    params = self.norm_params[embodiment_tag]["action"][joint_group]
    if (
        self.modality_configs[embodiment_tag]["action"].mean_std_embedding_keys is not None
        and joint_group
        in self.modality_configs[embodiment_tag]["action"].mean_std_embedding_keys
    ):
        normalized = normalize_values_meanstd(action[joint_group], params)
    else:
        normalized = normalize_values_minmax(action[joint_group], params)
    if self.clip_outliers:
        normalized = np.clip(normalized, -1.0, 1.0)
    normalized_values[joint_group] = normalized
The actions returned by StateActionProcessor are still a dictionary organized by group. Gr00tN1d7Processor concatenates them into one action matrix in modality_keys order:
action_keys = self.modality_configs[embodiment_tag.value]["action"].modality_keys
normalized_actions = torch.cat(
    [torch.from_numpy(normalized_actions[key]) for key in action_keys],
    dim=-1,
)  # (t, d)
action_dim = normalized_actions.shape[1]
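Action dimension and action length can differ across robots, so the matrix is then padded to max_action_dim and max_action_horizon, starting with the action dimension:
normalized_actions = torch.cat(
    [
        normalized_actions,
        torch.zeros(
            normalized_actions.shape[0],
            self.max_action_dim - normalized_actions.shape[1],
        ),
    ],
    dim=-1,
)  # (t, max_action_dim)
Then the action length is padded:
normalized_actions = torch.cat(
    [
        normalized_actions,
        torch.zeros(
            self.max_action_horizon - normalized_actions.shape[0],
            self.max_action_dim,
        ),
    ],
    dim=0,
)  # (max_action_horizon, max_action_dim)
Finally, action_mask is generated:
action_mask = torch.ones_like(normalized_actions)
action_mask[action_horizon:] = 0
action_mask[:, action_dim:] = 0
action_mask[action_horizon:] = 0 means that time positions beyond the real action chunk length do not contribute to the loss; action_mask[:, action_dim:] = 0 means that padded dimensions beyond the real action dimension do not contribute to the loss. This lets different robots be mixed into the same batch, and the action head does not need to change its output dimension per robot.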
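The padding and masking steps can be reproduced with nested lists. In the sketch below, pad_actions is a hypothetical helper that mirrors the two concatenations and the mask construction; masked_mse is a generic masked mean squared error used only to show that padded entries drop out of a loss, and is not the Flow Matching objective the model actually trains with.

```python
def pad_actions(actions, max_horizon, max_dim):
    """Zero-pad a (horizon, dim) action list and build the matching 0/1 mask."""
    horizon, dim = len(actions), len(actions[0])
    padded = [row + [0.0] * (max_dim - dim) for row in actions]
    padded += [[0.0] * max_dim for _ in range(max_horizon - horizon)]
    mask = [
        [1.0 if t < horizon and d < dim else 0.0 for d in range(max_dim)]
        for t in range(max_horizon)
    ]
    return padded, mask


def masked_mse(pred, target, mask):
    """Mean squared error over the entries where mask is 1 (illustrative loss)."""
    total, count = 0.0, 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            total += m * (p - t) ** 2
            count += m
    return total / count


actions = [[0.5, -0.5], [1.0, 0.0]]  # real chunk: horizon 2, dimension 2
padded, mask = pad_actions(actions, max_horizon=3, max_dim=4)
```

Whatever a model predicts in the padded rows and columns has no effect on the masked loss; only the 2 x 2 block of real entries is compared.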
5. Inference reuses the training-time Processor configuration for input and output conversion
In training the inputs come from the Dataset; at inference they come from environment observations. N1.7 does not write an entirely separate preprocessing path for inference; instead, Gr00tN1d7Processor provides process_observation().
The function starts by looking up the modality config for the robot type:
def process_observation(self, observation: dict[str, Any], embodiment_tag: EmbodimentTag):
    """Process batched observation tensors for inference."""
    modality_config = self.modality_configs[embodiment_tag.value]
    transformed_observation = {}
State processing is similar to training, except that the input fields carry a state. prefix:
state_keys = modality_config["state"].modality_keys
state_data = {key: observation[f"state.{key}"] for key in state_keys}
norm_state_dict = self.state_action_processor.apply_state(
    state=state_data, embodiment_tag=embodiment_tag.value
)
normalized_states = torch.cat(
    [torch.from_numpy(norm_state_dict[key]) for key in state_keys], dim=-1
)
The result is then padded to max_state_dim:
padding_shape = (
    *normalized_states.shape[:-1],
    self.max_state_dim - normalized_states.shape[-1],
)
normalized_states = torch.cat([normalized_states, torch.zeros(padding_shape)], dim=-1)
transformed_observation["state"] = normalized_states
Image inputs are likewise stacked in the view order given by the config and passed through the same Qwen3-VL Processor. Finally, embodiment_id is generated:
embodiment_id = (
    torch.ones(B, dtype=torch.int32) * self.embodiment_id_mapping[embodiment_tag.value]
)
transformed_observation["embodiment_id"] = embodiment_id
The model outputs padded, normalized actions, which cannot be sent to the robot directly. decode_action() slices the action back into its groups and calls StateActionProcessor.unapply_action() to denormalize and restore relative actions:
def decode_action(
    self,
    action: np.ndarray,
    embodiment_tag: EmbodimentTag,
    state: dict[str, np.ndarray] | None = None,
):
    """Undo action normalization and convert relative actions to absolute."""
    out_dict = {}
    start_idx = 0
    joint_groups = self.modality_configs[embodiment_tag.value]["action"].modality_keys
    action_horizon = len(self.modality_configs[embodiment_tag.value]["action"].delta_indices)
    for key in joint_groups:
        joint_dim = self.state_action_processor.norm_params[embodiment_tag.value]["action"][
            key
        ]["dim"].item()
        out_dict[key] = action[..., :action_horizon, start_idx : start_idx + joint_dim]
        start_idx += joint_dim
    return self.state_action_processor.unapply_action(
        out_dict, embodiment_tag.value, state=state
    )
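The slicing part of decode_action() is the inverse of the earlier concatenation and padding. A minimal sketch with nested lists is shown below; split_action and the group_dims dictionary are hypothetical, and the group order relies on dictionary insertion order in the same way the real code relies on modality_keys order.

```python
def split_action(action, group_dims, action_horizon):
    """Slice a padded (max_horizon, max_dim) action back into per-group chunks."""
    out, start = {}, 0
    for key, dim in group_dims.items():
        # Keep only the real time steps and this group's columns.
        out[key] = [row[start:start + dim] for row in action[:action_horizon]]
        start += dim
    return out


padded_action = [
    [1.0, 2.0, 3.0, 0.0],
    [4.0, 5.0, 6.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # padded time step
]
decoded = split_action(padded_action, {"joint_position": 2, "gripper": 1}, action_horizon=2)
```

The padded time step and the padded fourth column are simply never read, so they need no special handling at decode time.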
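unapply_action() runs in the reverse order of training: it denormalizes first and then converts relative actions back to absolute ones.
def unapply_action(
    self,
    action: dict[str, np.ndarray],
    embodiment_tag: str,
    state: dict[str, np.ndarray] | None = None,
) -> dict[str, np.ndarray]:
    """
    Reverse action processing (denormalization, relative->absolute conversion).

    Processing order:
    1. Denormalize actions
    2. Convert relative actions to absolute (if configured)
    """
This is also why inference must keep the current raw state. If an action group is relative, the model output only expresses a change, and without the current state it cannot be restored to the absolute action the robot controller needs.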
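The mirrored ordering can be checked on a single scalar joint value. The sketch below is a simplified model under stated assumptions: encode and decode are hypothetical functions, normalization is assumed to be a min/max mapping onto [-1, 1], and lo and hi stand for made-up statistics of the relative action.

```python
def encode(action, state, lo, hi):
    """Training direction: absolute -> relative, then normalize to [-1, 1]."""
    relative = action - state
    return 2.0 * (relative - lo) / (hi - lo) - 1.0


def decode(normalized, state, lo, hi):
    """Inference direction: denormalize, then relative -> absolute."""
    relative = (normalized + 1.0) / 2.0 * (hi - lo) + lo
    return relative + state


current_state = 0.5
target_action = 0.75
model_value = encode(target_action, current_state, lo=-1.0, hi=1.0)
recovered = decode(model_value, current_state, lo=-1.0, hi=1.0)
wrong_reference = decode(model_value, 0.0, lo=-1.0, hi=1.0)
```

Decoding with the correct state recovers the original target, while decoding with a different reference yields a different absolute command, which is the failure mode a missing or stale state would cause.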
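6. Integrating a custom robot is mainly a matter of checking the data-layer configuration
After reading this part of the source, it is clear that integrating a custom robot into GR00T N1.7 mostly requires changes in the data layer rather than inside the action head.
Step one is to prepare data in LeRobot format, containing at least low-dimensional states, actions, videos, task language, and the standard files under meta/. stats.json must exist; otherwise LeRobotEpisodeLoader fails at initialization. If training uses relative actions, also confirm that relative_stats.json is consistent with the action representation.
Step two is to write modality.json, splitting the raw state and action vectors into physically meaningful groups. For example, a 7-axis arm plus a gripper can be split as follows: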
{
"state": {
"joint_position": {
"start": 0,
"end": 7
},
"gripper": {
"start": 7,
"end": 8
}
},
"action": {
"joint_position": {
"start": 0,
"end": 7
},
"gripper": {
"start": 7,
"end": 8
}
},
"video": {
"front": {
"original_key": "observation.images.front"
},
"wrist": {
"original_key": "observation.images.wrist"
}
},
"annotation": {
"human.task_description": {
"original_key": "task_index"
}
}
}
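Step three is to confirm ModalityConfig.delta_indices. If the model should predict 16 action steps at a time, the action config can look like this:
"action": ModalityConfig(
    delta_indices=list(range(16)),
    modality_keys=["joint_position", "gripper"],
)
To use past images, the video config can be:
"video": ModalityConfig(
    delta_indices=[-15, 0],
    modality_keys=["front", "wrist"],
)
Step four is to configure the action representation. Joint positions can use relative actions, while the gripper is usually better kept absolute:
action_configs=[
    ActionConfig(
        rep=ActionRepresentation.RELATIVE,
        type=ActionType.NON_EEF,
        format=ActionFormat.DEFAULT,
        state_key="joint_position",
    ),
    ActionConfig(
        rep=ActionRepresentation.ABSOLUTE,
        type=ActionType.NON_EEF,
        format=ActionFormat.DEFAULT,
        state_key="gripper",
    ),
]
Step five is to confirm that the inference observation provides the raw state. Decoding relative actions requires the current state, so if the environment observation lacks the corresponding state.xxx fields, decode_action() will eventually fail inside unapply_action().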
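A cheap way to catch the step-five problem early is to validate the observation before it reaches the policy. The check below is a hypothetical helper, not part of GR00T; it only assumes the state.<key> naming convention used by process_observation().

```python
def missing_state_fields(observation, state_keys):
    """List the state.<key> fields that the observation does not provide."""
    return [f"state.{key}" for key in state_keys if f"state.{key}" not in observation]


observation = {
    "state.joint_position": [0.0] * 7,  # made-up values
    "video.front": "frame placeholder",
}
missing = missing_state_fields(observation, ["joint_position", "gripper"])
```

Running this once per environment reset, and raising if the returned list is non-empty, surfaces a mismatched observation schema before any action is decoded.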