GR00T N1.7 Source Code Study (2): Training Data, Processor, and Multi-Robot Action Space

Previous post: GR00T N1.7 Source Code Study (1): Project Entry Points, Model Structure, and the Action Generation Flow

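The previous post walked through GR00T N1.7's main model, action head, Flow Matching training objective, and the action generation process at inference time. This post follows a raw robot trajectory as it is read from a LeRobot dataset directory and turned, step by step, into the input dictionary that Gr00tN1d7.forward() can consume directly. The relevant source files are: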

gr00t/data/dataset/lerobot_episode_loader.py
gr00t/data/dataset/sharded_single_step_dataset.py
gr00t/data/types.py
gr00t/data/state_action/state_action_processor.py
gr00t/model/gr00t_n1d7/processing_gr00t_n1d7.py
gr00t/configs/data/embodiment_configs.py
examples/SO100/modality.json
examples/LIBERO/modality.json

1. The LeRobot directory, modality.json, and ModalityConfig jointly define data semantics

GR00T N1.7 does not define a wholly separate data format; it reuses the LeRobot-style dataset directory. A typical dataset directory looks roughly like this:

dataset_root/
├── data/
├── videos/
└── meta/
    ├── info.json
    ├── episodes.jsonl
    ├── tasks.jsonl
    ├── modality.json
    ├── stats.json
    └── relative_stats.json

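The data directory usually holds parquet files that store low-dimensional states, actions, timestamps, task indices, and similar fields; videos holds the camera videos; meta holds the files that describe the dataset as a whole. GR00T's own loading logic is concentrated in LeRobotEpisodeLoader, whose source starts by defining the standard file names as constants: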

LEROBOT_META_DIR_NAME = "meta"
LEROBOT_INFO_FILENAME = "info.json"
LEROBOT_EPISODES_FILENAME = "episodes.jsonl"
LEROBOT_TASKS_FILENAME = "tasks.jsonl"
LEROBOT_MODALITY_FILENAME = "modality.json"
LEROBOT_STATS_FILE_NAME = "stats.json"
LEROBOT_RELATIVE_STATS_FILE_NAME = "relative_stats.json"

ALLOWED_MODALITIES = ["video", "state", "action", "language", "mask"]
DEFAULT_COLUMN_NAMES = {
    "state": "observation.state",
    "action": "action",
}

LANG_KEYS = ["task", "sub_task"]

Of these, stats.json is a hard dependency; if it is missing, loading fails immediately on an assertion:

stats_path = meta_dir / LEROBOT_STATS_FILE_NAME
assert stats_path.exists(), (
    f"{stats_path} does not exist for {self.dataset_path}, please use gr00t/data/stats.py to generate it"
)
with open(stats_path, "r") as f:
    self.stats = json.load(f)

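In other words, state and action statistics must be prepared before GR00T training. States and actions are normalized before they enter the model, and the model's output actions are denormalized afterwards, so without these statistics the downstream Processor cannot work correctly.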

If the dataset also contains relative_stats.json, the loader reads it as well and stores it under self.stats["relative_action"]:

relative_stats_path = meta_dir / LEROBOT_RELATIVE_STATS_FILE_NAME
if relative_stats_path.exists():
    with open(relative_stats_path, "r") as f:
        relative_stats = json.load(f)
    relative_stats.pop("__fingerprints__", None)
    self.stats["relative_action"] = relative_stats

modality.json maps the raw data fields onto the modalities the model uses. Take examples/SO100/modality.json as an example:

{
    "state": {
        "single_arm": {
            "start": 0,
            "end": 5
        },
        "gripper": {
            "start": 5,
            "end": 6
        }
    },
    "action": {
        "single_arm": {
            "start": 0,
            "end": 5
        },
        "gripper": {
            "start": 5,
            "end": 6
        }
    },
    "video": {
        "front": {
            "original_key": "observation.images.front"
        },
        "wrist": {
            "original_key": "observation.images.wrist"
        }
    },
    "annotation": {
        "human.task_description": {
            "original_key": "task_index"
        }
    }
}

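This file splits the raw state vector into two groups, single_arm and gripper, and splits the action vector the same way; images are read from observation.images.front and observation.images.wrist and renamed to front and wrist. The names state.single_arm, state.gripper, video.front, and video.wrist that appear in later code all come from this configuration.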
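
As a minimal sketch of how start and end turn a flat vector into named groups, the snippet below assumes the ranges are half-open, [start, end), which is what the SO100 file implies (0 to 5 plus 5 to 6 covers a 6-dimensional vector). The helper split_by_modality and the sample values are hypothetical and are not part of GR00T.

```python
import json

# Hypothetical excerpt mirroring the "state" section of examples/SO100/modality.json.
MODALITY_JSON = """
{
    "state": {
        "single_arm": {"start": 0, "end": 5},
        "gripper": {"start": 5, "end": 6}
    }
}
"""


def split_by_modality(vector, groups):
    """Slice a flat vector into named groups, assuming half-open [start, end) ranges."""
    return {name: vector[spec["start"]:spec["end"]] for name, spec in groups.items()}


modality = json.loads(MODALITY_JSON)
raw_state = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]  # made-up 6-D observation.state row
state_groups = split_by_modality(raw_state, modality["state"])

# Prefix the group names the way later code refers to them.
prefixed = {f"state.{name}": values for name, values in state_groups.items()}
print(prefixed)
```

The same slicing applies to the action vector, which is why the state and action sections of the file can share group names.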

modality.json lives on the dataset side, whereas training actually works with ModalityConfig, defined in gr00t/data/types.py:

@dataclass
class ModalityConfig:
    """Configuration for a modality defining how data should be sampled and loaded."""

    delta_indices: list[int]
    """Delta indices to sample relative to the current index."""

    modality_keys: list[str]
    """The keys to load for the modality in the dataset."""

    sin_cos_embedding_keys: list[str] | None = None
    mean_std_embedding_keys: list[str] | None = None
    action_configs: list[ActionConfig] | None = None

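delta_indices lists which time positions to take relative to the current step. For example, a video configuration of [-15, 0] takes the frame 15 steps in the past plus the current frame; a state configuration of [0] takes only the current state; and an action configuration of list(range(16)) takes the 16 actions starting at the current step, which together form the action chunk.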

The built-in robot configurations live in gr00t/configs/data/embodiment_configs.py. Take oxe_droid_relative_eef_relative_joint as an example:

"oxe_droid_relative_eef_relative_joint": {
    "video": ModalityConfig(
        delta_indices=[-15, 0],
        modality_keys=["exterior_image_1_left", "wrist_image_left"],
    ),
    "state": ModalityConfig(
        delta_indices=[0],
        modality_keys=["eef_9d", "gripper_position", "joint_position"],
    ),
    "action": ModalityConfig(
        delta_indices=list(range(40)),
        modality_keys=["eef_9d", "gripper_position", "joint_position"],
        action_configs=[
            ActionConfig(
                rep=ActionRepresentation.RELATIVE,
                type=ActionType.EEF,
                format=ActionFormat.XYZ_ROT6D,
                state_key="eef_9d",
            ),
            ActionConfig(
                rep=ActionRepresentation.ABSOLUTE,
                type=ActionType.NON_EEF,
                format=ActionFormat.DEFAULT,
                state_key="gripper_position",
            ),
            ActionConfig(
                rep=ActionRepresentation.RELATIVE,
                type=ActionType.NON_EEF,
                format=ActionFormat.DEFAULT,
                state_key="joint_position",
            ),
        ],
    ),
    "language": ModalityConfig(
        delta_indices=[0],
        modality_keys=["annotation.language.language_instruction"],
    ),
}

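Note that use_relative_action=True does not make every action relative. Whether a group is converted also depends on that group's own ActionConfig. Here eef_9d is a relative end-effector action, joint_position is a relative joint action, and gripper_position stays absolute.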
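
The interaction between the global flag and the per-group setting can be sketched as follows. Rep and GroupConfig are simplified, hypothetical stand-ins for ActionRepresentation and ActionConfig, and groups_to_convert is an illustrative helper rather than a GR00T function; the condition it applies is the same "global flag and per-group RELATIVE" test that apply_action() uses.

```python
from dataclasses import dataclass
from enum import Enum


class Rep(Enum):  # hypothetical stand-in for ActionRepresentation
    RELATIVE = "relative"
    ABSOLUTE = "absolute"


@dataclass
class GroupConfig:  # hypothetical stand-in for ActionConfig
    rep: Rep
    state_key: str


def groups_to_convert(modality_keys, action_configs, use_relative_action):
    """Return the action groups that would be converted to relative form."""
    return [
        key
        for key, cfg in zip(modality_keys, action_configs)
        if use_relative_action and cfg.rep is Rep.RELATIVE
    ]


keys = ["eef_9d", "gripper_position", "joint_position"]
configs = [
    GroupConfig(Rep.RELATIVE, "eef_9d"),
    GroupConfig(Rep.ABSOLUTE, "gripper_position"),
    GroupConfig(Rep.RELATIVE, "joint_position"),
]

converted_on = groups_to_convert(keys, configs, use_relative_action=True)
converted_off = groups_to_convert(keys, configs, use_relative_action=False)
print(converted_on, converted_off)
```

With the flag on, only the two RELATIVE groups are selected; with the flag off, nothing is converted regardless of the per-group setting.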

2. The Dataset splits full episodes into training samples

LeRobotEpisodeLoader reads complete trajectories one episode at a time. Its __getitem__ returns not a single training sample but a DataFrame covering the entire episode:

def __getitem__(self, idx: int) -> pd.DataFrame:
    if idx < 0 or idx >= len(self):
        raise IndexError(f"Episode index {idx} out of bounds")

    episode_meta = self.episodes_metadata[idx]
    episode_id = episode_meta["episode_index"]
    nominal_length = episode_meta["length"]

    # Load and parse the parquet data
    df = self._load_parquet_data(episode_id)

    if "language" in self.modality_configs:
        lang_key = self.modality_configs["language"].modality_keys[0]
        if lang_key in LANG_KEYS:
            new_languages = self.create_language_from_meta(episode_meta, len(df), lang_key)
            df["language." + lang_key] = new_languages

    actual_length = min(len(df), nominal_length)
    df = df.iloc[:actual_length]

    # Load synchronized video data
    video_data = self._load_video_data(episode_id, np.arange(actual_length))

    for key in video_data.keys():
        assert len(video_data[key]) == len(df), (
            f"Video data for {key} has length {len(video_data[key])} but dataframe has length {len(df)}"
        )
        df[f"video.{key}"] = [frame for frame in video_data[key]]

    return df

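The parquet file usually already contains states and actions, while images are decoded from the video files by _load_video_data and written into the DataFrame. Language fields are also unified here: if the configured key is task or sub_task, the code generates the language text for every frame from the metadata. The resulting DataFrame columns look roughly like this: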

state.xxx
action.xxx
video.front
video.wrist
language.task
mask.xxx

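A full episode still cannot be fed to the model directly; ShardedSingleStepDataset later splits it into training samples anchored at individual steps. The class is defined in: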

gr00t/data/dataset/sharded_single_step_dataset.py

At initialization it computes the action horizon from the action configuration:

action_delta_indices = modality_configs["action"].delta_indices
self.action_horizon = max(action_delta_indices) - min(action_delta_indices) + 1

The effective episode length excludes the trailing steps that cannot fill a complete future action chunk:

def get_effective_episode_length(self, episode_index: int) -> int:
    original_length = self.episode_loader.get_episode_length(episode_index)
    return max(0, original_length - self.action_horizon + 1)

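Suppose an episode has length 100 and the action chunk has length 16. Then at most 100 - 16 + 1 = 85 steps (steps 0 through 84) can serve as training start points, because from step 85 onward fewer than 16 future actions remain. To fetch a single training sample, get_datapoint calls extract_step_data: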

def get_datapoint(self, episode_data: pd.DataFrame, step_index: int) -> dict:
    assert self.processor is not None, "Processor must be set before getting datapoints"
    vla_step_data = extract_step_data(
        episode_data,
        step_index,
        self.modality_configs,
        self.embodiment_tag,
        self.allow_padding,
    )
    messages = [{"type": MessageType.EPISODE_STEP.value, "content": vla_step_data}]
    return self.processor(messages)

extract_step_data gathers the data for each modality according to its delta_indices:

def extract_step_data(
    episode_data: pd.DataFrame,
    step_index: int,
    modality_configs: dict[str, ModalityConfig],
    embodiment_tag: EmbodimentTag,
    allow_padding: bool = False,
) -> VLAStepData:
    step_data = {}

    for modality, config in modality_configs.items():
        step_data[modality] = {}
        indices_to_load = [step_index + delta_index for delta_index in config.delta_indices]
        if allow_padding:
            indices_to_load = [max(0, min(idx, len(episode_data) - 1)) for idx in indices_to_load]

        for key in config.modality_keys:
            if f"{modality}.{key}" in episode_data.columns:
                modality_data = episode_data[f"{modality}.{key}"].iloc[indices_to_load]
            else:
                raise KeyError(
                    f"{modality}.{key} not found in episode data, available keys: {episode_data.columns}"
                )

            if modality in ["state", "action"]:
                step_data[modality][key] = np.vstack(
                    [
                        np.array(modality_data.iloc[i]).astype(np.float32)
                        for i in range(len(modality_data))
                    ]
                )
            else:
                step_data[modality][key] = modality_data.tolist()

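For example, if the current step is 100 and the action delta_indices is list(range(16)), actions 100 through 115 are read, 16 in total; with video delta_indices of [-15, 0], frames 85 and 100 are read; and with state delta_indices of [0], only the current state is taken.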
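
The windowing arithmetic of this section can be reproduced in a few lines. The two functions below are standalone re-implementations written for illustration (they mirror get_effective_episode_length and the index computation in extract_step_data but are not the GR00T code itself), and the episode length of 200 is an assumed value chosen so that step 100 has a full window on both sides.

```python
def effective_episode_length(episode_length, action_horizon):
    """Number of steps that still have a full action chunk ahead of them."""
    return max(0, episode_length - action_horizon + 1)


def indices_to_load(step_index, delta_indices, episode_length, allow_padding=False):
    """Absolute frame indices for one modality at a given step."""
    indices = [step_index + delta for delta in delta_indices]
    if allow_padding:
        # Clamp out-of-range indices to the first or last frame of the episode.
        indices = [max(0, min(idx, episode_length - 1)) for idx in indices]
    return indices


valid_starts = effective_episode_length(100, 16)
video_idx = indices_to_load(100, [-15, 0], episode_length=200)
action_idx = indices_to_load(100, list(range(16)), episode_length=200)

# Near the start of an episode the past frame does not exist yet;
# with allow_padding=True the index is clamped to frame 0.
early_video_idx = indices_to_load(5, [-15, 0], episode_length=200, allow_padding=True)

print(valid_starts, video_idx, action_idx[0], action_idx[-1], early_video_idx)
```

Without allow_padding, the early-step case would produce the index -10, which pandas .iloc interprets as counting from the end of the episode, so the clamping matters for samples near episode boundaries.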

Finally the data is wrapped in a VLAStepData:

vla_step_data = VLAStepData(
    images=video_data,
    masks=mask_data if mask_data else None,
    states=state_data,
    actions=action_data,
    text=text,
    embodiment=embodiment_tag,
)
return vla_step_data

VLAStepData is the intermediate structure between the Dataset and the Processor:

@dataclass
class VLAStepData:
    """
    Represents a single step of VLA (Vision-Language-Action) data.
    """

    images: dict[str, list[np.ndarray]]
    states: dict[str, np.ndarray]
    actions: dict[str, np.ndarray]
    masks: dict[str, list[np.ndarray]] | None = None
    text: str | None = None
    embodiment: EmbodimentTag = EmbodimentTag.NEW_EMBODIMENT
    is_demonstration: bool = False
    metadata: dict[str, Any] = field(default_factory=dict)

3. Gr00tN1d7Processor unifies state/action, image/language, and embodiment inputs

Gr00tN1d7Processor is defined in:

gr00t/model/gr00t_n1d7/processing_gr00t_n1d7.py

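It turns a VLAStepData into model inputs. At initialization it stores the modality configuration, the statistics, the image processing configuration, the VLM Processor configuration, and the embodiment_id mapping: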

class Gr00tN1d7Processor(BaseProcessor):
    data_collator_class = Gr00tN1d7DataCollator

    def __init__(
        self,
        modality_configs: dict[str, dict[str, ModalityConfig]],
        statistics: dict[str, dict[str, dict[str, dict[str, list[float]]]]] | None = None,
        use_percentiles: bool = False,
        clip_outliers: bool = True,
        model_name: str = "nvidia/Cosmos-Reason2-2B",
        model_type: str = "qwen",
        max_state_dim: int = 29,
        max_action_dim: int = 29,
        max_action_horizon: int = 50,
        use_relative_action: bool = False,
        embodiment_id_mapping: dict[str, int] | None = None,
        exclude_state: bool = False,
        state_dropout_prob: float = 0.0,
        use_mean_std: bool = False,
        ...
    ):
        self.modality_configs = parse_modality_configs(modality_configs)

State and action processing is not written directly in the Processor; it is delegated to StateActionProcessor:

self.state_action_processor = StateActionProcessor(
    modality_configs=modality_configs,
    statistics=statistics,
    use_percentiles=use_percentiles,
    clip_outliers=clip_outliers,
    apply_sincos_state_encoding=apply_sincos_state_encoding,
    use_relative_action=use_relative_action,
)

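Gr00tN1d7Processor handles the complete multimodal sample, including images, language, states, actions, masks, and the VLM input; StateActionProcessor is dedicated to low-dimensional states and actions, covering normalization, relative action conversion, sin/cos encoding, and denormalization.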

The Processor also creates a Qwen3-VL Processor internally:

self.processor = build_processor(model_name, transformers_loading_kwargs)
self.processor.tokenizer.padding_side = "left"

The training entry point __call__() first unpacks the VLAStepData and then hands the states and actions to StateActionProcessor:

def __call__(
    self,
    messages: list[dict[str, Any]],
):
    assert len(messages) == 1
    content = messages[0]["content"]
    embodiment_tag = content.embodiment
    action_data = content.actions
    state_data = content.states

    norm_state_dict, normalized_actions = self.state_action_processor.apply(
        state=state_data,
        action=action_data,
        embodiment_tag=embodiment_tag.value,
    )

StateActionProcessor.apply() processes the state first and the action second:

def apply(
    self,
    state: dict[str, np.ndarray],
    action: dict[str, np.ndarray],
    embodiment_tag: str,
) -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]]:
    """
    Apply both state and action processing together.
    """
    processed_state = self.apply_state(state, embodiment_tag)
    if action:
        processed_action = self.apply_action(action, embodiment_tag, state=state)
    else:
        assert not self.training, "Action is required in training mode"
        processed_action = {}
    return processed_state, processed_action

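Relative action conversion needs the raw state as its reference. Computing relative actions against an already-normalized state would change their physical meaning, so the action processing function receives the raw state (note state=state in the apply_action call above), not the normalized one.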

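Images and language are also arranged into VLM input inside the Processor. The image transform, including training-time augmentation, is selected according to self.training: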

if self.training:
    image_transform = self.train_image_transform
else:
    image_transform = self.eval_image_transform

image_keys = self.modality_configs[embodiment_tag.value]["video"].modality_keys

By default, the language instruction is lowercased and stripped of punctuation:

if self.formalize_language:
    language = content.text.lower()
    language = re.sub(r"[^\w\s]", "", language)
else:
    language = content.text
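
To make the effect of this cleanup concrete, the sketch below applies the same lowercase-then-regex steps to two made-up instructions. The wrapper function formalize is hypothetical; the regular expression is the one from the excerpt above.

```python
import re


def formalize(text):
    """Lowercase the text and drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())


cleaned = formalize("Pick up the RED block, then place it in the bowl!")
hyphenated = formalize("Open the top-left drawer.")

print(cleaned)     # pick up the red block then place it in the bowl
print(hyphenated)  # open the topleft drawer
```

The second case shows a side effect worth knowing when writing task descriptions: hyphens are removed without inserting a space, so hyphenated words are fused.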

Images are first stacked along time for each view:

for view in image_keys:
    assert view in images, f"{view} not in {images}"
    temporal_stacked_images[view] = torch.stack(
        [image_transform(img) for img in images[view]]
    )  # (T, C, H, W)

The time and camera dimensions are then flattened before the images are handed to the VLM Processor:

stacked_images = (
    torch.stack([temporal_stacked_images[view] for view in image_keys], dim=1)
    .flatten(0, 1)
    .numpy()
)  # (T*V, C, H, W)

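If a sample has 2 time points and 2 cameras, 4 images are ultimately passed to the VLM Processor. No explicit structural field is attached to each image; time and camera information is conveyed mainly by the order of the images and by how the prompt is organized.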
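
Since the ordering carries the time and camera information, it helps to see what order the stack-then-flatten step produces. The sketch below uses NumPy instead of torch and replaces each image by a tiny (time index, view index) tag; the sizes and view names are illustrative assumptions.

```python
import numpy as np

T, V = 2, 2  # two time steps, two camera views (illustrative sizes)
views = ["front", "wrist"]

# Replace every "image" by a (time index, view index) tag so the ordering is visible.
per_view = {
    name: np.array([[t, v] for t in range(T)])  # shape (T, 2)
    for v, name in enumerate(views)
}

# Stack views on axis 1, then merge the first two axes.
stacked = np.stack([per_view[name] for name in views], axis=1)  # (T, V, 2)
flat = stacked.reshape(T * V, 2)  # row-major merge, same order as torch's flatten(0, 1)

order = [(int(t), views[int(v)]) for t, v in flat]
print(order)
```

The result is time-major: all views of the earliest time step come first, followed by all views of the next one, so image k corresponds to time k // V and view k % V.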

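At the batch level, the component that actually calls the Qwen3-VL Processor is Gr00tN1d7DataCollator. The VLM part returned by the per-sample Processor is vlm_content, from which the collator collects the text and images: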

if key == "vlm_content":
    text_list = []
    image_inputs = []
    for v in values:
        curr_text_list = [v["text"]]
        text_list += curr_text_list
        curr_image_inputs = v["images"]
        image_inputs += curr_image_inputs

    vlm_inputs = self.processor(
        text=text_list,
        images=image_inputs,
        return_tensors="pt",
        padding=True,
    )
    for k, v in vlm_inputs.items():
        batch[k] = v

Low-dimensional states, actions, and masks were already padded to fixed shapes at the per-sample stage, so they can simply be stacked:

else:
    batch[key] = torch.from_numpy(np.stack(values))
return BatchFeature(data={"inputs": batch})

The input the model finally receives looks roughly like this:

{
    "inputs": {
        "input_ids": ...,
        "attention_mask": ...,
        "pixel_values": ...,
        "image_grid_thw": ...,
        "state": ...,
        "action": ...,
        "action_mask": ...,
        "embodiment_id": ...,
    }
}

4. StateActionProcessor handles normalization, relative actions, and masks

StateActionProcessor is the key module for low-dimensional data. Its responsibilities are spelled out in the class docstring:

class StateActionProcessor:
    """
    Unified processor for robot state and action data.

    Handles:
    - State normalization (min/max, mean/std, sin/cos encoding)
    - Action normalization
    - Absolute <-> Relative action representation conversion
    - Action processing with state dependency
    """

States are processed by apply_state(). It iterates over every state group in the current robot's configuration and picks a normalization method for each group:

def apply_state(
    self,
    state: dict[str, np.ndarray],
    embodiment_tag: str,
) -> dict[str, np.ndarray]:
    normalized_values = {}
    state_config = self.modality_configs[embodiment_tag]["state"]

    sin_cos_keys = set()
    if self.apply_sincos_state_encoding and hasattr(state_config, "sin_cos_embedding_keys"):
        sin_cos_keys = set(state_config.sin_cos_embedding_keys)

    for joint_group in state_config.modality_keys:
        if joint_group not in state:
            raise KeyError(
                f"Joint group '{joint_group}' not found in state dict for embodiment '{embodiment_tag}'"
            )

        params = self.norm_params[embodiment_tag]["state"][joint_group]

        if sin_cos_keys and joint_group in sin_cos_keys:
            normalized = apply_sin_cos_encoding(state[joint_group], params)
        elif (
            state_config.mean_std_embedding_keys is not None
            and joint_group in state_config.mean_std_embedding_keys
        ):
            normalized = normalize_values_meanstd(state[joint_group], params)
        else:
            normalized = normalize_values_minmax(state[joint_group], params)

        normalized_values[joint_group] = normalized

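N1.7 does not apply a single standardization formula to every state; it supports a different treatment per group. Continuous low-dimensional states can use min/max, some embedding-like states can use mean/std, and periodic quantities such as joint angles can use sin/cos encoding.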
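
The three treatments can be illustrated with the textbook formulas below. This is a sketch under stated assumptions, not the GR00T implementation: it assumes min/max maps onto [-1, 1] (consistent with the clip range used for actions), that mean/std is a plain z-score, and that sin/cos encoding concatenates the sine and cosine of each angle. The function names are hypothetical and deliberately differ from normalize_values_minmax, normalize_values_meanstd, and apply_sin_cos_encoding, whose exact behavior (for example epsilon handling) is not shown in this post.

```python
import numpy as np


def minmax_normalize(x, lo, hi):
    """Map [lo, hi] linearly onto [-1, 1] (assumed target range)."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0


def minmax_denormalize(y, lo, hi):
    """Inverse of minmax_normalize."""
    return (y + 1.0) / 2.0 * (hi - lo) + lo


def meanstd_normalize(x, mean, std):
    """Plain z-score."""
    return (x - mean) / std


def sincos_encode(angles):
    """Replace each angle by its sine and cosine, doubling the last dimension."""
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


lo, hi = np.array([-1.0, 0.0]), np.array([1.0, 4.0])  # made-up per-dimension statistics
x = np.array([[0.5, 1.0]])

y = minmax_normalize(x, lo, hi)
restored = minmax_denormalize(y, lo, hi)
z = meanstd_normalize(np.array([3.0]), mean=1.0, std=2.0)

# pi and -pi are the same joint angle; their encodings coincide up to rounding.
wrapped = sincos_encode(np.array([[np.pi, -np.pi]]))
print(y, restored, z, wrapped.shape)
```

The last line illustrates why sin/cos suits periodic variables: two numerically distant values that describe the same angle end up with practically identical features, which a linear min/max mapping cannot achieve.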

Actions are processed by apply_action(), which converts to relative actions first and normalizes second:

def apply_action(
    self,
    action: dict[str, np.ndarray],
    embodiment_tag: str,
    state: dict[str, np.ndarray] | None = None,
) -> dict[str, np.ndarray]:
    """
    Apply action processing (absolute->relative conversion, normalization).

    Processing order:
    1. Convert absolute actions to relative (if configured)
    2. Normalize actions
    """

The source walks through action_configs group by group to decide which actions to convert to relative form:

modality_keys = self.modality_configs[embodiment_tag]["action"].modality_keys
action_configs = self.modality_configs[embodiment_tag]["action"].action_configs

if action_configs is not None:
    for key, action_config in zip(modality_keys, action_configs):
        if action_config.rep == ActionRepresentation.RELATIVE and self.use_relative_action:
            if state is None:
                raise ValueError(
                    f"State dict required for relative action processing of key '{key}' "
                    f"in embodiment '{embodiment_tag}'"
                )

            state_key = action_config.state_key if action_config.state_key else key

            if state_key not in state:
                raise KeyError(
                    f"Reference state key '{state_key}' not found in state dict "
                    f"for embodiment '{embodiment_tag}'"
                )

            reference_state = state[state_key][-1]

            action[key] = self._convert_to_relative_action(
                action=action[key],
                reference_state=reference_state,
                action_type=action_config.type,
                action_format=action_config.format,
            )

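The reference is state[state_key][-1], the last frame of the state history. The action chunk in a training sample is usually a stretch of future actions starting at the current moment, so relative actions are naturally computed with respect to the current state.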

The underlying conversion is not simply action - reference_state; it distinguishes end-effector actions from non-end-effector actions:

if action_type == ActionType.EEF:
    action_chunking = EndEffectorActionChunk.from_array(action, action_format)
    reference_frame = EndEffectorPose.from_action_format(reference_state, action_format)

elif action_type == ActionType.NON_EEF:
    action_chunking = JointActionChunk([JointPose(m) for m in action])
    reference_frame = JointPose(reference_state)

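Joint actions can be treated fairly directly as a vector difference, but end-effector poses contain rotations, and in formats such as XYZ_ROT6D you cannot simply subtract dimension by dimension. The source therefore uses EndEffectorActionChunk and EndEffectorPose for EEF actions, and JointActionChunk and JointPose for non-EEF actions.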
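
For the non-EEF case, the vector-difference view can be sketched as below. This assumes the joint conversion reduces to a per-step subtraction of the reference state, as the paragraph above suggests; the internals of JointPose and JointActionChunk are not shown in this post, and the EEF case needs proper pose composition that this sketch does not attempt. The functions and values are hypothetical.

```python
import numpy as np


def to_relative(action_chunk, reference_state):
    """Express every step of an absolute joint chunk as an offset from the reference."""
    return action_chunk - reference_state


def to_absolute(relative_chunk, reference_state):
    """Inverse conversion, needed when decoding model output."""
    return relative_chunk + reference_state


state_history = np.array([[0.10, 0.20], [0.12, 0.25]])  # (T_state, D), made-up values
reference = state_history[-1]  # last state frame, mirroring state[state_key][-1]
absolute = np.array([[0.15, 0.25], [0.20, 0.30], [0.25, 0.35]])  # (horizon, D)

relative = to_relative(absolute, reference)
recovered = to_absolute(relative, reference)
print(relative)
```

Every step of the chunk is measured against the same reference frame rather than against the previous step, so the round trip only needs the single current state.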

After conversion, the actions are normalized and clipped:

for joint_group in modality_keys:
    params = self.norm_params[embodiment_tag]["action"][joint_group]
    if (
        self.modality_configs[embodiment_tag]["action"].mean_std_embedding_keys is not None
        and joint_group
        in self.modality_configs[embodiment_tag]["action"].mean_std_embedding_keys
    ):
        normalized = normalize_values_meanstd(action[joint_group], params)
    else:
        normalized = normalize_values_minmax(action[joint_group], params)

    if self.clip_outliers:
        normalized = np.clip(normalized, -1.0, 1.0)

    normalized_values[joint_group] = normalized

The actions returned by StateActionProcessor are still a dictionary keyed by group. Gr00tN1d7Processor concatenates them, in modality_keys order, into a single action matrix:

action_keys = self.modality_configs[embodiment_tag.value]["action"].modality_keys
normalized_actions = torch.cat(
    [torch.from_numpy(normalized_actions[key]) for key in action_keys],
    dim=-1,
)  # (t, d)
action_dim = normalized_actions.shape[1]

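Action dimensions and chunk lengths differ between robots, so the matrix is then padded to max_action_dim and max_action_horizon. The action dimension is padded first: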

normalized_actions = torch.cat(
    [
        normalized_actions,
        torch.zeros(
            normalized_actions.shape[0],
            self.max_action_dim - normalized_actions.shape[1],
        ),
    ],
    dim=-1,
)  # (t, max_action_dim)

Then the action length is padded:

normalized_actions = torch.cat(
    [
        normalized_actions,
        torch.zeros(
            self.max_action_horizon - normalized_actions.shape[0],
            self.max_action_dim,
        ),
    ],
    dim=0,
)  # (max_action_horizon, max_action_dim)

Finally the action_mask is generated:

action_mask = torch.ones_like(normalized_actions)
action_mask[action_horizon:] = 0
action_mask[:, action_dim:] = 0

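action_mask[action_horizon:] = 0 means that time positions beyond the real action chunk length do not contribute to the loss; action_mask[:, action_dim:] = 0 means that padded dimensions beyond the real action dimension do not contribute either. This lets different robots be mixed in the same batch, and the action head does not need to change its output dimension per robot.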
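
A compact way to see what the padding and the mask accomplish is to build both for a small chunk and evaluate a masked error. The sketch uses NumPy rather than torch, takes 50 and 29 from the max_action_horizon and max_action_dim defaults shown earlier, and uses a masked mean squared error purely as an illustration; the model's actual training objective is the Flow Matching loss covered in the previous post.

```python
import numpy as np

MAX_HORIZON, MAX_DIM = 50, 29  # defaults of max_action_horizon / max_action_dim


def pad_with_mask(actions):
    """Zero-pad a (t, d) chunk to (MAX_HORIZON, MAX_DIM) and build the matching mask."""
    t, d = actions.shape
    padded = np.zeros((MAX_HORIZON, MAX_DIM), dtype=np.float32)
    padded[:t, :d] = actions
    mask = np.zeros_like(padded)
    mask[:t, :d] = 1.0  # equivalent to all-ones with rows t: and columns d: zeroed
    return padded, mask


def masked_mse(pred, target, mask):
    """Mean squared error over real entries only (illustrative loss form)."""
    return float(((pred - target) ** 2 * mask).sum() / mask.sum())


# A hypothetical robot with a 16-step chunk and 6 action dimensions.
target, mask = pad_with_mask(np.ones((16, 6), dtype=np.float32))
pred = np.full_like(target, 0.5)  # a constant "prediction", wrong everywhere

loss = masked_mse(pred, target, mask)
print(target.shape, int(mask.sum()), loss)
```

The prediction is also wrong on the padded entries (0.5 versus 0), yet the loss only reflects the 16 x 6 real entries, which is the behavior the two mask assignments are meant to guarantee.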

5. Inference reuses the training-time Processor configuration for input and output conversion

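In training the input comes from the Dataset; at inference it comes from the environment's Observation. N1.7 does not write a completely separate preprocessing path for inference; instead, Gr00tN1d7Processor provides process_observation().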

The function starts by looking up the modality configuration for the robot type:

def process_observation(self, observation: dict[str, Any], embodiment_tag: EmbodimentTag):
    """Process batched observation tensors for inference."""
    modality_config = self.modality_configs[embodiment_tag.value]
    transformed_observation = {}

State processing mirrors the training path, except that the input fields carry a state. prefix:

state_keys = modality_config["state"].modality_keys
state_data = {key: observation[f"state.{key}"] for key in state_keys}

norm_state_dict = self.state_action_processor.apply_state(
    state=state_data, embodiment_tag=embodiment_tag.value
)
normalized_states = torch.cat(
    [torch.from_numpy(norm_state_dict[key]) for key in state_keys], dim=-1
)

The result is then padded to max_state_dim:

padding_shape = (
    *normalized_states.shape[:-1],
    self.max_state_dim - normalized_states.shape[-1],
)
normalized_states = torch.cat([normalized_states, torch.zeros(padding_shape)], dim=-1)
transformed_observation["state"] = normalized_states

Image inputs are likewise stacked in the configured view order and go through the same Qwen3-VL Processor. Finally the embodiment_id is generated:

embodiment_id = (
    torch.ones(B, dtype=torch.int32) * self.embodiment_id_mapping[embodiment_tag.value]
)
transformed_observation["embodiment_id"] = embodiment_id

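The model outputs padded, normalized actions, which cannot be sent to the robot directly. decode_action() slices the actions back into their groups and calls StateActionProcessor.unapply_action() to denormalize them and restore relative actions to absolute ones: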

def decode_action(
    self,
    action: np.ndarray,
    embodiment_tag: EmbodimentTag,
    state: dict[str, np.ndarray] | None = None,
):
    """Undo action normalization and convert relative actions to absolute."""
    out_dict = {}
    start_idx = 0
    joint_groups = self.modality_configs[embodiment_tag.value]["action"].modality_keys
    action_horizon = len(self.modality_configs[embodiment_tag.value]["action"].delta_indices)

    for key in joint_groups:
        joint_dim = self.state_action_processor.norm_params[embodiment_tag.value]["action"][
            key
        ]["dim"].item()
        out_dict[key] = action[..., :action_horizon, start_idx : start_idx + joint_dim]
        start_idx += joint_dim

    return self.state_action_processor.unapply_action(
        out_dict, embodiment_tag.value, state=state
    )

unapply_action() runs in the opposite order from training: it denormalizes first, then converts relative actions back to absolute.

def unapply_action(
    self,
    action: dict[str, np.ndarray],
    embodiment_tag: str,
    state: dict[str, np.ndarray] | None = None,
) -> dict[str, np.ndarray]:
    """
    Reverse action processing (denormalization, relative->absolute conversion).

    Processing order:
    1. Denormalize actions
    2. Convert relative actions to absolute (if configured)
    """

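This is also why the current raw state must be kept at inference time. If an action group uses a relative representation, the model output only describes a change, and without the current state it cannot be turned back into the absolute action the robot controller needs.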
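
The slicing half of decode_action() can be sketched on its own. The snippet below is an illustrative re-implementation with NumPy: split_action and the group dimensions are hypothetical (a 7-DoF arm plus a gripper), and the padded shape (1, 50, 29) follows the defaults shown earlier. Denormalization and the relative-to-absolute step are left out.

```python
import numpy as np


def split_action(padded, group_dims, action_horizon):
    """Cut a (B, max_horizon, max_dim) array back into per-group (B, horizon, dim) arrays."""
    out, start = {}, 0
    # Groups must be visited in the same order used for concatenation during training.
    for key, dim in group_dims.items():
        out[key] = padded[..., :action_horizon, start:start + dim]
        start += dim
    return out


group_dims = {"joint_position": 7, "gripper": 1}  # hypothetical 7-DoF arm plus gripper
model_output = np.zeros((1, 50, 29), dtype=np.float32)  # padded, normalized prediction
model_output[0, :16, 7] = 1.0  # mark the gripper channel so it can be traced

groups = split_action(model_output, group_dims, action_horizon=16)
print({key: value.shape for key, value in groups.items()})
```

Because the columns are recovered purely by running offsets, the group order at decode time has to match the modality_keys order used when the training targets were concatenated.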

6. Integrating a custom robot is mostly a matter of checking the data-layer configuration

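Having read this part of the source, the changes needed to bring a custom robot into GR00T N1.7 turn out to be concentrated in the data layer, not inside the action head.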

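Step one is to prepare data in LeRobot format, containing at least low-dimensional states, actions, videos, task language, and the standard files under meta/. stats.json must exist, otherwise LeRobotEpisodeLoader fails during initialization. If training uses relative actions, also confirm that relative_stats.json is consistent with the action representation.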

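Step two is to write modality.json, splitting the raw state and action vectors into physically meaningful groups. For a 7-axis arm with a gripper, for example, the split could be: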

{
    "state": {
        "joint_position": {
            "start": 0,
            "end": 7
        },
        "gripper": {
            "start": 7,
            "end": 8
        }
    },
    "action": {
        "joint_position": {
            "start": 0,
            "end": 7
        },
        "gripper": {
            "start": 7,
            "end": 8
        }
    },
    "video": {
        "front": {
            "original_key": "observation.images.front"
        },
        "wrist": {
            "original_key": "observation.images.wrist"
        }
    },
    "annotation": {
        "human.task_description": {
            "original_key": "task_index"
        }
    }
}

Step three is to settle ModalityConfig.delta_indices. To have the model predict 16 actions at a time, the action configuration could look like this:

"action": ModalityConfig(
    delta_indices=list(range(16)),
    modality_keys=["joint_position", "gripper"],
)

To use a past image as well, video can be configured as:

"video": ModalityConfig(
    delta_indices=[-15, 0],
    modality_keys=["front", "wrist"],
)

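Step four is to configure the action representation. Joint positions can use relative actions, while the gripper is usually better left absolute: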

action_configs=[
    ActionConfig(
        rep=ActionRepresentation.RELATIVE,
        type=ActionType.NON_EEF,
        format=ActionFormat.DEFAULT,
        state_key="joint_position",
    ),
    ActionConfig(
        rep=ActionRepresentation.ABSOLUTE,
        type=ActionType.NON_EEF,
        format=ActionFormat.DEFAULT,
        state_key="gripper",
    ),
]

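Step five is to confirm that the inference Observation can supply the raw state. Decoding relative actions requires the current state, so if the environment's Observation lacks the corresponding state.xxx fields, decode_action() will ultimately fail inside unapply_action().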
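
Several of these steps can be checked before training with a small consistency test. The helper below is a hypothetical pre-flight check, not part of GR00T; it works on plain (representation, state_key) tuples instead of real ActionConfig objects. It is motivated by two details of apply_action(): the groups and their configs are paired with zip(), which silently stops at the shorter list, and a relative group falls back to its own key when state_key is empty.

```python
def check_action_setup(action_keys, action_configs, state_keys):
    """Return a list of problems; an empty list means the setup looks consistent."""
    problems = []
    if len(action_configs) != len(action_keys):
        # zip() stops at the shorter sequence, so a mismatch would go unnoticed.
        problems.append(
            f"{len(action_keys)} action keys but {len(action_configs)} action configs"
        )
    for key, (rep, state_key) in zip(action_keys, action_configs):
        reference = state_key or key  # same fallback as in apply_action
        if rep == "relative" and reference not in state_keys:
            problems.append(f"relative group '{key}' has no reference state '{reference}'")
    return problems


ok = check_action_setup(
    ["joint_position", "gripper"],
    [("relative", "joint_position"), ("absolute", "gripper")],
    state_keys=["joint_position", "gripper"],
)
bad = check_action_setup(
    ["joint_position", "gripper"],
    [("relative", "joint_pos")],  # one config missing, and a misspelled state key
    state_keys=["joint_position", "gripper"],
)
print(ok)
print(bad)
```

Running a check like this against the final configuration catches the two mistakes that are easiest to make when copying a built-in embodiment: a config list that no longer lines up with modality_keys, and a state_key that does not exist among the state groups.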