Reinforcement Learning Environments - robogym - Study Notes - 2

Table of Contents

[Reinforcement Learning Environments - robogym - Study Notes - 2](#Reinforcement Learning Environments - robogym - Study Notes - 2)
- Project URL
- [Why robogym](#Why robogym)
- [Rearrange - Environment Overview](#Rearrange - Environment Overview)
- [Robot Control Interface](#Robot Control Interface)
- [Environment - list](#Environment - list)
- [Environment Randomization - Interface](#Environment Randomization - Interface)

Project URL

https://github.com/openai/robogym

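Why robogym

My own project needs multi-task, table-top manipulation with a robot arm.
robogym is built on MuJoCo and provides a simulated table-top scene for robot-arm object manipulation (pick-place, stack, rearrange).
Judging from the robogym demos, it supports multiple cameras, both eye-in-hand and eye-to-hand, so multi-view visual observations are available.
robogym objects support the YCB dataset format.

These are the main reasons. The official readme.md lists other useful features as well.

robogym gets little coverage in the mainstream Chinese-language communities, so I am writing these notes as a record and a reference.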

Rearrange - Environment Overview

All the environment classes are subclasses of robogym.robot_env.RobotEnv. The classmethod RobotEnv.build is the main entry point for constructing an environment object, and it is pointed to by make_env in each environment. In other words, each concrete environment class inherits build from RobotEnv, and each environment file defines a module-level name make_env that refers to that method.

The environments extend OpenAI gym and support the reinforcement learning interface offered by gym, including step, reset, render and observe methods.

All environment implementations are under the robogym.envs module and can be instantiated by calling the make_env function. For example, the following code snippet creates a default locked cube environment:
```python
from robogym.envs.dactyl.locked import make_env
# Alternative: from robogym.envs.rearrange.blocks import make_env
# General pattern: from robogym.envs.<family>.<task> import make_env,
# where <family> is dactyl or rearrange and <task> is e.g. locked or blocks.
env = make_env()
```
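
The gym-style interface described above can be exercised with a plain rollout loop. The following is a minimal sketch, assuming the classic gym convention in which step returns (obs, reward, done, info). StubEnv and its sample_action method are hypothetical stand-ins so that the loop runs without MuJoCo; they are not part of robogym. With robogym installed, the object returned by make_env() would take the place of StubEnv, and actions would typically come from env.action_space.sample().

```python
import random


class StubEnv:
    """Hypothetical stand-in for a robogym environment (not part of robogym)."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return {"t": self.t}

    def sample_action(self):
        # A real gym environment would expose env.action_space.sample() instead.
        return random.randint(0, 10)

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return {"t": self.t}, 1.0, done, {}


def rollout(env, max_steps=100):
    """Run one episode with random actions and return (steps, total_reward)."""
    env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        _, reward, done, _ = env.step(env.sample_action())
        total_reward += reward
        steps += 1
        if done:
            break
    return steps, total_reward


steps, total = rollout(StubEnv(episode_length=5))
print(steps, total)
```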

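Robot configuration in the simulation: a UR16e robot equipped with a RobotIQ 2f-85 gripper.

A variety of goal generators is provided to set up different tasks over a given object distribution, such as stack, pick-and-place, reach, and rearrange. My understanding is that a goal generator samples objects from the given distributions and places them in the simulated scene.

Besides these simple tasks, there are harder "holdout" tasks reserved for evaluation. Stacking blocks and rearranging tableware are both quite challenging.

Robogym provides a way to intervene in environment parameters during training, in order to support domain randomization and curriculum learning.

Below is an example of intervening in the number of objects in a rearrange environment.

This interface is used to set the number of objects. It is the same code used for the demo in the previous post: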

```python
from robogym.envs.rearrange.blocks import make_env

# Create an environment with the default number of objects: 5
# By setting num_objects: 5, and max_num_objects: 8,
# this environment will sample 5 blocks by default,
# while allowing the range [1, 8] to be used for num_objects.
env = make_env(
    parameters={
        'simulation_params': {
            'num_objects': 5,
            'max_num_objects': 8,
        }
    }
)

# Acquire number of objects parameter interface
param = env.unwrapped.randomization.get_parameter("parameters:num_objects")

# Set num_objects: 3 for the next episode
param.set_value(3)

# Reset to randomly generate an environment with `num_objects: 3`
obs = env.reset()
```
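
One way to use this interface for curriculum learning is to raise num_objects as the policy improves. The sketch below is an assumption about what such a schedule could look like, not something robogym ships: next_num_objects and its thresholds are hypothetical names and values. The returned count would be passed to param.set_value(...) before the next env.reset().

```python
def next_num_objects(current, success_rate, max_num_objects=8,
                     promote_at=0.8, demote_at=0.3):
    """Hypothetical curriculum rule: step the object count up or down by one.

    current:       object count used in the last batch of episodes
    success_rate:  fraction of those episodes that reached the goal
    """
    if success_rate >= promote_at:
        current += 1
    elif success_rate <= demote_at:
        current -= 1
    # Stay inside the range the environment was built for: [1, max_num_objects].
    return max(1, min(max_num_objects, current))


print(next_num_objects(3, 0.9))   # 4
print(next_num_objects(8, 0.95))  # 8
print(next_num_objects(1, 0.1))   # 1
```

Clamping to [1, max_num_objects] matters because the environment above only allows num_objects within the range fixed at construction time.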

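Most robot environments can be customized by passing additional parameters and constants to make_env.

To find out which constants an environment supports, look at the definition of its <EnvName>Constants class, which usually lives in the same file as make_env. Parameters can be customized alongside constants; the supported ones are defined in <EnvName>Parameters. (Similar names show up in the from ... import ... statements of the environment file.)

Some commonly supported constants are:

randomize: if true, physical properties, actions, and observations are randomized.
mujoco_substeps: the number of simulator substeps per environment step, used to balance simulation accuracy against training speed.
max_timesteps_per_goal: the maximum number of timesteps allowed to achieve each goal before a timeout.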
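
A small helper can keep such overrides in one place. This is only a sketch: the three keys are the constants listed above, while make_env_kwargs is a hypothetical helper, and whether a particular environment accepts each key should be checked against its <EnvName>Constants class.

```python
def make_env_kwargs(randomize=None, mujoco_substeps=None,
                    max_timesteps_per_goal=None):
    """Collect only the constants that were explicitly set (hypothetical helper)."""
    candidates = {
        "randomize": randomize,
        "mujoco_substeps": mujoco_substeps,
        "max_timesteps_per_goal": max_timesteps_per_goal,
    }
    # Leave out anything left at None so the environment defaults still apply.
    constants = {k: v for k, v in candidates.items() if v is not None}
    return {"constants": constants}


kwargs = make_env_kwargs(randomize=False, mujoco_substeps=10)
print(kwargs)
# With robogym installed, this would be used as: env = make_env(**kwargs)
```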

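Robot Control Interface

A general robot control framework is provided, aimed at position control of robot arms equipped with grippers.

The library implements control classes for the UR16e arm, which has 6 actuated joints and a 1-DOF RobotIQ 2f-85 gripper.

The Robogym rearrange environments can be customized through the RobotControlParameter class, as described in this section.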

| Control Mode | Description | Action Dimensions |
| --- | --- | --- |
| Joint | Joint position control mode, where joints are actuated via PID or Cascaded PI controllers based on the `arm_joint_calibration_path` specification. See mujoco-py for more details on the low-level controller implementation. (Key point: the 6 joints give 6 degrees of freedom, so the action space is 6-dimensional.) | 6 |
| tcp+roll+yaw (default) | Tool center point (TCP) relative position control in local coordinates and 2-DOF rotation control via wrist rotation and wrist tilt. | 5 |
| tcp+wrist | TCP relative position control in local coordinates and 1-DOF rotation control via wrist rotation. | 4 |

For the TCP control modes, two solver modes are available:

| TCP Solver Mode | Description |
| --- | --- |
| mocap | Control is achieved via the MuJoCo mocap mechanism, which is used as a simulation based Inverse Kinematics (IK) solver. In this mode, robot joint dynamics cannot be enforced and motion is dictated by the solver parameters of the MuJoCo sim, which may result in high contact forces and simulation instabilities. |
| mocap_ik (default) | This mode is provided for applications that use joint actuated robots that are controllable in the TCP domain. One example of such an application would be when a policy is trained to output relative position and rotation actions in the tooltip space, which are physically realized by the robot servoing in the joint space. In this mode, we use a solver simulation that uses the `mocap` mechanism as described above as an IK solver. The abstract solver interface can also be used to develop an analytical IK solver; however, the stability of such a solver has been poor in our experience. The positions achieved by this solver simulation are then used as targets for a joint-controlled robot simulation, whose dynamics will be determined by the specific controller. |

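Every environment contains a robot object, accessible via env.robot, that implements RobotInterface. For environments that combine several robots, such as an arm plus a gripper, env.robot is a CompositeRobot, which splits the action space appropriately across the individual robot implementations.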

| Environment Name (`*` means every file under the directory) | Robot Control Parameters | Robot Class | Action Dimension |
| --- | --- | --- | --- |
| `dactyl/*` | N/A | `MuJoCoShadowHand` | 20 |
| `rearrange/*` | `{'control_mode': 'joint'}` | `MujocoURJointGripperCompositeRobot` | 7 |
| `rearrange/*` | `{'control_mode': 'tcp+roll+yaw', 'tcp_solver_mode':'mocap_ik'}` | `MujocoURTcpJointGripperCompositeRobot` | 6 |
| `rearrange/*` | `{'control_mode': 'tcp+wrist', 'tcp_solver_mode':'mocap_ik'}` | `MujocoURTcpJointGripperCompositeRobot` | 5 |
| `rearrange/*` | `{'control_mode': 'tcp+roll+yaw', 'tcp_solver_mode':'mocap'}` | `MujocoIdealURGripperCompositeRobot` | 6 |
| `rearrange/*` | `{'control_mode': 'tcp+wrist', 'tcp_solver_mode':'mocap'}` | `MujocoIdealURGripperCompositeRobot` | 5 |
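
The action dimensions of the rearrange rows are consistent with the control-mode table earlier in this section: each one equals the arm's action dimension plus one for the 1-DOF gripper. The short sketch below just checks that arithmetic; the names ARM_ACTION_DIMS and composite_action_dim are hypothetical and do not come from robogym.

```python
# Arm action dimensions per control mode, copied from the control-mode table.
ARM_ACTION_DIMS = {
    "joint": 6,
    "tcp+roll+yaw": 5,
    "tcp+wrist": 4,
}
GRIPPER_ACTION_DIMS = 1  # the 1-DOF gripper


def composite_action_dim(control_mode):
    """Total action dimension of arm plus gripper for a rearrange environment."""
    return ARM_ACTION_DIMS[control_mode] + GRIPPER_ACTION_DIMS


for mode in ARM_ACTION_DIMS:
    print(mode, composite_action_dim(mode))
```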

The docs do not say how to select these control modes, and my attempt to set them through the parameters argument of make_env failed.

Environment - list

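Training environments are categorized by object set and goal distribution. By default, each training environment is designed to generate a fixed number of objects.

To train a policy with a variable number of objects, use the num_objects parameter of the randomization interface.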

```jsonnet
constants: {
    # For the reward, we do the following:
    #  - for every object that gets placed correctly, immediately provide +1
    #  - once all objects are properly placed, the policy needs to hold the goal
    #    state for 2 seconds (in simulation time) until it gets `success_reward` reward
    #  - the next goal is generated

    success_reward: 5.0,
    success_pause_range_s: [0.0, 0.5],
    max_timesteps_per_goal_per_obj: 600,
    vision: true,  # use false if you don't want to use vision observations
    vision_args: {
        image_size: 200,
        camera_names: ['vision_cam_front'],
        mobile_camera_names: ['vision_cam_wrist']
    },
    # Note: I did not fully understand goal_args.
    goal_args: {
        rot_dist_type: 'full',
        randomize_goal_rot: true,
        p_goal_hide_robot:: 1.0,
    },
    # 'obj_pos': l2 distance in meters; 'obj_rot': euler angle distance.
    success_threshold: {'obj_pos': 0.04, 'obj_rot': 0.2},
},
parameters: {
    simulation_params: {
        num_objects: 1,
        max_num_objects: 32,
        object_size: 0.0254,
        # Note: I did not understand this parameter; my guess is a 1:1 ratio
        # between the usable area and the table.
        used_table_portion: 1.0,
        goal_distance_ratio: 1.0,
        # Note: I did not understand this one either; presumably it controls
        # whether shadows cast by the lighting are rendered.
        cast_shadows:: false,
        penalty: {
            # Penalty for collisions (wrist camera, table)
            wrist_collision:: 0.0,
            table_collision:: 0.0,

            # Penalty for safety stops
            safety_stop_penalty:: 0.0,
        }
    }
}
```
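
The reward comments and success_threshold above can be read as the following rule. This is a sketch of my reading of the config, not robogym's actual reward code: the function names are hypothetical, using "less than or equal" for the threshold comparison is an assumption, and the hold-time logic is reduced to a single goal_held flag.

```python
SUCCESS_THRESHOLD = {"obj_pos": 0.04, "obj_rot": 0.2}  # values from the config
SUCCESS_REWARD = 5.0                                   # success_reward


def object_placed(pos_dist, rot_dist, threshold=SUCCESS_THRESHOLD):
    """True if one object is within both the position and rotation thresholds."""
    return pos_dist <= threshold["obj_pos"] and rot_dist <= threshold["obj_rot"]


def step_reward(prev_placed, now_placed, goal_held):
    """Illustrative reward: +1 per newly placed object, plus the success
    reward once every object is placed and the goal state has been held."""
    newly_placed = sum(1 for before, now in zip(prev_placed, now_placed)
                       if now and not before)
    reward = float(newly_placed)
    if all(now_placed) and goal_held:
        reward += SUCCESS_REWARD
    return reward


print(object_placed(0.03, 0.1))                          # within both thresholds
print(step_reward([False, True], [True, True], False))   # one new placement
print(step_reward([True, True], [True, True], True))     # goal held
```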

List of training environments:

| Name | File | Config (overwrite) | Description |
| --- | --- | --- | --- |
| blocks reach | blocks_reach.py | - | Place the end effector of a robot at the target position. This training environment is not compatible with holdout environments. |
| blocks | blocks_train.py | - | Pushing blocks to targets on the surface of a table. |
| blocks (push + pick-and-place) | blocks_train.py | `constants.goal_args.pickup_proba: 0.4` (the probability of a pick-up goal is 0.4) | Pushing or pick-and-placing blocks to targets on the surface of a table or in the air. |
| k-composer | composer.py | `parameters.simulation_params.num_max_geoms: k` (at most `k` meshes per object) | Pushing objects to targets on the surface of a table. Each object is created by randomly composing `[1, k]` meshes. |
| ycb | ycb.py | `parameters.simulation_params.mesh_scale: 0.6` | Pushing ycb objects to targets on the surface of a table. |
| mixture | mixture.py | `constants: {normalize_mesh: True, normalized_mesh_size: 0.05}` `parameters.simulation_params.mesh_scale: 1.0` | Pushing objects to targets on the surface of a table; objects are randomly sampled from ycb or simple geom shapes. |

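A set of holdout environments is designed to evaluate how well a learned policy generalizes.

To keep the holdout environments compatible with the recommended training environments, they should inherit the same default configuration as the training environments.

The holdout environments and their configurations are listed in the table at https://github.com/openai/robogym/blob/master/docs/list_rearrange_env.md

As a configuration common to all holdout environments, a holdout environment with n objects overrides the default object count: parameters.simulation_params.num_objects: n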
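
The dotted keys used in the "Config (overwrite)" column and in the num_objects override above name a path into the nested configuration. As an illustration of that notation only, the sketch below applies such an override to a plain Python dict; apply_override is a hypothetical helper and is not how robogym itself merges its jsonnet configs.

```python
import copy


def apply_override(config, dotted_key, value):
    """Return a copy of config with the value at dotted_key replaced."""
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate levels if the default config lacks them.
        node = node.setdefault(key, {})
    node[leaf] = value
    return result


defaults = {"parameters": {"simulation_params": {"num_objects": 1,
                                                 "max_num_objects": 32}}}
holdout = apply_override(defaults, "parameters.simulation_params.num_objects", 4)
print(holdout["parameters"]["simulation_params"])
```

Working on a deep copy leaves the training defaults untouched, so the same defaults can seed several holdout variants.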

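Environment Randomization - Interface

Robogym provides a way to intervene in environment parameters during training, in order to support domain randomization and curriculum learning. This interface is called randomization. It is used to randomize various aspects of an environment, such as the initial state distribution, the goal distribution, and the transition dynamics. The documentation describes an example that uses randomization to modify the number of objects sampled in the blocks_train environment.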
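
For domain randomization over the object count, one simple option is to draw a new count for every episode. The sketch below shows only the sampling step and is an assumption about usage rather than robogym code: sample_num_objects is a hypothetical helper, and in robogym each drawn value would be passed to the set_value method of the "parameters:num_objects" parameter before calling env.reset().

```python
import random


def sample_num_objects(rng, max_num_objects):
    """Draw an object count uniformly from [1, max_num_objects]."""
    return rng.randint(1, max_num_objects)


rng = random.Random(0)  # seeded so that a run can be reproduced
counts = [sample_num_objects(rng, 8) for _ in range(5)]
print(counts)
```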