Abstract
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. (Note: large language models have already demonstrated that a single model can handle a wide variety of tasks this way.) While this capability has been demonstrated in other fields such as computer vision, natural language processing, and speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse robotic data. (Note: the training is open-ended and not tailored to any single task, analogous to how large language models are trained.) In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
Introduction
End-to-end robotic learning, with either imitation or reinforcement, typically involves collecting task-specific data in either single-task (Kalashnikov et al., 2018; Zhang et al., 2018) or multi-task (Kalashnikov et al., 2021b; Jang et al., 2021) settings that are narrowly tailored to the tasks that the robot should perform. This workflow mirrors the classic approach to supervised learning in other domains, such as computer vision and NLP, where task-specific datasets would be collected, labeled, and deployed to solve individual tasks, with little interplay between the tasks themselves.

Recent years have seen a transformation in vision, NLP, and other domains, away from siloed, small-scale datasets and models and towards large, general models pre-trained on broad, large datasets. The keys to the success of such models lie with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the knowledge present in large-scale datasets. If a model can "sponge up" experience to learn general patterns in language or perception, then it can bring them to bear on individual tasks more efficiently. While removing the need for large task-specific datasets is appealing generally in supervised learning, it is even more critical in robotics, where datasets might require engineering-heavy autonomous operation or expensive human demonstrations.

We therefore ask: can we train a single, capable, large multi-task backbone model on data consisting of a wide variety of robotic tasks? (Note: ideally such a model would also adapt zero-shot to new tasks, environments, and objects.)
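The question above implies a particular interface: one model that, conditioned on a natural-language instruction, produces actions for any of the tasks it was trained on, rather than one model per task. A minimal sketch of that interface follows; all names are hypothetical, and the "policy" is a stub lookup table standing in for a learned model, not the actual Robotics Transformer.

```python
import numpy as np

class MultiTaskPolicy:
    """Sketch of a single policy conditioned on a task instruction.

    In a real system the mapping from (image, instruction) to action would be
    a learned high-capacity model (e.g. a Transformer); here it is a fixed
    table, purely to illustrate the single-backbone, many-tasks interface.
    """

    def __init__(self, instructions):
        # One entry per known task; a 7-dim action vector is an illustrative
        # choice (e.g. arm pose + gripper), not taken from the paper excerpt.
        self.table = {inst: np.zeros(7) for inst in instructions}

    def act(self, image, instruction):
        # A single forward pass handles whichever task the instruction names.
        if instruction not in self.table:
            raise KeyError(f"unseen instruction: {instruction}")
        return self.table[instruction]

policy = MultiTaskPolicy(["pick apple", "open drawer"])
action = policy.act(image=np.zeros((300, 300, 3)), instruction="pick apple")
print(action.shape)
```

The point of the sketch is the signature, not the body: because the task is specified in the input rather than baked into the weights of a per-task model, adding tasks means adding data, not adding models.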