Reproducing Robotics Transformer 1
Over the Christmas holidays, I continued my VLA (Vision–Language–Action) learning track. I carefully read two papers, RT-1: Robotics Transformer for Real-World Control at Scale (Brohan et al., 2022) and RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Brohan et al., 2023), and wrote my reading notes here: https://mrtanke.github.io/posts/2026-01-09-rt-series/. After finishing the notes, I decided to reproduce Robotics Transformer 1 (RT-1) in PyTorch, not to build a production system, but to truly understand the design decisions and implement the core ideas from the paper end to end. The goal is a learning-oriented, minimal implementation that stays close to the RT-1 architecture while keeping the codebase clean and readable.

Since training RT-1 at scale requires a heavy TFDS/RLDS pipeline and large real-robot datasets, I intentionally kept the data side minimal: I use a synthetic dataset that mirrors RT-1's input and output shapes to validate the model forward pass, action tokenization, and the behavioral cloning training loop. A minimal sketch of such a dataset is shown below. ...
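To make the idea concrete, here is a minimal sketch of what such a synthetic dataset could look like in PyTorch. It assumes RT-1's published shapes (a 6-frame history of 300x300 RGB images, a 512-dimensional language embedding as produced by the Universal Sentence Encoder, and 11 action dimensions each discretized into 256 bins); the class name `SyntheticRT1Dataset` and its field names are hypothetical, not taken from my actual code.

```python
# Sketch: random tensors shaped like RT-1 inputs/targets, for shape-checking only.
# Assumed shapes: 6-frame image history at 300x300 RGB, 512-d instruction
# embedding, 11 action dimensions with 256 discrete bins each.
import torch
from torch.utils.data import Dataset, DataLoader


class SyntheticRT1Dataset(Dataset):
    """Synthetic data mirroring RT-1 input/output shapes (hypothetical helper)."""

    def __init__(self, num_samples=1024, history=6, image_size=300,
                 lang_dim=512, action_dims=11, num_bins=256):
        self.num_samples = num_samples
        self.history = history
        self.image_size = image_size
        self.lang_dim = lang_dim
        self.action_dims = action_dims
        self.num_bins = num_bins

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # History of RGB frames: (history, 3, H, W).
        images = torch.rand(self.history, 3, self.image_size, self.image_size)
        # Instruction embedding stand-in: (lang_dim,).
        lang_embed = torch.randn(self.lang_dim)
        # Behavioral-cloning target: one discrete token per action dimension.
        action_tokens = torch.randint(0, self.num_bins, (self.action_dims,))
        return {"images": images, "lang_embed": lang_embed,
                "action_tokens": action_tokens}


if __name__ == "__main__":
    loader = DataLoader(SyntheticRT1Dataset(), batch_size=8, shuffle=True)
    batch = next(iter(loader))
    print(batch["images"].shape)         # torch.Size([8, 6, 3, 300, 300])
    print(batch["lang_embed"].shape)     # torch.Size([8, 512])
    print(batch["action_tokens"].shape)  # torch.Size([8, 11])
```

Because the tensors are random, this only exercises shapes and the training plumbing; it says nothing about learning quality, which is exactly the trade-off of keeping the data side minimal.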