Reproducing Robotics Transformer 1

Over the Christmas holidays, I continued my VLA (Vision–Language–Action) learning track. I carefully read two papers, RT-1: Robotics Transformer for Real-World Control at Scale (Brohan et al., 2022) and RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Brohan et al., 2023), writing my reading notes here: https://mrtanke.github.io/posts/2026-01-09-rt-series/. After finishing the notes, I decided to reproduce Robotics Transformer 1 (RT-1) in PyTorch — not to build a production system, but to truly understand the design decisions and implement the core ideas from the paper end-to-end. The goal is a learning-oriented, minimal implementation that stays close to the RT-1 architecture while keeping the codebase clean and readable. Since training RT-1 at scale requires a heavy TFDS/RLDS pipeline and large real-robot datasets, I intentionally kept the data side minimal: I use a synthetic dataset that mirrors RT-1’s input and output shapes to validate the model forward pass, action tokenization, and the behavioral cloning training loop. ...
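The action tokenization mentioned above can be sketched in a few lines. This is a hedged illustration, assuming RT-1's uniform per-dimension discretization into 256 bins; the function names and the [-1, 1] bounds below are my own choices for the example, not the repo's API:

```python
import torch

def tokenize_actions(actions, low, high, num_bins=256):
    """Map continuous actions in [low, high] to integer bin tokens,
    mirroring RT-1-style uniform per-dimension discretization."""
    # Normalize each dimension to [0, 1], then bucket into uniform bins.
    normalized = (actions - low) / (high - low)
    return (normalized * num_bins).long().clamp(0, num_bins - 1)

def detokenize_actions(tokens, low, high, num_bins=256):
    """Inverse map: bin token -> continuous value at the bin center."""
    centers = (tokens.float() + 0.5) / num_bins
    return centers * (high - low) + low
```

Round-tripping through the tokenizer loses at most half a bin width per dimension, which is the quantization error the policy has to live with.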

January 10, 2026 · 2280 words

RT Series

Paper-reading notes: RT-1 and RT-2
January 9, 2026 · 1731 words

Reproducing Diffusion Policy

At the end of 2025, I spent a few days reproducing Diffusion Policy from Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. I first spent about a day going through the paper; if you are interested, feel free to check my paper-reading notes. The work is impressive, so I decided to reproduce it over the Christmas break. The repo lives at https://github.com/mrtanke/diffusion-policy.

Repo skeleton

diffusion-policy/
├── diffusion_policy/              # Library code (importable package)
│   ├── __init__.py                # Package marker
│   ├── checkpoint.py              # Save/load checkpoints
│   ├── normalizer.py              # Min-max normalization to/from [-1, 1]
│   ├── data/
│   │   ├── pusht_zarr_dataset.py  # Load PushT replay data and return training samples: observation history + future action trajectory
│   │   └── sequence_utils.py      # Build the start/end indices for each fixed-length training sample/window within an episode
│   └── models/
│       ├── diffusion.py           # DiffusionPolicy training and sampling wrapper
│       ├── denoisers.py           # Temporal UNet denoiser / noise predictor
│       └── encoders.py            # Observation encoder
├── train.py                       # Main training entrypoint
├── eval_pusht.py                  # Eval script for PushT
└── data/pusht/                    # Local dataset folder (pusht_cchi_v7_replay.zarr/)

Core algorithm

We want to generate an expert action trajectory by denoising a noisy action trajectory, just as image diffusion does. To do this, we train a model to predict the noise added to each action in a noised action trajectory. We then use the predicted noise to gradually denoise the trajectory back toward an expert one. ...
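The training objective described above can be sketched as a single DDPM-style step. This is a minimal, hedged illustration — a linear beta schedule and a `denoiser(noisy_actions, t, obs_cond)` call signature are my own assumptions for the sketch, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(denoiser, actions, obs_cond, num_timesteps=100):
    """One denoising-objective training step (DDPM-style).

    actions: (B, T, action_dim) clean expert action trajectory.
    The denoiser is trained to recover the Gaussian noise injected
    by the forward diffusion process, conditioned on observations.
    """
    B = actions.shape[0]
    # Linear beta schedule -> cumulative alpha-bar terms.
    betas = torch.linspace(1e-4, 0.02, num_timesteps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    # Sample one random diffusion timestep per trajectory in the batch.
    t = torch.randint(0, num_timesteps, (B,))
    ab = alpha_bar[t].view(B, 1, 1)
    # Forward process: corrupt clean actions with Gaussian noise.
    noise = torch.randn_like(actions)
    noisy_actions = ab.sqrt() * actions + (1 - ab).sqrt() * noise
    # Predict the injected noise and regress it with MSE.
    pred_noise = denoiser(noisy_actions, t, obs_cond)
    return F.mse_loss(pred_noise, noise)
```

At inference time the same schedule is run in reverse: starting from pure noise, the predicted noise is repeatedly subtracted to recover an action trajectory.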

January 2, 2026 · 2442 words

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Paper-reading notes: Diffusion Policy
December 28, 2025 · 672 words

OpenVLA: An Open-Source Vision-Language-Action Model

Paper-reading notes: OpenVLA
December 12, 2025 · 312 words