Reproducing Robotics Transformer 1

Over the Christmas holidays, I continued my VLA (Vision–Language–Action) learning track. I carefully read two papers, RT-1: Robotics Transformer for Real-World Control at Scale (Brohan et al., 2022) and RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Brohan et al., 2023), writing my reading notes here: https://mrtanke.github.io/posts/2026-01-09-rt-series/. After finishing the notes, I decided to reproduce Robotics Transformer 1 (RT-1) in PyTorch — not to build a production system, but to truly understand the design decisions and implement the core ideas from the paper end-to-end. The goal is a learning-oriented, minimal implementation that stays close to the RT-1 architecture while keeping the codebase clean and readable. Since training RT-1 at scale requires a heavy TFDS/RLDS pipeline and large real-robot datasets, I intentionally kept the data side minimal: I use a synthetic dataset that mirrors RT-1’s input and output shapes to validate the model forward pass, action tokenization, and the behavioral cloning training loop. ...
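The action tokenization mentioned above can be sketched in a few lines. This is a hedged illustration, assuming RT-1's uniform per-dimension discretization into 256 bins; the function names and the [-1, 1] bounds below are my own choices for the example, not the repo's API:

```python
import torch

def tokenize_actions(actions, low, high, num_bins=256):
    """Map continuous actions in [low, high] to integer bin tokens,
    mirroring RT-1-style uniform per-dimension discretization."""
    # Normalize each dimension to [0, 1], then bucket into uniform bins.
    normalized = (actions - low) / (high - low)
    return (normalized * num_bins).long().clamp(0, num_bins - 1)

def detokenize_actions(tokens, low, high, num_bins=256):
    """Inverse map: bin token -> continuous value at the bin center."""
    centers = (tokens.float() + 0.5) / num_bins
    return centers * (high - low) + low
```

Round-tripping through the tokenizer loses at most half a bin width per dimension, which is the quantization error the policy has to live with.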

January 10, 2026 · 2280 words

RT Series

Paper-reading notes: RT-1 and RT-2
January 9, 2026 · 1731 words

Reproducing Diffusion Policy

At the end of 2025, I spent a few days reproducing Diffusion Policy from Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. I first spent about a day going through the paper; if you are interested, feel free to check my paper-reading notes. The work is impressive, so I decided to reproduce it over the Christmas break. The repo lives at https://github.com/mrtanke/diffusion-policy.

Repo skeleton

diffusion-policy/
├── diffusion_policy/              # Library code (importable package)
│   ├── __init__.py                # Package marker
│   ├── checkpoint.py              # Save/load checkpoints
│   ├── normalizer.py              # Min-max normalization to/from [-1, 1]
│   ├── data/
│   │   ├── pusht_zarr_dataset.py  # Load PushT replay data and return training samples: observation history + future action trajectory
│   │   └── sequence_utils.py      # Build the start/end indices for each fixed-length training sample/window within an episode
│   └── models/
│       ├── diffusion.py           # DiffusionPolicy training and sampling wrapper
│       ├── denoisers.py           # Temporal UNet denoiser / noise predictor
│       └── encoders.py            # Observation encoder
├── train.py                       # Main training entrypoint
├── eval_pusht.py                  # Eval script for PushT
└── data/pusht/                    # Local dataset folder (pusht_cchi_v7_replay.zarr/)

Core algorithm

We want to generate an expert action trajectory by denoising a noisy action trajectory, just as image diffusion does. To do this, we train a model to predict the noise added to each action in a noised action trajectory. We then use the predicted noise to gradually denoise the trajectory back toward an expert one. ...
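The training objective described above can be sketched as a single DDPM-style step. This is a minimal, hedged illustration — a linear beta schedule and a `denoiser(noisy_actions, t, obs_cond)` call signature are my own assumptions for the sketch, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(denoiser, actions, obs_cond, num_timesteps=100):
    """One denoising-objective training step (DDPM-style).

    actions: (B, T, action_dim) clean expert action trajectory.
    The denoiser is trained to recover the Gaussian noise injected
    by the forward diffusion process, conditioned on observations.
    """
    B = actions.shape[0]
    # Linear beta schedule -> cumulative alpha-bar terms.
    betas = torch.linspace(1e-4, 0.02, num_timesteps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    # Sample one random diffusion timestep per trajectory in the batch.
    t = torch.randint(0, num_timesteps, (B,))
    ab = alpha_bar[t].view(B, 1, 1)
    # Forward process: corrupt clean actions with Gaussian noise.
    noise = torch.randn_like(actions)
    noisy_actions = ab.sqrt() * actions + (1 - ab).sqrt() * noise
    # Predict the injected noise and regress it with MSE.
    pred_noise = denoiser(noisy_actions, t, obs_cond)
    return F.mse_loss(pred_noise, noise)
```

At inference time the same schedule is run in reverse: starting from pure noise, the predicted noise is repeatedly subtracted to recover an action trajectory.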

January 2, 2026 · 2442 words

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Paper-reading notes: Diffusion Policy
December 28, 2025 · 672 words

OpenVLA: An Open-Source Vision-Language-Action Model

Paper-reading notes: OpenVLA
December 12, 2025 · 312 words