Reinforcement Learning

In machine learning, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning. (Wikipedia)

Notation

| Symbol | Name | Description |
|---|---|---|
| $s_t$ | State | The observation/input from the environment at time $t$. |
| $a_t$ | Action | The decision made by the agent at time $t$. |
| $r_t$ | Reward | The feedback signal received after taking an action. |
| $\pi$ | Policy | The agent’s action-selection strategy (a mapping from states to actions). |
| $\gamma$ | Discount Factor | A value in $[0, 1]$ that determines how much the agent values future rewards relative to immediate ones. |
| $T$ | Number of Steps | The length of one trajectory. |
| $G_t$ | Return | The total accumulated (and usually discounted) reward from time $t$ onwards. |
| $V(s)$ | Value Function | The expected return starting from state $s$. |
| $Q(s, a)$ | Q-Value | The expected return starting from state $s$ and taking action $a$. |
| $\theta$ | Parameters | The weights of the neural network representing the policy or value function. |
| $\alpha$ | Learning Rate | The step size used when updating the agent’s knowledge (parameters). |
| $\tau$ | Trajectory | A sequence of states, actions, and rewards $(s_0, a_0, r_0, s_1, \dots)$. |
| $J(\theta)$ | Objective Function | A measure of how good the current policy is (usually the expected total reward). |
| $\nabla_\theta$ | Gradient | The direction and magnitude of the change needed in $\theta$ to increase $J$. |
| $D$ | Dataset | The training dataset for supervised and unsupervised learning. |

Basics

In supervised and unsupervised learning, the model is trained on a static dataset to identify underlying patterns. The update signal is derived entirely from the fixed, provided data, and there is no interaction with an external system… ...
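The return $G_t$ in the table above can be computed in a single backward pass over a trajectory's rewards, since $G_t = r_t + \gamma G_{t+1}$. A minimal sketch (my own illustration, not code from the post; the function name and reward values are made up):

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ..., G_{T-1}] for a list of per-step rewards,
    computed backwards via the recursion G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # accumulate from the end of the trajectory
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward pass makes the computation O(T) instead of the O(T²) a naive per-step summation would cost.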

April 19, 2026 | 2076 words | Author: Tan Ke

GPU

In this post, I’ll walk through GPUs and CUDA. Hope it helps with my final exam and AI learning… GPU stands for Graphics Processing Unit. Looking back at its history, the GPU first appeared as fixed-function hardware to speed up the parallel work in real-time 3D graphics. Over time, GPUs became more programmable. By 2003, parts of the graphics pipeline were fully programmable, running custom code in parallel across many elements of a 3D scene or an image. ...

April 16, 2026 | 5144 words | Author: Tan Ke

Scenestreamer: Continuous Scenario Generation As Next Token Group Prediction

Paper-reading notes: Scenestreamer Continuous Scenario Generation As Next Token Group Prediction
March 31, 2026 | 2234 words | Author: Tan Ke

InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

Paper-reading notes: InSpatio-WorldFM
March 24, 2026 | 775 words | Author: Tan Ke

Evolution and Ablation of Robotic World Models

World Models Paper | Homepage There is an interactive loop between the agent and the environment: the agent observes the environment, takes an action in response, and the environment changes accordingly. The agent model can be viewed as the brain of the agent: it is the overall decision-making system that enables the agent to perceive the environment, maintain temporal context, and choose actions. A typical agent model has three components: ...

March 22, 2026 | 5346 words | Author: Tan Ke

A Review of Robbyant’s Early-2026 Work

Robbyant is a company under Ant Group, dedicated to building the foundational platform for Embodied AI and bridging the gap between digital intelligence and the physical world. Since the company is still relatively new, I want to quickly review its recent work. In particular, I will study four embodied intelligence models: the spatial perception model, the VLA model, the world model, and the video action model. The diagram on Robbyant’s homepage reflects its vision for embodied intelligence: starting from sensory input, the system first builds spatial intelligence to understand the physical world, then relies on an action model to make decisions and interact with the environment, and finally improves through environmental reward. ...

March 16, 2026 | 3594 words | Author: Tan Ke

π Series (π₀, π₀.₅)

Physical Intelligence is a fast-rising company focused on bringing general-purpose AI into the physical world. In under two years since introducing their first VLA prototype model π₀, they’ve made a huge impact in the embodied intelligence community. In this post, I’ll walk through the three main VLA models they’ve released so far, based on my reading of their blogs and papers. π₀ π₀ is a vision-language-action (VLA) model built on top of a pre-trained vision–language model (VLM) backbone. It is then robot-pretrained on a large mixture of open-source and in-house manipulation datasets to learn broad, general skills, and can be further post-trained on smaller, task-specific data to specialize for downstream applications. ...

March 1, 2026 | 2611 words | Author: Tan Ke

Optimization in Machine Learning

The summary of the seminar “Optimization in Machine Learning”, covering Bayesian Optimization, multi-fidelity methods, handling discrete search spaces, and the BANANAS method for NAS.
February 10, 2026 | 2443 words | Author: Tan Ke

BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search

Paper-reading notes: BANANAS
February 5, 2026 | 329 words | Author: Tan Ke

UrbanLF: A Comprehensive Light Field Dataset for Semantic Segmentation of Urban Scenes

Paper-reading notes: UrbanLF
January 17, 2026 | 432 words | Author: Tan Ke