Math-Shepherd

This is the talk I gave at the seminar “Process Reward Modeling in LLMs” at the University of Heidelberg. It consisted of a presentation and a short academic discussion with classmates and professors, covering the paper, its experiments, and reproduction results. The paper is “Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations” (Wang et al., 2024). You can find the paper and slides here: Annotated Paper (PDF) | Preview Slides (PDF) ...

May 1, 2026 | 2808 words | Author: Tan Ke

Reinforcement Learning

In machine learning, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning. (Wikipedia)

Notation

| Symbol | Name | Description |
| --- | --- | --- |
| $s_t$ | State | The observation/input from the environment at time $t$. |
| $a_t$ | Action | The decision made by the agent at time $t$. |
| $r_t$ | Reward | The feedback signal received after taking an action. |
| $\pi$ | Policy | The agent’s action selection strategy (a mapping from states to actions). |
| $\gamma$ | Discount Factor | A value between 0 and 1 that determines how much the agent values future rewards relative to immediate ones. |
| $T$ | Number of steps | The length of one trajectory. |
| $G_t$ | Return | The total accumulated (and usually discounted) reward from time $t$ onwards. |
| $V(s)$ | Value Function | The expected return starting from state $s$. |
| $Q(s, a)$ | Q-Value | The expected return starting from state $s$ and taking action $a$. |
| $\theta$ | Parameters | The weights of the neural network representing the policy or value function. |
| $\alpha$ | Learning Rate | The step size used when updating the agent’s knowledge (parameters). |
| $\tau$ | Trajectory | A sequence of states, actions, and rewards $(s_0, a_0, r_0, s_1, \dots)$. |
| $J(\theta)$ | Objective Function | A measure of how good the current policy is (usually the expected total reward). |
| $\nabla_\theta$ | Gradient | The direction and magnitude of the change needed for $\theta$ to increase $J$. |
| $D$ | Dataset | Training dataset for supervised and unsupervised learning. |

Basics

In supervised and unsupervised learning, the model is trained on a static dataset to identify underlying patterns. The update signal is derived entirely from the fixed, provided data, and there is no interaction with an external system… ...

April 19, 2026 | 2075 words | Author: Tan Ke

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Paper-reading notes: AlphaZero
November 24, 2025 | 360 words | Author: Tan Ke

Mastering the game of Go without human knowledge

Paper-reading notes: AlphaGo Zero
November 24, 2025 | 342 words | Author: Tan Ke

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper-reading notes: DeepSeek-R1 - Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
November 4, 2025 | 2299 words | Author: Tan Ke

Mastering the game of Go with MCTS and Deep Neural Networks

Paper-reading notes: Mastering the game of Go with MCTS and Deep Neural Networks
October 24, 2025 | 2246 words | Author: Tan Ke