Reinforcement-Learning

Math-Shepherd

This is the talk I gave in the seminar “Process Reward Modeling in LLMs” at the University of Heidelberg. It consisted of a presentation followed by a short academic discussion with classmates and professors, covering the paper, experiments, and reproduction results. The paper is “Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations” (Wang et al., 2024). You can find the paper and slides here: Annotated Paper (PDF) | Preview Slides (PDF) ...

Reinforcement Learning

In machine learning, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning. (Wikipedia)

Notation

| Symbol | Name | Description |
| --- | --- | --- |
| $s_t$ | State | The observation/input from the environment at time $t$. |
| $a_t$ | Action | The decision made by the agent at time $t$. |
| $r_t$ | Reward | The feedback signal received after taking an action. |
| $\pi$ | Policy | The agent’s action selection strategy (a mapping from states to actions). |
| $\gamma$ | Discount Factor | A value in $[0, 1]$ that determines how much the agent values future rewards relative to immediate ones. |
| $T$ | Number of Steps | The length of one trajectory. |
| $G_t$ | Return | The total accumulated (and usually discounted) reward from time $t$ onwards. |
| $V(s)$ | Value Function | The expected return starting from state $s$. |
| $Q(s, a)$ | Q-Value | The expected return starting from state $s$ and taking action $a$. |
| $\theta$ | Parameters | The weights of the neural network representing the policy or value function. |
| $\alpha$ | Learning Rate | The step size used when updating the agent’s knowledge (parameters). |
| $\tau$ | Trajectory | A sequence of states, actions, and rewards $(s_0, a_0, r_0, s_1, \dots)$. |
| $J(\theta)$ | Objective Function | A measure of how good the current policy is (usually the expected total reward). |
| $\nabla_\theta$ | Gradient | The direction and magnitude of the change needed for $\theta$ to increase $J$. |
| $D$ | Dataset | Training dataset for supervised and unsupervised learning. |

Basics

In supervised and unsupervised learning, the model is trained on a static dataset to identify underlying patterns. The update signal is derived entirely from the fixed, provided data, and there is no interaction with an external system… ...

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Paper-reading notes: AlphaZero

Mastering the game of Go without human knowledge

Paper-reading notes: AlphaGo Zero

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper-reading notes: DeepSeek-R1 - Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Mastering the game of Go with MCTS and Deep Neural Networks

Paper-reading notes: Mastering the game of Go with MCTS and Deep Neural Networks