<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Posts on Tan Ke</title>
    <link>https://mrtanke.github.io/tags/posts/</link>
    <description>Recent content in Posts on Tan Ke</description>
    <generator>Hugo -- 0.160.1</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 19 Apr 2026 21:47:41 +0000</lastBuildDate>
    <atom:link href="https://mrtanke.github.io/tags/posts/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Reinforcement Learning</title>
      <link>https://mrtanke.github.io/posts/2026-04-19-reinforcement-learning/</link>
      <pubDate>Sun, 19 Apr 2026 21:47:41 +0000</pubDate>
      <guid>https://mrtanke.github.io/posts/2026-04-19-reinforcement-learning/</guid>
      <description>&lt;p&gt;In &lt;a href=&#34;https://en.wikipedia.org/wiki/Machine_learning&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;machine learning&lt;/a&gt;, &lt;strong&gt;reinforcement learning&lt;/strong&gt; (&lt;strong&gt;RL&lt;/strong&gt;) is concerned with how an &lt;a href=&#34;https://en.wikipedia.org/wiki/Intelligent_agent&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;intelligent agent&lt;/a&gt; should &lt;a href=&#34;https://en.wikipedia.org/wiki/Action_selection&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;take actions&lt;/a&gt; in a dynamic environment in order to &lt;a href=&#34;https://en.wikipedia.org/wiki/Reward-based_selection&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;maximize a reward&lt;/a&gt; signal. Reinforcement learning is one of the &lt;a href=&#34;https://en.wikipedia.org/wiki/Machine_learning#Approaches&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;three basic machine learning paradigms&lt;/a&gt;, alongside &lt;a href=&#34;https://en.wikipedia.org/wiki/Supervised_learning&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;supervised learning&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Unsupervised_learning&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;unsupervised learning&lt;/a&gt;. (&lt;a href=&#34;https://en.wikipedia.org/wiki/Reinforcement_learning&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Wikipedia&lt;/a&gt;)&lt;/p&gt;
&lt;h1 id=&#34;notation&#34;&gt;Notation&lt;/h1&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;&lt;strong&gt;Symbol&lt;/strong&gt;&lt;/th&gt;
          &lt;th&gt;&lt;strong&gt;Name&lt;/strong&gt;&lt;/th&gt;
          &lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;$s_t$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The observation/input from the environment at time $t$.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$a_t$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The decision made by the agent at time $t$.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$r_t$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Reward&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The feedback signal received after taking an action.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$\pi$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Policy&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The agent&amp;rsquo;s action selection strategy (a mapping from states to actions).&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$\gamma$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Discount Factor&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;A value in $[0, 1]$ that determines how much the agent values future rewards relative to immediate ones.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$T$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Number of Steps&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The length of one trajectory.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$G_t$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Return&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The total accumulated (and usually discounted) reward from time $t$ onwards.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$V(s)$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Value Function&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The expected return starting from state $s$.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$Q(s, a)$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Q-Value&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The expected return starting from state $s$ and taking action $a$.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$\theta$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Parameters&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The weights of the neural network representing the policy or value function.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$\alpha$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Learning Rate&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The step size used when updating the agent&amp;rsquo;s knowledge (parameters).&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$\tau$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Trajectory&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;A sequence of states, actions, and rewards $(s_0, a_0, r_0, s_1, \dots)$.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$J(\theta)$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Objective Function&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;A measure of how good the current policy is (usually the expected total reward).&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$\nabla_\theta$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Gradient&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The gradient with respect to $\theta$; $\nabla_\theta J$ gives the direction and magnitude of the parameter change that increases $J$ fastest.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;$D$&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Dataset&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;The training dataset used in supervised and unsupervised learning.&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
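&lt;p&gt;To tie these symbols together, the return, the value functions, and the objective are conventionally related as follows (standard definitions, stated here for reference in the table&amp;rsquo;s notation):&lt;/p&gt;
&lt;p&gt;$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$$&lt;/p&gt;
&lt;p&gt;$$V(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right], \qquad Q(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s, a_t = a\right], \qquad J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G_0\right]$$&lt;/p&gt;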
&lt;h1 id=&#34;basics&#34;&gt;Basics&lt;/h1&gt;
&lt;p&gt;In supervised and unsupervised learning, the model is trained on a static dataset $D$ to identify underlying patterns. The update signal is derived entirely from this fixed, provided data; there is no interaction with an external system…&lt;/p&gt;
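&lt;p&gt;To make the contrast with reinforcement learning concrete, here is a minimal sketch of the agent-environment interaction loop, assuming a Gymnasium-style API (&lt;code&gt;env.reset&lt;/code&gt;/&lt;code&gt;env.step&lt;/code&gt;) and a placeholder random policy; the environment name and loop structure are illustrative assumptions, not taken from the post:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import gymnasium as gym  # assumed dependency: any Gymnasium-style environment works

env = gym.make(&#34;CartPole-v1&#34;)     # hypothetical example environment

gamma = 0.99                      # discount factor gamma
state, info = env.reset(seed=0)   # s_0: initial observation
trajectory = []                   # tau: the (s_t, a_t, r_t) sequence
G = 0.0                           # discounted return G_0
discount = 1.0

done = False
while not done:
    action = env.action_space.sample()  # placeholder policy pi: act at random
    next_state, reward, terminated, truncated, info = env.step(action)
    trajectory.append((state, action, reward))
    G += discount * reward              # accumulate gamma^k * r_k into G_0
    discount *= gamma
    state = next_state
    done = terminated or truncated      # T = episode length, set by the env

print(f&#34;T = {len(trajectory)} steps, discounted return G_0 = {G:.2f}&#34;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unlike a pass over a fixed dataset $D$, each step here both produces the next training signal ($r_t$) and changes what data the agent will see next.&lt;/p&gt;</description>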
    </item>
  </channel>
</rss>
