
Commit a9d5e62

adding PPO and GAE initils to rl
1 parent ba71c35 commit a9d5e62

2 files changed

Lines changed: 293 additions & 60 deletions

File tree

rl.html

Lines changed: 181 additions & 56 deletions
@@ -9,7 +9,6 @@
 <title>My Document Title</title>
 <style>
 html {
-font-size: 12pt;
 color: #1a1a1a;
 background-color: #fdfdfd;
 }
@@ -180,57 +179,21 @@ <h1 class="title">My Document Title</h1>
 <p class="date">2025-05-02</p>
 </header>
 <nav id="TOC" role="doc-toc">
-<ul>
-<li><a href="#short-intro-to-rl" id="toc-short-intro-to-rl">Short intro
-to RL</a>
-<ul>
-<li><a href="#reward-and-return" id="toc-reward-and-return">Reward and
-Return</a></li>
-<li><a href="#the-goal" id="toc-the-goal">The Goal</a></li>
-<li><a href="#a-few-useful-quantities"
-id="toc-a-few-useful-quantities">A few useful quantities</a></li>
-</ul></li>
-<li><a href="#policy-optimization-methods."
-id="toc-policy-optimization-methods.">Policy optimization methods.</a>
-<ul>
-<li><a href="#some-warmup." id="toc-some-warmup.">Some warmup.</a>
-<ul>
-<li><a href="#exact-policy-iteration."
-id="toc-exact-policy-iteration.">Exact policy iteration.</a></li>
-<li><a href="#policy-gradient." id="toc-policy-gradient.">Policy
-gradient.</a></li>
-</ul></li>
-<li><a href="#approximate-value-functions-methods."
-id="toc-approximate-value-functions-methods.">Approximate value
-functions methods.</a>
-<ul>
-<li><a href="#mixture-policies." id="toc-mixture-policies.">Mixture
-policies.</a></li>
-<li><a href="#beyond-mixture-policies---trpo"
-id="toc-beyond-mixture-policies---trpo">Beyond mixture policies -
-TRPO</a></li>
-<li><a href="#almost-trpo-ppo" id="toc-almost-trpo-ppo">Almost TRPO:
-PPO</a></li>
-</ul></li>
-<li><a href="#implementations-and-their-consequences."
-id="toc-implementations-and-their-consequences.">Implementations and
-their consequences.</a>
-<ul>
-<li><a href="#generalized-advantage-estimation."
-id="toc-generalized-advantage-estimation.">Generalized advantage
-estimation.</a></li>
-<li><a href="#grpo" id="toc-grpo">GRPO</a></li>
-</ul></li>
-</ul></li>
-<li><a href="#references" id="toc-references">References</a></li>
-</ul>
+
 </nav>
 <p><em>A few preliminary remarks.</em> This page is intended to be a
 work in progress, where I collect resources, derivations, ideas, etc.
 that I found useful when diving into the world of RL. In particular, it
 is not a hand-held, explain-everything type of introduction, but rather
 one that a person with some mathematical affinity (not much) could
 understand. As such, not everything is discussed in detail.</p>
+<p>One way I find it really helpful to think about RL, especially
+within the context of this note, is via the relationship between
+Differential Geometry and Riemannian Geometry: fundamentally, the ideas
+often come from the classical optimization literature, but they take on
+a different flavour due to the highly structured nature of the
+functions we are trying to optimize. I will try my best to give
+references to optimization material whenever it is appropriate to
+connect these fields.</p>
 <h1 id="short-intro-to-rl">Short intro to RL</h1>
 <p><em>This section is inspired by <a
 href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient">1</a>.</em></p>
@@ -300,6 +263,38 @@ <h3 id="a-few-useful-quantities">A few useful quantities</h3>
 action with respect to a policy <span
 class="math inline">\(\pi\)</span>: <span
 class="math display">\[A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)\]</span></p>
+<h3 id="temporal-difference-learning.">Temporal difference
+learning.</h3>
+<p>In practice the value function is not known, and can be complicated
+to compute explicitly. One way in which one can try to deduce it is the
+following. The value function satisfies the self-consistency relation
+<span class="math display">\[V_\pi(s) = E_{a \sim \pi(\cdot | s) }
+\left[ r(s, a) + \gamma V_\pi(s&#39;) \right]\]</span> Thus, given a
+guess <span class="math inline">\(V\)</span> for the value function,
+one can tweak <span class="math inline">\(V\)</span> to promote
+self-consistency:</p>
+<pre><code>Initialize V randomly
+
+while not converged:
+    Pick a state s randomly
+    Pick an action a according to pi, observe r(s, a) and s&#39;
+    V(s) := (1 - alpha) V(s) + alpha (r(s, a) + gamma V(s&#39;))</code></pre>
+<p>We achieve perfect self-consistency if the <em>TD-residual of V with
+discount <span class="math inline">\(\gamma\)</span></em>, given by
+<span class="math display">\[\delta_{a}^{V_{\pi, \gamma}}(s, s&#39;) =
+r(s, a) + \gamma V_{\pi}(s&#39;) - V_{\pi}(s)\]</span> satisfies <span
+class="math inline">\(E_{a \sim \pi(\cdot|s)} [\delta_{a}^{V_{\pi,
+\gamma}}(s, s&#39;)] = 0 \quad (\dagger)\)</span>.</p>
+<p>The TD-residual has another application, which we will see in the
+section on <a href="#generalized-advantage-estimation.">GAE</a>. But to
+give you a foretaste, let’s fix <span class="math inline">\(s\)</span>
+and <span class="math inline">\(a\)</span> in <span
+class="math inline">\((\dagger)\)</span>, and take the expectation with
+respect to <span class="math inline">\(s&#39;\)</span>. We get: <span
+class="math display">\[\mathbb{E}_{s&#39;}[\delta_a^{V_{\pi, \gamma}}] =
+A_{\pi, \gamma}(s, a)\]</span> Without giving too much away, this idea
+can be iterated to reduce the variance of this estimation.</p>
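<p>To make the update above concrete, here is a minimal tabular TD(0)
sketch in Python (the <code>env.reset</code>/<code>env.step</code>
interface, the function name and the use of NumPy are assumptions for
illustration, not part of this note):</p>
<pre><code>import numpy as np

def td0_value_estimate(env, policy, n_states, gamma=0.99, alpha=0.1, episodes=1000):
    """Tabular TD(0): nudge V(s) towards r + gamma * V(s_next) after every step."""
    V = 0.01 * np.random.randn(n_states)          # initialize V randomly
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                          # pick an action according to pi
            s_next, r, done = env.step(a)          # observe reward and next state
            target = r + gamma * V[s_next] * (not done)
            V[s] = (1 - alpha) * V[s] + alpha * target
            s = s_next
    return V</code></pre>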
 <h1 id="policy-optimization-methods.">Policy optimization methods.</h1>
 <p>The methods we are going to encounter are all based on iteratively
 improving a policy that we currently have to a “better one”. To make the
@@ -363,6 +358,55 @@ <h3 id="policy-gradient.">Policy gradient.</h3>
 &amp;= E_{\tau \sim \pi_\theta} [\left( \sum_i \nabla_\theta \log
 \pi_{\theta}(a_i | s_i) \right) R(\tau)]
 \end{align*}\]</span></p>
+<p>Set <span class="math inline">\(R_i(\tau) = \sum_{j \geq i}
+r_j\)</span>. Then, <span class="math inline">\(R(\tau) = \sum_{j &lt;
+i} r_j + R_i(\tau)\)</span>, and consequently: <span
+class="math display">\[\begin{align*}
+E_{\tau \sim \pi_\theta} [\left( \sum_i \nabla_\theta \log
+\pi_{\theta}(a_i | s_i) \right) R(\tau)] &amp;= \sum_i E_{\tau \sim
+\pi_\theta}[\left(\nabla_\theta \log \pi_{\theta}(a_i | s_i) \right)
+R(\tau)]\\
+&amp;= \sum_i E_{\tau \sim \pi_\theta}[\left(\nabla_\theta \log
+\pi_{\theta}(a_i | s_i) \right) (R_i(\tau) + \sum_{j&lt;i} r_j)]\\
+&amp;= \sum_i E_{\tau \sim \pi_\theta}[\left(\nabla_\theta \log
+\pi_{\theta}(a_i | s_i) \right) R_i(\tau)]
+\end{align*}\]</span></p>
+<p>The last equality follows from the fact that the rewards <span
+class="math inline">\(r_j\)</span> for <span class="math inline">\(j
+&lt; i\)</span> do not depend on the action <span
+class="math inline">\(a_i\)</span>: conditioning on the past and using
+that the score <span class="math inline">\(\nabla_\theta \log
+\pi_\theta(a_i|s_i)\)</span> has zero mean (see below), their
+contribution vanishes. This gives the <strong>past independent
+form</strong> of the policy gradient: <span
+class="math display">\[\boxed{\nabla_\theta J(\theta) = E_{\tau \sim
+\pi_\theta} \left[ \sum_i \nabla_\theta \log \pi_{\theta}(a_i | s_i)
+R_i(\tau) \right]}\]</span></p>
+<p>Another useful trick is to notice that for <span
+class="math inline">\(s\)</span> fixed, <span
+class="math inline">\(\pi_\theta(\cdot|s)\)</span> becomes a
+parameterized distribution on <span class="math inline">\(A\)</span>. It
+is a general result that <span class="math inline">\(E_{a \sim
+\pi_\theta(\cdot|s)} [\nabla_\theta \log \pi_\theta(a|s)] = 0\)</span>.
+Upon multiplying this with a baseline <span
+class="math inline">\(b(s)\)</span>, constant in <span
+class="math inline">\(a\)</span>, we obtain that <span
+class="math display">\[E_{a \sim \pi_\theta(\cdot|s)} [\nabla_\theta \log
+\pi_\theta(a|s) b(s)] = 0\]</span> We can also write <span
+class="math display">\[\begin{align*}
+E_{\tau \sim \pi_{\theta}}[\nabla_\theta \log \pi_\theta(a_i|s_i)
+b(s_i)] &amp;= \sum_{s, a} P(s_i = s, a_i = a) \nabla_\theta \log
+\pi_\theta(a|s) b(s)\\
+&amp;= \sum_s P(s_i = s) \left[ \sum_a P(a_i = a | s_i = s)
+\nabla_\theta \log \pi_\theta(a|s) b(s) \right]\\
+&amp;= \sum_s P(s_i = s) \left[ E_{a \sim \pi_\theta(\cdot|s)}
+[\nabla_\theta \log \pi_\theta(a|s) b(s)] \right] = 0
+\end{align*}\]</span></p>
+<p>Using this claim, we can derive another unbiased estimator for the
+policy gradient:</p>
+<p><span class="math display">\[\boxed{\nabla_\theta J(\theta) = E_{\tau
+\sim \pi_\theta} \left[ \sum_i \nabla_\theta \log \pi_{\theta}(a_i |
+s_i) \left( R_i(\tau) - b(s_i) \right) \right]} \]</span></p>
+<p>The main difference between these estimators is that some have lower
+variance than others. While it is not instrumental to know all these
+derivations inside out, it is useful to be aware of the different forms
+of the policy gradient.</p>
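<p>To make the reward-to-go form concrete, here is a minimal
single-trajectory sketch of the boxed estimator with a baseline (the
tabular softmax policy, the function names and the NumPy usage are
assumptions for illustration, not part of this note):</p>
<pre><code>import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """R_i = sum over j of gamma^(j-i) * r_j for j at or after i, computed backwards."""
    out = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        out[i] = running
    return out

def grad_log_softmax(theta, s, a):
    """grad_theta log pi_theta(a|s) for a tabular softmax policy,
    theta has shape (n_states, n_actions)."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    g = np.zeros_like(theta)
    g[s] = -probs
    g[s, a] += 1.0
    return g

def policy_gradient_estimate(theta, states, actions, rewards, baseline):
    """One-sample estimate: sum_i grad log pi(a_i|s_i) * (R_i - b(s_i))."""
    R = rewards_to_go(rewards)
    grad = np.zeros_like(theta)
    for i, (s, a) in enumerate(zip(states, actions)):
        grad += grad_log_softmax(theta, s, a) * (R[i] - baseline[s])
    return grad</code></pre>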
 <h2 id="approximate-value-functions-methods.">Approximate value
 functions methods.</h2>
 <p>One major drawback of Exact Policy Iteration is the fact that it is
@@ -446,8 +490,8 @@ <h3 id="mixture-policies.">Mixture policies.</h3>
 \varepsilon}{1- \gamma(1 - \alpha)} \right)\]</span></p>
 <h3 id="beyond-mixture-policies---trpo">Beyond mixture policies -
 TRPO</h3>
-<p><em>The idea of this section was originally published in <a
-href="https://arxiv.org/abs/1502.05477">here</a></em>.</p>
+<p><em>The idea of this section was originally published <a
+href="https://arxiv.org/abs/1502.05477">here</a>.</em></p>
 <p>Fundamentally, we are motivated by the following thought, that
 extends the result of the previous section:</p>
 <blockquote>
@@ -504,16 +548,93 @@ <h3 id="beyond-mixture-policies---trpo">Beyond mixture policies -
 pi_new = argmax_pi L_{pi_i}(pi)
 such that CD_KL_max (pi_old, pi) &lt;= delta</code></pre>
 </blockquote>
-<p>One final thing to note is that one rarely encounters TRPO in the
-form we have just described. The following form of the objective is
-used:</p>
+<p><strong>Implementation.</strong> We have enough machinery now to
+think about how we might want to implement this method. We need to come
+up with a way to approximate <span
+class="math display">\[L_{\theta_{\text{old}}}(\theta) = \sum_s
+\rho_{\text{old}}(s) \sum_a \pi_\theta(a | s)
+A_{\theta_{\text{old}}}(s,a)\]</span></p>
+<p>The first and most obvious thing that needs to happen is
+approximating <span class="math inline">\(\sum_s
+\rho_{\text{old}}(s)[\cdots]\)</span> with a Monte Carlo estimate based
+on our data.</p>
+<p>Then, we need a way to estimate the advantage <span
+class="math inline">\(A_{\theta_{\text{old}}}\)</span>. Though there are
+several ways to do this, for now we are simply going to stick to a
+simple approximation: <span
+class="math display">\[\hat{A}_{\theta_{\text{old}}} =
+Q_{\theta_{\text{old}}},\]</span> which is <strong>only correct up to an
+additive constant</strong>, but that is good enough for our purposes.
+For another method see the section on <a
+href="#generalized-advantage-estimation.">GAE</a>.</p>
+<p>Finally, we replace the sum over actions with an importance-sampling
+estimate under a sampling distribution <span
+class="math inline">\(q\)</span>.</p>
+<p>These three steps together give the empirically computable version of
+<span class="math inline">\(L\)</span>: <span
+class="math display">\[L_{\theta_{\text{old}}}(\theta) = \mathbb{E}_{s \sim
+\rho_{\text{old}}, a \sim q} \left[ \frac{\pi_\theta(a|s)}{q(a|s)}
+Q_{\theta_{\text{old}}}(s,a) \right]\]</span></p>
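<p>As a quick illustration, here is a minimal sketch of this
importance-sampled surrogate in the common single-path setting where
<span class="math inline">\(q\)</span> is taken to be <span
class="math inline">\(\pi_{\theta_{\text{old}}}\)</span> itself (the
PyTorch usage and tensor names are assumptions for illustration, not
part of this note):</p>
<pre><code>import torch

def trpo_surrogate(logp_new, logp_old, q_hat):
    """Monte Carlo estimate of L_{theta_old}(theta): mean over sampled (s, a)
    of (pi_theta(a|s) / q(a|s)) * Q_hat(s, a), with q = pi_old, so the
    importance ratio is exp(logp_new - logp_old)."""
    ratio = torch.exp(logp_new - logp_old.detach())
    return (ratio * q_hat).mean()   # TRPO maximizes this under a KL constraint</code></pre>
<p>TRPO proper then maximizes this surrogate subject to the KL
constraint from the box above, typically via a conjugate-gradient
(natural-gradient) step, which is beyond the scope of this sketch.</p>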
+<p><strong>Some practical considerations.</strong> It is now a fitting
+time to think about practical aspects of the implementation.</p>
+<p>Another way of approximating <span
+class="math inline">\(L_{\theta_{\text{old}}}\)</span> can be found in
+the article <a href="asd">asd</a>.</p>
 <h3 id="almost-trpo-ppo">Almost TRPO: PPO</h3>
-<h2 id="implementations-and-their-consequences.">Implementations and
-their consequences.</h2>
+<p><em>See <a href="https://arxiv.org/abs/1707.06347">Schulman et
+al. 2017</a>.</em></p>
+<p>The conclusion of the path-sampling strategy in the <a
+href="#beyond-mixture-policies---trpo">previous section</a> was
+optimizing the objective (we omit <span
+class="math inline">\(\theta_{\text{old}}\)</span> from the notation,
+since it is clear what we mean) <span class="math display">\[L(\theta) =
+\mathbb{E} \left[ \frac{\pi_\theta(a_t |
+s_t)}{\pi_{\theta_{\text{old}}}(a_t| s_t)} \hat{A}_t \right]\]</span>
+For ease of notation, we denote the quantity <span
+class="math inline">\(\frac{\pi_\theta(a_t |
+s_t)}{\pi_{\theta_{\text{old}}}(a_t| s_t)}\)</span> by <span
+class="math inline">\(r_t(\theta)\)</span>. Intuitively, if <span
+class="math inline">\(r_t(\theta)\)</span> is far from <span
+class="math inline">\(1\)</span>, the updates can be jerky and large.
+This is undesirable for many reasons, which we won’t go into in too much
+detail, but here is an illustrative tale from the world of
+optimization.</p>
+<blockquote>
+<p>tale from opti</p>
+</blockquote>
+<p>One way to avoid this is to disincentivise the model from making
+large changes to the policy. As you can (and should) check, the
+following objective does exactly that:</p>
+<p><span class="math display">\[L(\theta) = \mathbb{E} \left[ \min
+\left( r_t(\theta) \hat{A}_t, \operatorname{clip}(r_t(\theta), 1 -
+\epsilon, 1 + \epsilon) \hat{A}_t \right) \right]\]</span></p>
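<p>As a quick sanity check, here is a minimal sketch of how this clipped
objective is usually written in code (the PyTorch usage, tensor names
and the default <code>eps</code> are assumptions for illustration, not
part of this note):</p>
<pre><code>import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: E[min(r * A, clip(r, 1-eps, 1+eps) * A)],
    returned negated so it can be minimized by a standard optimizer."""
    ratio = torch.exp(logp_new - logp_old)              # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip the ratio
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()</code></pre>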
 <h3 id="generalized-advantage-estimation.">Generalized advantage
 estimation.</h3>
+<p><em>This section is based on <a
+href="https://arxiv.org/abs/1506.02438">Schulman et
+al. 2015</a>.</em></p>
+<p>We have already seen the need for a way to approximate the advantage
+<span class="math inline">\(A_t\)</span> of an action compared to
+some reference policy. In the case of TRPO we chose one of the easiest
+routes. We now describe a more elaborate one.</p>
+<p>As promised in the section on <a
+href="#temporal-difference-learning.">Temporal Difference Learning</a>,
+the core idea comes from the fact that <span
+class="math inline">\(\mathbb{E}_{s&#39;}[\delta_a^{V_{\pi, \gamma}}] =
+A_{\pi, \gamma}(s, a)\)</span>. In fact, we consider the estimates <span
+class="math inline">\(\hat{A}_t^{(k)}\)</span> given by: <span
+class="math display">\[\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l
+\delta_{t+l}^{V} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1}
+r_{t+k-1} + \gamma^k V(s_{t+k}),\]</span> where <span
+class="math inline">\(\delta_t^{V} = r_t + \gamma V(s_{t+1}) -
+V(s_t)\)</span> is the TD-residual at time <span
+class="math inline">\(t\)</span>.</p>
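<p>Following the cited paper, these <span
class="math inline">\(k\)</span>-step estimates are combined with
exponential weights <span class="math inline">\(\lambda\)</span> into
<span class="math inline">\(\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l
\geq 0} (\gamma\lambda)^l \delta_{t+l}^{V}\)</span>, which can be
computed in a single backward pass over a trajectory. A minimal sketch
(the array names and NumPy usage are assumptions for illustration):</p>
<pre><code>import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.
    values has length T+1 (it includes a bootstrap value for the final state);
    returns A_t = sum of (gamma*lam)^l * delta_{t+l} over l at or above 0."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD-residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv</code></pre>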
 <h3 id="grpo">GRPO</h3>
-<h1 id="references">References</h1>
+<p><em>This section is based on <a
+href="https://arxiv.org/abs/2402.03300">Shao et al. 2024</a>.</em></p>
+<p>The fundamental idea here can be explained as follows.</p>
+<blockquote>
+<p>In PPO (and TRPO for that matter), we need a way to approximate the
+advantage <span class="math inline">\(A_\pi(s, a)\)</span>. We could
+simply use <span class="math inline">\(Q_\pi(s,a)\)</span>, but this
+usually has high variance. We could instead use GAE, but then one also
+needs to train a value model <span class="math inline">\(V\)</span>,
+which can be similar in size and complexity to the main policy
+model.</p>
+</blockquote>
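<p>To illustrate the route GRPO takes instead (as described in the
reference above): sample a group of completions per prompt, score them,
and use the group statistics as the baseline, so that no separate value
model is needed. A minimal sketch of this group-relative advantage (the
function name, the normalization constant and the NumPy usage are
assumptions for illustration):</p>
<pre><code>import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: array of shape (n_prompts, group_size), one scalar reward per
    sampled completion. The group mean acts as the baseline; each advantage
    is shared by all tokens of its completion."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)</code></pre>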
+<hr />
+<p><em>References.</em></p>
 <p><a
 href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient">OpenAI
 Spinning up</a></p>
@@ -522,8 +643,12 @@ <h1 id="references">References</h1>
 Learning. In Proceedings of the Nineteenth International Conference on
 Machine Learning (ICML ’02). Morgan Kaufmann Publishers Inc., San
 Francisco, CA, USA, 267–274.</a></p>
-<p><a href="https://dl.acm.org/doi/10.5555/645531.656005">Schulman, J.,
-Levine, S., Abbeel, P., Jordan, M.I., &amp; Moritz, P. (2015). Trust
-Region Policy Optimization. ArXiv, abs/1502.05477.</a></p>
+<p><a href="https://arxiv.org/abs/1502.05477">Schulman, J., Levine, S.,
+Abbeel, P., Jordan, M.I., &amp; Moritz, P. (2015). Trust Region Policy
+Optimization. arXiv preprint arXiv:1502.05477.</a></p>
+<p><a href="https://arxiv.org/abs/1506.02438">Schulman, J., Moritz, P.,
+Levine, S., Jordan, M., &amp; Abbeel, P. (2015). High-Dimensional
+Continuous Control Using Generalized Advantage Estimation. arXiv
+preprint arXiv:1506.02438.</a></p>
 </body>
 </html>
