
Commit fbc597f ("rl"), 1 parent 9e8cc05

1 file changed: rl.html (70 additions & 2 deletions)
@@ -193,8 +193,13 @@ <h1 class="title">My Document Title</h1>
 <li><a href="#policy-optimization-methods."
 id="toc-policy-optimization-methods.">Policy optimization methods.</a>
 <ul>
+<li><a href="#some-warmup." id="toc-some-warmup.">Some warmup.</a>
+<ul>
 <li><a href="#exact-policy-iteration."
 id="toc-exact-policy-iteration.">Exact policy iteration.</a></li>
+<li><a href="#policy-gradient." id="toc-policy-gradient.">Policy
+gradient.</a></li>
+</ul></li>
 <li><a href="#approximate-value-functions-methods."
 id="toc-approximate-value-functions-methods.">Approximate value
 functions methods.</a>
@@ -204,6 +209,17 @@ <h1 class="title">My Document Title</h1>
 <li><a href="#beyond-mixture-policies---trpo"
 id="toc-beyond-mixture-policies---trpo">Beyond mixture policies -
 TRPO</a></li>
+<li><a href="#almost-trpo-ppo" id="toc-almost-trpo-ppo">Almost TRPO:
+PPO</a></li>
+</ul></li>
+<li><a href="#implementations-and-their-consequences."
+id="toc-implementations-and-their-consequences.">Implementations and
+their consequences.</a>
+<ul>
+<li><a href="#generalized-advantage-estimation."
+id="toc-generalized-advantage-estimation.">Generalized advantage
+estimation.</a></li>
+<li><a href="#grpo" id="toc-grpo">GRPO</a></li>
 </ul></li>
 </ul></li>
 <li><a href="#references" id="toc-references">References</a></li>
@@ -299,14 +315,54 @@ <h1 id="policy-optimization-methods.">Policy optimization methods.</h1>
 <li>We want to be able to investigate the asymptotic behavior of the
 method.</li>
 </ol>
-<h2 id="exact-policy-iteration.">Exact policy iteration.</h2>
+<h2 id="some-warmup.">Some warmup.</h2>
+<h3 id="exact-policy-iteration.">Exact policy iteration.</h3>
 <p>The basic idea is the following. If we have access to the exact value
 function, given a policy <span class="math inline">\(\pi\)</span>, we
 can compute <span class="math inline">\(Q_\pi(s, a)\)</span> explicitly
 for each <span class="math inline">\((s,a)\)</span> pair, and create a
 new deterministic policy <span class="math inline">\(\pi&#39;\)</span>,
 such that <span class="math display">\[\pi&#39;(a;s) = \text{$1$ iff $a
 = \arg \max_a Q_\pi(s,a)$, $0$ otherwise}\]</span></p>
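A minimal sketch of this greedy improvement step, assuming the exact action values are already available as a (num_states, num_actions) NumPy array; the array layout and names are illustrative, not from rl.html:

```python
import numpy as np

def greedy_policy_improvement(Q):
    """Build the deterministic policy pi'(a; s): 1 for the argmax action of
    Q_pi(s, .), 0 for every other action."""
    num_states, num_actions = Q.shape
    pi_new = np.zeros((num_states, num_actions))
    pi_new[np.arange(num_states), Q.argmax(axis=1)] = 1.0
    return pi_new
```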
+<h3 id="policy-gradient.">Policy gradient.</h3>
+<p>To take the theory out for a spin, consider the following. Let <span
+class="math inline">\(\pi_\theta\)</span> be a parametric family of
+policies. Suppose that the initial state of the universe is determined
+by some initial distribution <span
+class="math inline">\(\rho_0\)</span>. Then, our aim is to find the
+policy which maximizes the expression <span
+class="math display">\[J(\theta) = E_{\tau \sim \pi_\theta}
+[R(\tau)]\]</span> An intuitive way to do this would be to compute the
+gradient <span class="math inline">\(\nabla_\theta J(\theta)\)</span>,
+and apply some gradient-based optimization method.<br />
+Let us start by investigating how changing <span class="math inline">\(\theta\)</span> affects the evolution
+of the universe by thinking about <span
+class="math inline">\(\nabla_\theta P(\tau | \theta)\)</span>. By the
+<em>“log-derivative trick”</em>: <span
+class="math display">\[\nabla_\theta P(\tau | \theta) = P(\tau |
+\theta) \left( \nabla_\theta \log P(\tau | \theta) \right)
+\]</span> The probability of a trajectory <span
+class="math inline">\(\tau = (s_0, a_0, \ldots)\)</span> given an
+initial distribution <span class="math inline">\(\rho_0\)</span> is
+given by: <span class="math display">\[P(\tau | \theta) = \rho_0(s_0)
+\prod_i P(s_{i+1} ; s_i, a_i) \pi_{\theta}(a_i | s_i)\]</span> This way
+<span class="math display">\[\log P(\tau | \theta) = \log \rho_0 (s_0) +
+\sum_i \log P(s_{i+1}; s_i, a_i) + \log \pi_{\theta}(a_i|s_i)\]</span>
+So if we now take the derivative of this with respect to <span
+class="math inline">\(\theta\)</span>, all the <span
+class="math inline">\(P(s_{i+1}; s_i, a_i)\)</span> terms will vanish,
+since they are not dependent on the parameter <span
+class="math inline">\(\theta\)</span>. Hence: <span
+class="math display">\[\nabla_\theta P(\tau | \theta) = P(\tau | \theta)
+\left( \sum_i \nabla_\theta \log \pi_{\theta}(a_i | s_i)
+\right)\]</span> So then, <span class="math display">\[\begin{align*}
+\nabla_\theta J(\theta) &amp;= \nabla_\theta E_{\tau \sim \pi_\theta} [R(\tau)]\\
+&amp;= \nabla_\theta \int_\tau P(\tau | \theta) R(\tau)\\
+&amp;= \int_\tau P(\tau | \theta) \left( \sum_i \nabla_\theta \log
+\pi_{\theta}(a_i | s_i) \right) R(\tau)\\
+&amp;= E_{\tau \sim \pi_\theta} [\left( \sum_i \nabla_\theta \log
+\pi_{\theta}(a_i | s_i) \right) R(\tau)]
+\end{align*}\]</span></p>
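A minimal Monte Carlo sketch of the resulting estimator, averaging (sum_i grad log pi_theta(a_i|s_i)) R(tau) over sampled trajectories; the `policy` callable returning a `torch.distributions` object is an illustrative assumption, not part of rl.html:

```python
import torch

def policy_gradient_surrogate(policy, trajectories):
    """Return a scalar whose gradient w.r.t. the policy parameters is the
    Monte Carlo estimate of grad J = E_tau[(sum_i grad log pi(a_i|s_i)) R(tau)].
    Each trajectory is a (states, actions, total_return) tuple."""
    terms = []
    for states, actions, total_return in trajectories:
        log_probs = policy(states).log_prob(actions)   # log pi_theta(a_i | s_i) for each step
        terms.append(log_probs.sum() * total_return)   # sum_i log pi(a_i|s_i) weighted by R(tau)
    return torch.stack(terms).mean()

# Gradient ascent step (sketch):
#   loss = -policy_gradient_surrogate(policy, batch); loss.backward(); optimizer.step()
```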
 <h2 id="approximate-value-functions-methods.">Approximate value
 functions methods.</h2>
 <p>One major drawback of Exact Policy Iteration is the fact that it is
@@ -346,7 +402,7 @@ <h2 id="approximate-value-functions-methods.">Approximate value
 way, prove that <span class="math inline">\(L_\pi(\pi_\theta)\)</span>
 agrees with <span class="math inline">\(J(\pi_\theta)\)</span> up to
 first order.</p>
-<p>There is a natural question to ask now.</p>
+<p>There is a natural question to ask now:</p>
 <blockquote>
 <p>We could simply optimize <span class="math inline">\(J\)</span> by
 substituting it with <span class="math inline">\(L\)</span>, and using
@@ -444,7 +500,19 @@ <h3 id="beyond-mixture-policies---trpo">Beyond mixture policies -
 class="math inline">\(C\)</span> as suggested above, the step sizes of
 the method end up being really small. Instead, one can consider a <a
 href="trust_region_type_methods">trust-region type method</a>:</p>
+<pre><code>while (not converged):
+    pi_new = argmax_pi L_{pi_old}(pi)
+             such that D_KL_max(pi_old, pi) &lt;= delta</code></pre>
 </blockquote>
+<p>One final thing to note is that one rarely encounters TRPO in the
+form we have just described. The following form of the objective is
+used:</p>
+<h3 id="almost-trpo-ppo">Almost TRPO: PPO</h3>
+<h2 id="implementations-and-their-consequences.">Implementations and
+their consequences.</h2>
+<h3 id="generalized-advantage-estimation.">Generalized advantage
+estimation.</h3>
+<h3 id="grpo">GRPO</h3>
 <h1 id="references">References</h1>
 <p><a
 href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient">OpenAI
