@@ -193,8 +193,13 @@ <h1 class="title">My Document Title</h1>
 <li><a href="#policy-optimization-methods."
 id="toc-policy-optimization-methods.">Policy optimization methods.</a>
 <ul>
+<li><a href="#some-warmup." id="toc-some-warmup.">Some warmup.</a>
+<ul>
 <li><a href="#exact-policy-iteration."
 id="toc-exact-policy-iteration.">Exact policy iteration.</a></li>
+<li><a href="#policy-gradient." id="toc-policy-gradient.">Policy
+gradient.</a></li>
+</ul></li>
 <li><a href="#approximate-value-functions-methods."
 id="toc-approximate-value-functions-methods.">Approximate value
 functions methods.</a>
@@ -204,6 +209,17 @@ <h1 class="title">My Document Title</h1>
 <li><a href="#beyond-mixture-policies---trpo"
 id="toc-beyond-mixture-policies---trpo">Beyond mixture policies -
 TRPO</a></li>
+<li><a href="#almost-trpo-ppo" id="toc-almost-trpo-ppo">Almost TRPO:
+PPO</a></li>
+</ul></li>
+<li><a href="#implementations-and-their-consequences."
+id="toc-implementations-and-their-consequences.">Implementations and
+their consequences.</a>
+<ul>
+<li><a href="#generalized-advantage-estimation."
+id="toc-generalized-advantage-estimation.">Generalized advantage
+estimation.</a></li>
+<li><a href="#grpo" id="toc-grpo">GRPO</a></li>
 </ul></li>
 </ul></li>
 <li><a href="#references" id="toc-references">References</a></li>
@@ -299,14 +315,54 @@ <h1 id="policy-optimization-methods.">Policy optimization methods.</h1>
 <li>We want to be able to investigate the asymptotic behavior of the
 method.</li>
 </ol>
-<h2 id="exact-policy-iteration.">Exact policy iteration.</h2>
+<h2 id="some-warmup.">Some warmup.</h2>
+<h3 id="exact-policy-iteration.">Exact policy iteration.</h3>
 <p>The basic idea is the following. If we have access to the exact value
 function, given a policy <span class="math inline">\(\pi\)</span>, we
 can compute <span class="math inline">\(Q_\pi(s, a)\)</span> explicitly
 for each <span class="math inline">\((s,a)\)</span> pair, and create a
 new deterministic policy <span class="math inline">\(\pi'\)</span>,
 such that <span class="math display">\[\pi'(a; s) = \begin{cases} 1 &
 \text{if } a = \arg\max_{a'} Q_\pi(s, a') \\ 0 & \text{otherwise}
 \end{cases}\]</span></p>
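+<p>A minimal tabular sketch of one way this could look, assuming known
+transition probabilities <code>P[s, a, s']</code>, expected rewards
+<code>R[s, a]</code>, and a discount factor <code>gamma</code> (all
+names are illustrative, not fixed by anything above):</p>
+<pre><code>import numpy as np
+
+def exact_policy_iteration(P, R, gamma, iters=100):
+    n_states, n_actions = R.shape
+    pi = np.zeros(n_states, dtype=int)            # deterministic policy: state -> action
+    for _ in range(iters):
+        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
+        P_pi = P[np.arange(n_states), pi]         # (S, S') transitions under pi
+        R_pi = R[np.arange(n_states), pi]         # (S,) rewards under pi
+        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
+        # Policy improvement: Q_pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) V(s').
+        Q = R + gamma * P @ V                     # shape (S, A)
+        new_pi = Q.argmax(axis=1)                 # the greedy pi' described above
+        if np.array_equal(new_pi, pi):            # greedy policy is stable: done
+            break
+        pi = new_pi
+    return pi, V</code></pre>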
+<h3 id="policy-gradient.">Policy gradient.</h3>
+<p>To take the theory out for a spin, consider the following. Let <span
+class="math inline">\(\pi_\theta\)</span> be a parametric family of
+policies. Suppose that the initial state of the universe is determined
+by some initial distribution <span
+class="math inline">\(\rho_0\)</span>. Then, our aim is to find the
+policy which maximizes the expression <span
+class="math display">\[J(\theta) = E_{\tau \sim \pi_\theta}
+[R(\tau)]\]</span> An intuitive way to do this would be to compute the
+gradient <span class="math inline">\(\nabla_\theta J(\theta)\)</span>
+and perform some form of gradient ascent.<br />
+Let us start by investigating how changing <span
+class="math inline">\(\theta\)</span> affects the evolution of the
+universe by thinking about <span
+class="math inline">\(\nabla_\theta P(\tau | \theta)\)</span>. By the
+<em>“log-derivative trick”</em>: <span
+class="math display">\[\nabla_\theta P(\tau | \theta) = P(\tau |
+\theta) \left( \nabla_\theta \log P(\tau | \theta) \right)\]</span>
+The probability of a trajectory <span
+class="math inline">\(\tau = (s_0, a_0, \ldots)\)</span> given an
+initial distribution <span class="math inline">\(\rho_0\)</span> is
+given by: <span class="math display">\[P(\tau | \theta) = \rho_0(s_0)
+\prod_i P(s_{i+1} ; s_i, a_i) \pi_{\theta}(a_i | s_i)\]</span> Taking
+logs, <span class="math display">\[\log P(\tau | \theta) = \log
+\rho_0(s_0) + \sum_i \left[ \log P(s_{i+1}; s_i, a_i) + \log
+\pi_{\theta}(a_i|s_i) \right]\]</span> So if we now take the gradient
+of this with respect to <span class="math inline">\(\theta\)</span>,
+the <span class="math inline">\(\log \rho_0(s_0)\)</span> term and all
+the <span class="math inline">\(\log P(s_{i+1}; s_i, a_i)\)</span>
+terms will vanish, since they do not depend on the parameter <span
+class="math inline">\(\theta\)</span>. Hence: <span
+class="math display">\[\nabla_\theta P(\tau | \theta) = P(\tau | \theta)
+\left( \sum_i \nabla_\theta \log \pi_{\theta}(a_i | s_i)
+\right)\]</span> Putting this together, <span
+class="math display">\[\begin{align*}
+\nabla_\theta J(\theta) &= \nabla_\theta E_{\tau \sim \pi_\theta}
+[R(\tau)]\\
+&= \nabla_\theta \int_\tau P(\tau | \theta) R(\tau)\\
+&= \int_\tau \nabla_\theta P(\tau | \theta) R(\tau)\\
+&= \int_\tau P(\tau | \theta) \left( \sum_i \nabla_\theta \log
+\pi_{\theta}(a_i | s_i) \right) R(\tau)\\
+&= E_{\tau \sim \pi_\theta} \left[ \left( \sum_i \nabla_\theta \log
+\pi_{\theta}(a_i | s_i) \right) R(\tau) \right]
+\end{align*}\]</span></p>
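+<p>A minimal Monte Carlo sketch of this estimator, assuming PyTorch and
+Gymnasium with the CartPole environment; the whole setup here is
+illustrative, not anything fixed above:</p>
+<pre><code>import gymnasium as gym
+import torch
+
+# Illustrative setup: a categorical softmax policy on CartPole, updated with a
+# single-trajectory estimate of the gradient derived above.
+env = gym.make("CartPole-v1")
+obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n
+policy = torch.nn.Linear(obs_dim, n_actions)          # logits of pi_theta(a | s)
+optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)
+
+def run_episode():
+    log_probs, total_return, done = [], 0.0, False
+    obs, _ = env.reset()
+    while not done:
+        dist = torch.distributions.Categorical(
+            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
+        action = dist.sample()
+        log_probs.append(dist.log_prob(action))
+        obs, reward, terminated, truncated, _ = env.step(action.item())
+        total_return += reward
+        done = terminated or truncated
+    return torch.stack(log_probs), total_return
+
+# The gradient of -(sum_i log pi(a_i|s_i)) * R(tau) is minus the estimator above,
+# so one SGD step on this loss is one step of stochastic gradient ascent on J.
+log_probs, R_tau = run_episode()
+loss = -log_probs.sum() * R_tau
+optimizer.zero_grad()
+loss.backward()
+optimizer.step()</code></pre>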
 <h2 id="approximate-value-functions-methods.">Approximate value
 functions methods.</h2>
 <p>One major drawback of Exact Policy Iteration is the fact that it is
@@ -346,7 +402,7 @@ <h2 id="approximate-value-functions-methods.">Approximate value
 way, prove that <span class="math inline">\(L_\pi(\pi_\theta)\)</span>
 agrees with <span class="math inline">\(J(\pi_\theta)\)</span> up to
 first order.</p>
-<p>There is a natural question to ask now.</p>
+<p>There is a natural question to ask now:</p>
 <blockquote>
 <p>We could simply optimize <span class="math inline">\(J\)</span> by
 substituting it with <span class="math inline">\(L\)</span>, and using
@@ -444,7 +500,19 @@ <h3 id="beyond-mixture-policies---trpo">Beyond mixture policies -
 class="math inline">\(C\)</span> as suggested above, the step sizes of
 the method end up being really small. Instead, one can consider a <a
 href="trust_region_type_methods">trust-region type method</a>:</p>
+<pre><code>while (not converged):
+    pi_new = argmax_pi L_{pi_old}(pi)
+             such that D_KL_max(pi_old, pi) <= delta
+    pi_old = pi_new</code></pre>
 </blockquote>
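+<p>As a concrete sketch of the acceptance test inside such a loop,
+suppose we already have, for a batch of sampled states, the old action
+probabilities <code>pi_old[s, a]</code>, a candidate
+<code>pi_new[s, a]</code>, and advantage estimates <code>adv[s, a]</code>
+under the old policy (all names illustrative):</p>
+<pre><code>import numpy as np
+
+def surrogate_gain(pi_new, pi_old, adv):
+    # Estimate of L_{pi_old}(pi_new) - L_{pi_old}(pi_old) over the sampled states.
+    return np.mean(np.sum((pi_new - pi_old) * adv, axis=1))
+
+def max_kl(pi_old, pi_new):
+    # D_KL_max(pi_old, pi_new), taking the max over the sampled states.
+    return np.max(np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1))
+
+def trust_region_step(pi_old, candidates, adv, delta):
+    # candidates: proposed policies, assumed ordered from boldest to most
+    # conservative; keep the first one that improves the surrogate while
+    # staying inside the KL ball.
+    for pi_new in candidates:
+        if surrogate_gain(pi_new, pi_old, adv) > 0 and max_kl(pi_old, pi_new) <= delta:
+            return pi_new
+    return pi_old   # no acceptable candidate: keep the current policy</code></pre>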
+<p>One final thing to note is that one rarely encounters TRPO in the
+form we have just described. The following form of the objective is
+used:</p>
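+<p>(A sketch of what is presumably meant here, namely the standard
+sample-based surrogate with importance weights and an average-KL
+constraint, where states and actions are drawn from the old policy:)</p>
+<p><span class="math display">\[\max_\theta \; E_{s, a \sim
+\pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a |
+s)}{\pi_{\theta_{\text{old}}}(a | s)} A_{\pi_{\theta_{\text{old}}}}(s,
+a) \right] \quad \text{s.t.} \quad E_s \left[ D_{KL}\left(
+\pi_{\theta_{\text{old}}}(\cdot | s) \,\|\, \pi_\theta(\cdot | s)
+\right) \right] \le \delta\]</span></p>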
+<h3 id="almost-trpo-ppo">Almost TRPO: PPO</h3>
+<h2 id="implementations-and-their-consequences.">Implementations and
+their consequences.</h2>
+<h3 id="generalized-advantage-estimation.">Generalized advantage
+estimation.</h3>
+<h3 id="grpo">GRPO</h3>
 <h1 id="references">References</h1>
 <p><a
 href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient">OpenAI