<title>My Document Title</title>
<style>
html {
  font-size: 12pt;
  color: #1a1a1a;
  background-color: #fdfdfd;
}
<h1 class="title">My Document Title</h1>
180179< p class ="date "> 2025-05-02</ p >
181180</ header >
182181< nav id ="TOC " role ="doc-toc ">
<ul>
<li><a href="#short-intro-to-rl" id="toc-short-intro-to-rl">Short intro
to RL</a>
<ul>
<li><a href="#reward-and-return" id="toc-reward-and-return">Reward and
Return</a></li>
<li><a href="#the-goal" id="toc-the-goal">The Goal</a></li>
<li><a href="#a-few-useful-quantities"
id="toc-a-few-useful-quantities">A few useful quantities</a></li>
<li><a href="#temporal-difference-learning."
id="toc-temporal-difference-learning.">Temporal difference
learning.</a></li>
</ul></li>
<li><a href="#policy-optimization-methods."
id="toc-policy-optimization-methods.">Policy optimization methods.</a>
<ul>
<li><a href="#some-warmup." id="toc-some-warmup.">Some warmup.</a>
<ul>
<li><a href="#exact-policy-iteration."
id="toc-exact-policy-iteration.">Exact policy iteration.</a></li>
<li><a href="#policy-gradient." id="toc-policy-gradient.">Policy
gradient.</a></li>
</ul></li>
<li><a href="#approximate-value-functions-methods."
id="toc-approximate-value-functions-methods.">Approximate value
functions methods.</a>
<ul>
<li><a href="#mixture-policies." id="toc-mixture-policies.">Mixture
policies.</a></li>
<li><a href="#beyond-mixture-policies---trpo"
id="toc-beyond-mixture-policies---trpo">Beyond mixture policies -
TRPO</a></li>
<li><a href="#almost-trpo-ppo" id="toc-almost-trpo-ppo">Almost TRPO:
PPO</a></li>
</ul></li>
<li><a href="#implementations-and-their-consequences."
id="toc-implementations-and-their-consequences.">Implementations and
their consequences.</a>
<ul>
<li><a href="#generalized-advantage-estimation."
id="toc-generalized-advantage-estimation.">Generalized advantage
estimation.</a></li>
<li><a href="#grpo" id="toc-grpo">GRPO</a></li>
</ul></li>
</ul></li>
<li><a href="#references" id="toc-references">References</a></li>
</ul>
</nav>
<p><em>A few preliminary remarks.</em> This page is intended to be a
work in progress, where I collect resources, derivations, ideas etc.
that I found useful when diving into the world of RL. In particular, it
is not a hand-holding, explain-everything type of introduction, but
rather one that a person with some mathematical affinity (not much is
needed) should be able to follow. As such, not everything is discussed
in detail.</p>
<p>One analogy I find genuinely helpful for thinking about RL,
especially within the context of this note, is the relationship between
Differential Geometry and Riemannian Geometry: fundamentally, the ideas
often come from the classical optimization literature, but they take on
a different flavour due to the highly structured nature of the functions
we are trying to optimize. I will try my best to give references to
optimization material whenever it is appropriate, to connect these
fields.</p>
234197< h1 id ="short-intro-to-rl "> Short intro to RL</ h1 >
235198< p > < em > This section is inspired by < a
236199href ="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient "> 1</ a > .</ em > </ p >
<p>Finally, the <em>advantage</em> measures how much better it is to
take a given action with respect to a policy <span
class="math inline">\(\pi\)</span>: <span
class="math display">\[A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)\]</span></p>
266+ < h3 id ="temporal-difference-learning. "> Temporal difference
267+ learning.</ h3 >
268+ < p > In practice the value function is not known, and can be complicated
269+ to compute explicitly. One way though, in which one can try to deduce it
270+ is the following. A self-consistency relation of < span
271+ class ="math inline "> \(V\)</ span > is the following: < span
272+ class ="math display "> \[V_\pi(s) = E_{a \sim \pi(\cdot | s) } \left[ r(s,
273+ a) + V(s') \right]\]</ span > Thus given a guess < span
274+ class ="math inline "> \(V\)</ span > for the value function, one use can
275+ tweak < span class ="math inline "> \(V\)</ span > to promote
276+ self-consistency:</ p >
<pre><code>Initialize V randomly

while not converged:
    Pick a state s randomly
    Pick an action a according to pi, observe r(s, a) and s&#39;
    V(s) := (1 - alpha) V(s) + alpha (r(s, a) + gamma V(s&#39;))</code></pre>
<p>We achieve perfect self-consistency if the <em>TD-residual of V with
discount <span class="math inline">\(\gamma\)</span></em>, given by
<span class="math display">\[\delta_{a}^{V_{\pi, \gamma}}(s, s') =
r(s, a) + \gamma V_{\pi}(s') - V_{\pi}(s)\]</span> satisfies <span
class="math inline">\(E_{a \sim \pi(\cdot|s)} [\delta_{a}^{V_{\pi,
\gamma}}(s, s')] = 0 \quad (\dagger)\)</span>.</p>
<p>The TD-residual has another application, which we will see in the
section on <a href="#generalized-advantage-estimation.">GAE</a>. But to
give you a foretaste, let’s fix <span class="math inline">\(s\)</span>
and <span class="math inline">\(a\)</span> in the definition of the
residual, and take the expectation with respect to <span
class="math inline">\(s'\)</span>. We get: <span
class="math display">\[\mathbb{E}_{s'}[\delta_a^{V_{\pi, \gamma}}] =
A_{\pi, \gamma}(s, a)\]</span> Without giving too much away, this idea
can be iterated to reduce the variance of this estimation.</p>
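<p>To make the update above concrete, here is a minimal tabular TD(0)
sketch in Python. The environment interface (<code>env.reset</code>,
<code>env.step</code>) and the <code>policy</code> callable are
illustrative assumptions, not something defined in this note:</p>
<pre><code>from collections import defaultdict

def td0(env, policy, gamma=0.99, alpha=0.1, episodes=1000):
    """Tabular TD(0): nudge V towards self-consistency along sampled transitions."""
    V = defaultdict(float)                 # guess for the value function
    for _ in range(episodes):
        s, done = env.reset(), False       # assumed interface: reset() returns a state
        while not done:
            a = policy(s)                  # a ~ pi(. | s)
            s_next, r, done = env.step(a)  # observe r(s, a) and s&#39;
            target = r + gamma * V[s_next] * (0.0 if done else 1.0)
            V[s] = (1 - alpha) * V[s] + alpha * target
            s = s_next
    return V</code></pre>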
303298< h1 id ="policy-optimization-methods. "> Policy optimization methods.</ h1 >
304299< p > The methods we are going to encounter are all based on iteratively
305300improving a policy that we currently have to a “better one”. To make the
&= E_{\tau \sim \pi_\theta} [\left( \sum_i \nabla_\theta \log
\pi_{\theta}(a_i | s_i) \right) R(\tau)]
\end{align*}\]</span></p>
<p>Set <span class="math inline">\(R_i(\tau) = \sum_{j \geq i}
r_j\)</span>. Then, <span class="math inline">\(R(\tau) = \sum_{j <
i} r_j + R_i(\tau)\)</span>, and consequently: <span
class="math display">\[\begin{align*}
E_{\tau \sim \pi_\theta} [\left( \sum_i \nabla_\theta \log
\pi_{\theta}(a_i | s_i) \right) R(\tau)] &= \sum_i E_{\tau \sim
\pi_\theta}[\left(\nabla_\theta \log \pi_{\theta}(a_i | s_i) \right)
R(\tau)]\\
&= \sum_i E_{\tau \sim \pi_\theta}[\left(\nabla_\theta \log
\pi_{\theta}(a_i | s_i) \right) (R_i(\tau) + \sum_{j < i} r_j)]\\
&= \sum_i E_{\tau \sim \pi_\theta}[\left(\nabla_\theta \log
\pi_{\theta}(a_i | s_i) \right) R_i(\tau)]
\end{align*}\]</span></p>
<p>Here the last equality follows from the fact that the rewards <span
class="math inline">\(r_j\)</span> for <span class="math inline">\(j
< i\)</span> do not depend on the action <span
class="math inline">\(a_i\)</span>, so those terms vanish in
expectation. This gives the <strong>past independent form</strong>
of the policy gradient: <span
class="math display">\[\boxed{\nabla_\theta J(\theta) = E_{\tau \sim
\pi_\theta} \left[ \sum_i \nabla_\theta \log \pi_{\theta}(a_i | s_i)
\, R_i(\tau) \right]}\]</span></p>
<p>Another useful trick is to notice that for <span
class="math inline">\(s\)</span> fixed, <span
class="math inline">\(\pi_\theta(\cdot|s)\)</span> becomes a
parameterized distribution on <span class="math inline">\(A\)</span>. It
is a general result that <span class="math inline">\(E_{a \sim
\pi_\theta(\cdot|s)} [\nabla_\theta \log \pi_\theta(a|s)] = 0\)</span>.
Upon multiplying this with a quantity <span
class="math inline">\(b(s)\)</span> that depends only on the state, we
obtain that <span class="math display">\[E_{a \sim \pi_\theta(\cdot|s)}
[\nabla_\theta \log \pi_\theta(a|s) b(s)] = 0\]</span> We can also write
<span class="math display">\[\begin{align*}
E_{\tau \sim \pi_{\theta}}[\nabla_\theta \log \pi_\theta(a_i|s_i)
b(s_i)] &= \sum_{s, a} P(s_i = s, a_i = a) \nabla_\theta \log
\pi_\theta(a|s) b(s)\\
&= \sum_s P(s_i = s) \left[ \sum_a P(a_i = a | s_i = s)
\nabla_\theta \log \pi_\theta(a|s) b(s) \right]\\
&= \sum_s P(s_i = s) \left[ E_{a \sim \pi_\theta(\cdot|s)}
[\nabla_\theta \log \pi_\theta(a|s) b(s)] \right] = 0
\end{align*}\]</span></p>
<p>Using this claim, we can derive another unbiased estimator for the
policy gradient, valid for any <em>baseline</em> <span
class="math inline">\(b\)</span>:</p>
<p><span class="math display">\[\boxed{\nabla_\theta J(\theta) = E_{\tau
\sim \pi_\theta} \left[ \sum_i \nabla_\theta \log \pi_{\theta}(a_i |
s_i) \left( R_i(\tau) - b(s_i) \right) \right]} \]</span></p>
<p>The main difference between these estimators is their variance: a
well-chosen baseline can reduce it considerably. While it is not
instrumental to know all these derivations inside out, it is useful to
be aware of the different forms of the policy gradient.</p>
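<p>As a concrete illustration, here is a short NumPy sketch of the
past-independent estimator with a baseline, for a tabular softmax policy
(for which the gradient of <code>log pi(a|s)</code> with respect to the
row <code>theta[s]</code> is <code>e_a - pi(.|s)</code>). The trajectory
format and the baseline array are illustrative assumptions:</p>
<pre><code>import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def policy_gradient_estimate(theta, trajectories, b, gamma=1.0):
    """Monte Carlo estimate of the boxed policy gradient.

    theta:        (n_states, n_actions) logits of a tabular softmax policy.
    trajectories: list of (states, actions, rewards) tuples.
    b:            baseline, an array of shape (n_states,).
    """
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        T = len(rewards)
        R = np.zeros(T)          # reward-to-go: R_i = sum_{j >= i} gamma^(j-i) r_j
        running = 0.0
        for i in reversed(range(T)):
            running = rewards[i] + gamma * running
            R[i] = running
        for i in range(T):
            s, a = states[i], actions[i]
            grad_log = -softmax(theta[s])
            grad_log[a] += 1.0   # grad of log pi(a|s) w.r.t. theta[s]
            grad[s] += grad_log * (R[i] - b[s])
    return grad / len(trajectories)</code></pre>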
366410< h2 id ="approximate-value-functions-methods. "> Approximate value
367411functions methods.</ h2 >
368412< p > One major drawback of Exact Policy Iteration is the fact that it is
\varepsilon}{1- \gamma(1 - \alpha)} \right)\]</span></p>
<h3 id="beyond-mixture-policies---trpo">Beyond mixture policies -
TRPO</h3>
<p><em>The idea of this section was originally published in <a
href="https://arxiv.org/abs/1502.05477">Schulman et al. 2015</a>.</em></p>
<p>Fundamentally, we are motivated by the following thought, that
extends the result of the previous section:</p>
<blockquote>
pi_new = argmax_pi L_{pi_old}(pi)
         such that D_KL^max(pi_old, pi) <= delta</code></pre>
</blockquote>
<p><strong>Implementation.</strong> We have enough machinery now to
think about how we might want to implement this method. We need to come
up with a way to approximate <span
class="math display">\[L_{\theta_{\text{old}}}(\theta) = \sum_s
\rho_{\text{old}}(s) \sum_a \pi_\theta(a | s)
A_{\theta_{\text{old}}}(s,a)\]</span></p>
<p>The first and most obvious thing that needs to happen is
approximating <span class="math inline">\(\sum_s
\rho_{\text{old}}(s)[\cdots]\)</span> with a Monte Carlo estimate based
on our data.</p>
<p>Then, we need a way to estimate the advantage <span
class="math inline">\(A_{\theta_{\text{old}}}\)</span>. Though there are
several ways to do this, for now we are simply going to stick to a
simple approximation: <span
class="math display">\[\hat{A}_{\theta_{\text{old}}} =
Q_{\theta_{\text{old}}}\]</span> which is <strong>only true up to an
additive constant</strong> (namely <span
class="math inline">\(-V_{\theta_{\text{old}}}(s)\)</span>, which does
not depend on the action), but that is good enough for our purposes.
For another method see the section on <a
href="#generalized-advantage-estimation.">GAE</a>.</p>
<p>Finally, we replace the sum over actions with an importance-sampling
estimate: actions are drawn from a sampling distribution <span
class="math inline">\(q\)</span> and reweighted accordingly.</p>
<p>These three steps together give the empirically computable version of
<span class="math inline">\(L\)</span>: <span
class="math display">\[L_{\theta_{\text{old}}}(\theta) = \mathbb{E}_{s \sim
\rho_{\text{old}}, a \sim q} \left[ \frac{\pi_\theta(a|s)}{q(a|s)}
Q_{\theta_{\text{old}}}(s,a) \right]\]</span></p>
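<p>As a sketch, the resulting estimator is just an importance-weighted
average over the collected samples; the array names below are
illustrative assumptions about how the rollout data is stored:</p>
<pre><code>import numpy as np

def surrogate_estimate(pi_theta_probs, q_probs, Q_old):
    """Monte Carlo estimate of L_{theta_old}(theta) from rollout data.

    pi_theta_probs: pi_theta(a_k | s_k) for each sampled pair (s_k, a_k).
    q_probs:        q(a_k | s_k), the sampling distribution at collection time.
    Q_old:          Q_{theta_old}(s_k, a_k) estimates (e.g. empirical returns).
    """
    ratios = pi_theta_probs / q_probs   # importance weights replace the sum over actions
    return np.mean(ratios * Q_old)</code></pre>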
<p><strong>Some practical considerations.</strong> Another way of
approximating <span
class="math inline">\(L_{\theta_{\text{old}}}\)</span> can be found in
the article <a href="asd">asd</a>.</p>
510582< h3 id ="almost-trpo-ppo "> Almost TRPO: PPO</ h3 >
511- < h2 id ="implementations-and-their-consequences. "> Implementations and
512- their consequences.</ h2 >
<p><em>See <a href="https://arxiv.org/abs/1707.06347">Schulman et
al. 2017</a>.</em></p>
<p>The conclusion of the path-sampling strategy in the <a
href="#beyond-mixture-policies---trpo">previous section</a> was
optimizing the objective (we omit <span
class="math inline">\(\theta_{\text{old}}\)</span> since it is clear
from the notation what we mean) <span class="math display">\[L(\theta) =
\mathbb{E} \left[ \frac{\pi_\theta(a_t |
s_t)}{\pi_{\theta_{\text{old}}}(a_t| s_t)} \hat{A}_t \right]\]</span>
For ease of notation, we denote the quantity <span
class="math inline">\(\frac{\pi_\theta(a_t |
s_t)}{\pi_{\theta_{\text{old}}}(a_t| s_t)}\)</span> by <span
class="math inline">\(r_t(\theta)\)</span>. Intuitively, if <span
class="math inline">\(r_t(\theta)\)</span> is far from <span
class="math inline">\(1\)</span>, the updates can be jerky and large.
This is undesirable for many reasons, which we won’t go into in too much
detail, but here is an illustrative tale from the world of
optimization.</p>
<blockquote>
<p>tale from opti</p>
</blockquote>
<p>One way to avoid such large updates is to disincentivise the model
from making large changes in the policy. As you can (and should) check,
the following objective does exactly that:</p>
<p><span class="math display">\[L(\theta) = \mathbb{E} \left\{ \min
\left[ r_t(\theta) \hat{A}_t, \operatorname{clip}(r_t(\theta), 1 -
\epsilon, 1 + \epsilon) \hat{A}_t \right] \right\}\]</span></p>
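<p>A minimal sketch of this clipped objective on a batch of data,
assuming the log-probabilities and advantage estimates have already been
computed from rollouts (all names are illustrative):</p>
<pre><code>import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, epsilon=0.2):
    """Clipped surrogate objective; the returned value is to be maximized."""
    r = np.exp(logp_new - logp_old)                       # r_t(theta)
    unclipped = r * adv
    clipped = np.clip(r, 1 - epsilon, 1 + epsilon) * adv  # caps the incentive
    return np.mean(np.minimum(unclipped, clipped))</code></pre>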
513609< h3 id ="generalized-advantage-estimation. "> Generalized advantage
514610estimation.</ h3 >
<p><em>This section is based on <a
href="https://arxiv.org/abs/1506.02438">Schulman et
al. 2015</a>.</em></p>
<p>We have already seen the need for a way to approximate the advantage
<span class="math inline">\(A_t(a)\)</span> of an action compared to
some reference policy. In the case of TRPO we chose one of the easiest
routes. We now describe a more elaborate one.</p>
<p>As promised in the section on <a
href="#temporal-difference-learning.">Temporal Difference Learning</a>,
the core comes from the fact that <span
class="math inline">\(\mathbb{E}_{s'}[\delta_a^{V_{\pi, \gamma}}] =
A_{\pi, \gamma}(s, a)\)</span>. In fact, we consider the estimates <span
class="math inline">\(\hat{A}_t^{(i)}\)</span> given by: <span
class="math display">\[\hat{A}_t^{(i)} = \sum_{l = 0}^{i - 1} \gamma^l
\delta_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{i-1}
r_{t+i-1} + \gamma^i V(s_{t+i})\]</span></p>
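<p>Following the cited paper, an exponentially weighted average of these
estimates with a parameter <span class="math inline">\(\lambda\)</span>
gives GAE(<span class="math inline">\(\gamma, \lambda\)</span>), which
can be computed in one backward pass over a rollout. A minimal sketch,
assuming arrays of rewards and value predictions from a single
trajectory:</p>
<pre><code>import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) advantages from one rollout.

    rewards: array of length T.
    values:  array of length T + 1 (includes the value of the final state).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # weighted sum of residuals
        adv[t] = running
    return adv</code></pre>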
515624< h3 id ="grpo "> GRPO</ h3 >
516- < h1 id ="references "> References</ h1 >
<p><em>This section is based on <a
href="https://arxiv.org/abs/2402.03300">Shao et al. 2024</a>.</em></p>
<p>The fundamental idea here can be explained as follows.</p>
<blockquote>
<p>In PPO (and TRPO for that matter), we need a way to approximate the
advantage <span class="math inline">\(A_\pi(s, a)\)</span>. We could
simply use <span class="math inline">\(Q_\pi(s,a)\)</span>, but this
usually has high variance. We could instead use GAE, but then one also
needs to train a value model <span class="math inline">\(V\)</span>,
which can be similar in size and complexity to the main policy
model.</p>
</blockquote>
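<p>GRPO sidesteps the value model by sampling a group of completions per
prompt and normalizing their rewards within the group. A minimal sketch
of this group-relative advantage, under the assumption that each sample
receives a single scalar reward:</p>
<pre><code>import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: array of shape (n_groups, group_size), one scalar per sample."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)   # each sample scored against its own group</code></pre>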
<h1 id="references">References</h1>
<p><a
href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient">OpenAI
Spinning up</a></p>
<p><a href="https://dl.acm.org/doi/10.5555/645531.656005">Kakade, S.,
& Langford, J. (2002). Approximately Optimal Approximate Reinforcement
Learning. In Proceedings of the Nineteenth International Conference on
Machine Learning (ICML ’02). Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 267–274.</a></p>
525- < p > < a href ="https://dl.acm.org/doi/10.5555/645531.656005 "> Schulman, J.,
526- Levine, S., Abbeel, P., Jordan, M.I., & Moritz, P. (2015). Trust
527- Region Policy Optimization. ArXiv, abs/1502.05477.</ a > </ p >
646+ < p > < a href ="https://arxiv.org/abs/1502.05477 "> Schulman, J., Levine, S.,
647+ Abbeel, P., Jordan, M.I., & Moritz, P. (2015). Trust Region Policy
648+ Optimization. ArXiv, abs/1502.05477.</ a > </ p >
649+ < p > < a href ="https://arxiv.org/abs/1506.02438 "> Schulman, J., Moritz, P.,
650+ Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional
651+ continuous control using generalized advantage estimation. arXiv
652+ preprint arXiv:1506.02438.</ a > </ p >
</body>
</html>