 <!DOCTYPE html>
-<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
 <head>
 <meta charset="utf-8" />
 <meta name="generator" content="pandoc" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
-<title>concepts</title>
+<meta name="author" content="Gergo Stomfai" />
+<meta name="dcterms.date" content="2025-05-02" />
+<title>My Document Title</title>
 <style>
 html {
+font-size: 12pt;
 color: #1a1a1a;
 background-color: #fdfdfd;
 }
 type="text/javascript"></script>
 </head>
 <body>
+<header id="title-block-header">
+<h1 class="title">My Document Title</h1>
+<p class="author">Gergo Stomfai</p>
+<p class="date">2025-05-02</p>
+</header>
 <nav id="TOC" role="doc-toc">
 <ul>
 <li><a href="#short-intro-to-rl" id="toc-short-intro-to-rl">Short intro
 <li><a href="#policy-optimization-methods."
 id="toc-policy-optimization-methods.">Policy optimization methods.</a>
 <ul>
-<li><a href="#approximate-value-function-methods."
-id="toc-approximate-value-function-methods.">Approximate Value function
-methods.</a></li>
-<li><a href="#simplest-policy-gradient"
-id="toc-simplest-policy-gradient">Simplest policy gradient</a></li>
+<li><a href="#exact-policy-iteration."
+id="toc-exact-policy-iteration.">Exact policy iteration.</a></li>
+<li><a href="#approximate-value-functions-methods."
+id="toc-approximate-value-functions-methods.">Approximate value
+function methods.</a>
+<ul>
 <li><a href="#mixture-policies." id="toc-mixture-policies.">Mixture
 policies.</a></li>
 <li><a href="#beyond-mixture-policies---trpo"
 id="toc-beyond-mixture-policies---trpo">Beyond mixture policies -
 TRPO</a></li>
 </ul></li>
+</ul></li>
 <li><a href="#references" id="toc-references">References</a></li>
 </ul>
 </nav>
@@ -289,99 +299,84 @@ <h1 id="policy-optimization-methods.">Policy optimization methods.</h1>
 <li>We want to be able to investigate the asymptotic behavior of the
 method.</li>
 </ol>
-<h2 id="approximate-value-function-methods.">Approximate Value function
-methods.</h2>
+<h2 id="exact-policy-iteration.">Exact policy iteration.</h2>
 <p>The basic idea is the following. If we have access to the exact value
 function, given a policy <span class="math inline">\(\pi\)</span>, we
 can compute <span class="math inline">\(Q_\pi(s, a)\)</span> explicitly
 for each <span class="math inline">\((s,a)\)</span> pair, and create a
 new deterministic policy <span class="math inline">\(\pi'\)</span>,
 such that <span class="math display">\[\pi'(a;s) = \text{$1$ iff $a
 = \arg \max_a Q_\pi(s,a)$, $0$ otherwise}\]</span></p>
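+<p>For concreteness, here is a minimal sketch of exact policy iteration
+on a small tabular MDP. The transition tensor <code>P[s, a, t]</code>,
+reward table <code>R[s, a]</code> and discount <code>gamma</code> are
+assumed to be given as NumPy arrays; the snippet is an illustration of
+the idea above, not a reference implementation.</p>
+<pre><code>import numpy as np
+
+def policy_evaluation(P, R, gamma, pi):
+    # P[s, a, t]: transition probabilities, R[s, a]: rewards, pi[s, a]: policy
+    S = R.shape[0]
+    P_pi = (pi[:, :, None] * P).sum(axis=1)   # state-to-state transitions under pi
+    r_pi = (pi * R).sum(axis=1)               # expected one-step reward under pi
+    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact Bellman solve
+    Q = R + gamma * (P @ V)                   # Q_pi(s, a) for every pair
+    return V, Q
+
+def exact_policy_iteration(P, R, gamma):
+    S, A = R.shape
+    pi = np.full((S, A), 1.0 / A)             # start from the uniform policy
+    while True:
+        V, Q = policy_evaluation(P, R, gamma, pi)
+        greedy = np.zeros_like(pi)
+        greedy[np.arange(S), Q.argmax(axis=1)] = 1.0   # 1 iff a maximizes Q_pi(s, a)
+        if np.allclose(greedy, pi):            # greedy policy unchanged: done
+            return pi, V
+        pi = greedy</code></pre>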
-<p>Instead, we can use an approximation of the value function to
-perform the same game. These methods are called <strong>Approximate
-Value Function Methods</strong>.</p>
-<h2 id="simplest-policy-gradient">Simplest policy gradient</h2>
-<p>To take the theory out for a spin, consider the following. Let <span
-class="math inline">\(\pi_\theta\)</span> be a parametric family of
-policies. Suppose that the initial state of the universe is determined
-by some initial distribution <span
-class="math inline">\(\rho_0\)</span>. We start by thinking about the
-distribution <span class="math inline">\(P(\tau | \pi_\theta)\)</span>,
-and especially about <span class="math inline">\(\nabla P(\tau |
-\pi_\theta)\)</span>. Intuitively, we have the following:</p>
+<h2 id="approximate-value-functions-methods.">Approximate value
+function methods.</h2>
+<p>One major drawback of exact policy iteration is that it is only
+guaranteed to improve the policy if the new policy has non-negative
+advantage in every state, which is very hard to guarantee in practice.
+Thus, instead of trying to optimize <span
+class="math inline">\(\eta(\pi)\)</span> directly, we are going to
+define a <em>surrogate</em> objective function. To do this, let us fix
+some notation first.</p>
+<p>Suppose that we have a fixed initial distribution, say <span
+class="math inline">\(\mu\)</span>; we will assume throughout that such
+a distribution exists without pointing it out every time. Then the
+<em>discounted visitation frequencies</em> are given by <span
+class="math display">\[\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) +
+\cdots\]</span> and the <em>policy advantage</em> <span
+class="math inline">\(\mathbb{A}_{\pi}(\pi')\)</span> of a policy
+<span class="math inline">\(\pi'\)</span> over another policy <span
+class="math inline">\(\pi\)</span> is defined as <span
+class="math display">\[\mathbb{A}_{\pi}(\pi') = E_{s \sim d_{\pi,
+\mu}} [E_{a \sim \pi'(a;s)} [A_\pi(s, a)]]\]</span> where <span
+class="math inline">\(d_{\pi, \mu} = (1 - \gamma) \rho_\pi\)</span> is
+the normalized visitation distribution. We claim that the function
+<span class="math inline">\(L_\pi(\pi_\theta) = J(\pi) +
+\mathbb{A}_{\pi} (\pi_\theta)\)</span> (as a function of <span
+class="math inline">\(\theta\)</span>) matches <span
+class="math inline">\(J(\pi_\theta)\)</span> up to first order around
+parameters <span class="math inline">\(\theta\)</span> for which <span
+class="math inline">\(\pi_\theta = \pi\)</span>. One way to prove this
+is to consider the expansion: <span
+class="math display">\[\begin{align*}
+J(\pi_\theta) &= J(\pi) + \sum_t \sum_s P(s_t = s | \pi_\theta)
+\sum_a \pi_\theta(a | s) \gamma^t A_\pi(s, a)\\
+&= \cdots\\
+&= J(\pi) + \sum_s \rho_{\pi_\theta}(s) \sum_a \pi_\theta(a|s)
+A_\pi(s,a)
+\end{align*}\]</span> Then take the derivative and compare it with the
+derivative of <span class="math inline">\(L\)</span>.</p>
+<p><strong>Exercise.</strong> Fill the gaps in the previous
+expansion.</p>
+<p><strong>Exercise.</strong> Using the outline above, or in any other
+way, prove that <span class="math inline">\(L_\pi(\pi_\theta)\)</span>
+agrees with <span class="math inline">\(J(\pi_\theta)\)</span> up to
+first order.</p>
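+<p>The claim can also be checked numerically on a small tabular MDP:
+compute the visitation frequencies and advantages, then compare
+finite-difference derivatives of <span class="math inline">\(J\)</span>
+and <span class="math inline">\(L_\pi\)</span> at a point where <span
+class="math inline">\(\pi_\theta = \pi\)</span>. The sketch below reuses
+<code>policy_evaluation</code> from the earlier snippet, weights the
+policy advantage with the unnormalized frequencies <span
+class="math inline">\(\rho_\pi\)</span>, and uses an ad hoc softmax
+parametrization; all of these are assumptions of the sketch rather than
+part of the notes.</p>
+<pre><code>import numpy as np
+
+def softmax_policy(theta):
+    z = np.exp(theta - theta.max(axis=1, keepdims=True))
+    return z / z.sum(axis=1, keepdims=True)
+
+def J(P, R, mu, gamma, pi):
+    V, _ = policy_evaluation(P, R, gamma, pi)
+    return mu @ V                          # expected return from the start distribution
+
+def rho(P, mu, gamma, pi):
+    # unnormalized discounted visitation frequencies rho_pi(s)
+    P_pi = (pi[:, :, None] * P).sum(axis=1)
+    return np.linalg.solve(np.eye(len(mu)) - gamma * P_pi.T, mu)
+
+def L_surrogate(P, R, mu, gamma, pi, pi_theta):
+    V, Q = policy_evaluation(P, R, gamma, pi)
+    A = Q - V[:, None]                     # advantages A_pi(s, a)
+    return J(P, R, mu, gamma, pi) + rho(P, mu, gamma, pi) @ (pi_theta * A).sum(axis=1)
+
+def first_order_check(P, R, mu, gamma, theta, eps=1e-5):
+    # directional finite differences of J and L_pi should agree at pi_theta = pi
+    d = np.random.default_rng(0).normal(size=theta.shape)
+    pi = softmax_policy(theta)
+    base = J(P, R, mu, gamma, pi)          # L_pi(pi) = J(pi), so one baseline suffices
+    dJ = (J(P, R, mu, gamma, softmax_policy(theta + eps * d)) - base) / eps
+    dL = (L_surrogate(P, R, mu, gamma, pi, softmax_policy(theta + eps * d)) - base) / eps
+    return np.isclose(dJ, dL, atol=1e-3)</code></pre>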
+<p>There is a natural question to ask now.</p>
 <blockquote>
-<p>If you think about a trajectory, there are two “components”: the
-user, and the universe ones. If we perturb <span
-class="math inline">\(\theta\)</span>, the effect of this should be
-expressible in terms of derivatives of the policy only, since the
-actions of the universe do not depend on <span
-class="math inline">\(\theta\)</span>.</p>
+<p>We could simply optimize <span class="math inline">\(J\)</span> by
+substituting it with <span class="math inline">\(L\)</span> and using
+some sort of gradient-based method. But how large should our step size
+be? Is there any way to use the fact that <span
+class="math inline">\(L\)</span> and <span
+class="math inline">\(J\)</span> are not just arbitrary functions, but
+have a fair amount of structure because they arise as RL
+objectives?</p>
 </blockquote>
-<p>Let us turn this into mathematics now. Note that <span
-class="math display">\[\nabla_\theta P(\tau | \pi_\theta) = P(\tau |
-\pi_\theta) \left( \nabla_\theta \log P(\tau | \pi_{\theta}) \right)
-\]</span> The probability of a trajectory <span
-class="math inline">\(\tau = (s_0, a_0, \ldots)\)</span> given the
-parameters <span class="math inline">\(\theta\)</span> is
-given by: <span class="math display">\[P(\tau | \theta) = \rho_0(s_0)
-\prod_i P(s_{i+1} ; s_i, a_i) \pi_{\theta}(a_i | s_i)\]</span> This way
-<span class="math display">\[\log P(\tau | \theta) = \log \rho_0 (s_0) +
-\sum_i \left( \log P(s_{i+1}; s_i, a_i) + \log \pi_{\theta}(a_i|s_i)
-\right)\]</span> So if
-now we take the derivative of this with respect to <span
-class="math inline">\(\theta\)</span>, all the <span
-class="math inline">\(P(s_{i+1}; s_i, a_i)\)</span> terms will vanish,
-since they are not dependent on the parameter <span
-class="math inline">\(\theta\)</span>.</p>
+<h3 id="mixture-policies.">Mixture policies.</h3>
 <p><em>The idea of this section was originally published <a
 href="https://dl.acm.org/doi/10.5555/645531.656005">here</a></em>.</p>
-<p>The main idea here is that instead of simply thinking about <span
-class="math inline">\(\eta_\rho(\pi)\)</span>, we investigate the
-expression <span class="math display">\[\eta_\mu(\pi) = E_{s \sim \mu}
-[V_\pi(s)]\]</span> for any <em>restart distribution</em> <span
-class="math inline">\(\mu\)</span>. While it is clear that a policy
-that is optimal for <span class="math inline">\(\rho\)</span> is also
-optimal for any other restart distribution (why?), this statement
-becomes false if we restrict to a subset of policies (e.g. the codomain
-of some parametric family of distributions).</p>
-<p>We saw before a naive way of updating policies, by simply replacing a
-policy with another one, hopefully constructed in such a way that it
-performs better in every situation. Instead, we can combine two policies
-in a more subtle way: <span
+<p>Let’s try to apply the previous idea in the simplest setting,
+i.e. along a line. <span
 class="math display">\[\pi_{\text{new}}^{\alpha}(a;s) = (1 - \alpha)
 \pi(a;s) + \alpha \pi'(a; s)\]</span> Where <span
-class="math inline">\(0 \leq \alpha \leq 1\)</span>. Our hope is that if
-the policy <span class="math inline">\(\pi'\)</span> is
-<em>sometimes better</em> than <span class="math inline">\(\pi\)</span>,
-then there is an <span class="math inline">\(\alpha\)</span> such that
-the resulting <span class="math inline">\(\pi_{\text{new}}\)</span> is a
-better strategy than <span class="math inline">\(\pi\)</span>.</p>
-<p>To understand this idea more deeply, we define the <strong>policy
-advantage</strong> <span class="math inline">\(\mathbb{A}_{\pi,
-\mu}(\pi')\)</span> of a policy <span
-class="math inline">\(\pi'\)</span> over another policy <span
-class="math inline">\(\pi\)</span> as <span
-class="math display">\[\mathbb{A}_{\pi, \mu}(\pi') = E_{s \sim
-d_{\pi, \mu}} [E_{a \sim \pi'(a;s)} [A_\pi(s, a)]]\]</span> This
-way, using the simple gradient result: <span
-class="math display">\[\frac{\partial \eta_\mu}{\partial \alpha}
-\bigg|_{\alpha = 0} = \frac{1}{1-\gamma} \mathbb{A}_{\pi,
-\mu}(\pi')\]</span> Thus a positive advantage implies the existence
-of a sufficiently small <span class="math inline">\(\alpha\)</span> such
-that the policy <span class="math inline">\(\pi_{\text{new}}\)</span> is
-better than <span class="math inline">\(\pi\)</span>. We summarize this
-in the following statement:</p>
-<p><strong>Theorem.</strong> Given policies <span
-class="math inline">\(\pi\)</span> and <span
-class="math inline">\(\pi'\)</span> and a starting distribution
-<span class="math inline">\(\mu\)</span>, there is <span
-class="math inline">\(0 < \alpha \leq 1\)</span> such that the policy
-<span class="math inline">\(\pi_{\text{new}}^\alpha\)</span> is better
-than <span class="math inline">\(\pi\)</span> with the starting
-distribution <span class="math inline">\(\mu\)</span> <em>iff</em> <span
-class="math inline">\(\mathbb{A}_{\pi, \mu}(\pi') >
-0\)</span>.</p>
+class="math inline">\(0 \leq \alpha \leq 1\)</span>. We have:</p>
+<p><span class="math display">\[\begin{align*}
+J(\pi_\text{new}^\alpha) &= L_\pi(\pi_\text{new}^\alpha) + O(\alpha^2)\\
+&= J(\pi) + \mathbb{A}_{\pi}((1- \alpha) \pi + \alpha \pi') +
+O(\alpha^2)\\
+&= J(\pi) + \alpha \mathbb{A}_{\pi}(\pi') + O(\alpha^2)
+\end{align*}\]</span> That is: <span
+class="math display">\[\frac{\partial J}{\partial \alpha} \bigg|_{\alpha
+= 0} = \frac{1}{1-\gamma} \mathbb{A}_{\pi}(\pi')\]</span> Thus
+a positive advantage implies the existence of a sufficiently small <span
+class="math inline">\(\alpha\)</span> such that the policy <span
+class="math inline">\(\pi_{\text{new}}^\alpha\)</span> is better than <span
+class="math inline">\(\pi\)</span>.</p>
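+<p>In code the mixture step is a one-liner, and the derivative claim can
+be sanity-checked by a finite difference in <span
+class="math inline">\(\alpha\)</span> (a sketch reusing the hypothetical
+<code>J</code> helper from the earlier snippet):</p>
+<pre><code>def mixture_policy(pi, pi_prime, alpha):
+    # pi_new^alpha(a; s) = (1 - alpha) pi(a; s) + alpha pi'(a; s); rows stay normalized
+    return (1.0 - alpha) * pi + alpha * pi_prime
+
+def dJ_dalpha_at_zero(P, R, mu, gamma, pi, pi_prime, eps=1e-5):
+    # finite-difference estimate of dJ/dalpha at alpha = 0
+    j0 = J(P, R, mu, gamma, pi)
+    j1 = J(P, R, mu, gamma, mixture_policy(pi, pi_prime, eps))
+    return (j1 - j0) / eps</code></pre>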
 <p><strong>Theorem.</strong> Let <span
 class="math inline">\(\mathbb{A}\)</span> be the policy advantage of
 <span class="math inline">\(\pi'\)</span> with respect to <span
@@ -393,8 +388,8 @@ <h2 id="mixture-policies.">Mixture policies.</h2>
 class="math display">\[\eta_\mu(\pi_{new}) - \eta_{\mu}(\pi) \geq
 \frac{\alpha}{1 - \gamma} \left( \mathbb{A} - \frac{2 \alpha \gamma
 \varepsilon}{1- \gamma(1 - \alpha)} \right)\]</span></p>
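+<p>One practical use of this bound is to pick the mixing coefficient:
+evaluate the right-hand side over a grid of <span
+class="math inline">\(\alpha\)</span> values and keep the maximizer. In
+the sketch below, <code>A_adv</code> and <code>eps</code> stand for the
+<span class="math inline">\(\mathbb{A}\)</span> and <span
+class="math inline">\(\varepsilon\)</span> of the theorem and are
+assumed to have been estimated elsewhere.</p>
+<pre><code>import numpy as np
+
+def best_mixture_alpha(A_adv, eps, gamma, grid_size=1000):
+    # evaluate the lower bound on eta(pi_new) - eta(pi) over a grid of alpha values
+    alphas = np.linspace(1e-6, 1.0, grid_size)
+    penalty = 2.0 * alphas * gamma * eps / (1.0 - gamma * (1.0 - alphas))
+    bound = (alphas / (1.0 - gamma)) * (A_adv - penalty)
+    i = int(np.argmax(bound))
+    return alphas[i], bound[i]             # step size and its guaranteed improvement</code></pre>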
-<h2 id="beyond-mixture-policies---trpo">Beyond mixture policies -
-TRPO</h2>
+<h3 id="beyond-mixture-policies---trpo">Beyond mixture policies -
+TRPO</h3>
 <p><em>The idea of this section was originally published <a
 href="https://arxiv.org/abs/1502.05477">here</a></em>.</p>
 <p>Fundamentally, we are motivated by the following thought, that
@@ -423,7 +418,33 @@ <h2 id="beyond-mixture-policies---trpo">Beyond mixture policies -
 D_{TV}^{\text{max}}(\pi_{\text{old}}, \pi_{\text{new}})\)</span>. Then:
 <span class="math display">\[\eta(\pi_{\text{new}}) \geq
 L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4 \varepsilon \gamma}{(1
-- \gamma)^2}\]</span></p>
+- \gamma)^2} \alpha^2\]</span></p>
+<p>Now, in practice computing the total variation divergence is not
+always feasible. It turns out, though, that it relates to the
+KL-divergence in a very useful way: <span
+class="math display">\[D_{\text{TV}}(p || q)^2 \leq D_{\text{KL}}(p
+|| q)\]</span> We therefore set <span
+class="math display">\[D_{\text{KL}}^{\text{max}} (\pi, \tilde{\pi}) =
+\max_s D_{\text{KL}}(\pi(\cdot | s) || \tilde{\pi}(\cdot | s))\]</span>
+which yields the following form of the above theorem: <span
+class="math display">\[\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) -
+\frac{4 \varepsilon \gamma}{(1 - \gamma)^2} D_{\text{KL}}^{\text{max}}(
+\pi, \tilde{\pi})\]</span></p>
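+<p>For tabular policies, <span
+class="math inline">\(D_{\text{KL}}^{\text{max}}\)</span> can be
+computed directly; a small sketch, assuming the policies are stored as
+arrays <code>pi[s, a]</code> with strictly positive entries:</p>
+<pre><code>import numpy as np
+
+def d_kl_max(pi, pi_tilde):
+    # max over states s of KL( pi(. | s) || pi_tilde(. | s) )
+    kl_per_state = (pi * (np.log(pi) - np.log(pi_tilde))).sum(axis=1)
+    return kl_per_state.max()</code></pre>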
+<p>An interesting remark is worth making here.</p>
+<blockquote>
+<p>We could consider the policy optimization iteration given by the
+pseudocode:</p>
+<pre><code>while (not converged):
+    pi_new = argmax_pi ( L_{pi_old}(pi) - C * D_KL_max(pi_old, pi) )
+    pi_old = pi_new</code></pre>
+<p>This iteration can easily be shown to improve the policy
+monotonically in every iteration. It is interesting to note the
+similarity between this method and <a
+href="proximal_methods">proximal methods</a> in optimization theory. It
+turns out that if one simply optimizes this objective function with the
+constant <span class="math inline">\(C\)</span> as suggested above, the
+step sizes of the method end up being very small. Instead, one can
+consider a <a href="trust_region_type_methods">trust-region type
+method</a>:</p>
+</blockquote>
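+<p>Schematically, such a trust-region iteration replaces the penalty
+term with a hard constraint (the radius <code>delta</code> below is a
+hyperparameter of this sketch, not a quantity fixed by the theory
+above):</p>
+<pre><code>while (not converged):
+    pi_new = argmax_pi  L_{pi_old}(pi)
+             subject to  D_KL_max(pi_old, pi) <= delta
+    pi_old = pi_new</code></pre>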
 <h1 id="references">References</h1>
 <p><a
 href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient">OpenAI