
Commit 9e8cc05

rl
1 parent 48f72d4 commit 9e8cc05

1 file changed

Lines changed: 111 additions & 90 deletions

File tree

rl.html

@@ -1,12 +1,15 @@
 <!DOCTYPE html>
-<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
 <head>
 <meta charset="utf-8" />
 <meta name="generator" content="pandoc" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
-<title>concepts</title>
+<meta name="author" content="Gergo Stomfai" />
+<meta name="dcterms.date" content="2025-05-02" />
+<title>My Document Title</title>
 <style>
 html {
+font-size: 12pt;
 color: #1a1a1a;
 background-color: #fdfdfd;
 }
@@ -171,6 +174,11 @@
 type="text/javascript"></script>
 </head>
 <body>
+<header id="title-block-header">
+<h1 class="title">My Document Title</h1>
+<p class="author">Gergo Stomfai</p>
+<p class="date">2025-05-02</p>
+</header>
 <nav id="TOC" role="doc-toc">
 <ul>
 <li><a href="#short-intro-to-rl" id="toc-short-intro-to-rl">Short intro
@@ -185,17 +193,19 @@
 <li><a href="#policy-optimization-methods."
 id="toc-policy-optimization-methods.">Policy optimization methods.</a>
 <ul>
-<li><a href="#approximate-value-function-methods."
-id="toc-approximate-value-function-methods.">Approximate Value function
-methods.</a></li>
-<li><a href="#simplest-policy-gradient"
-id="toc-simplest-policy-gradient">Simplest policy gradient</a></li>
+<li><a href="#exact-policy-iteration."
+id="toc-exact-policy-iteration.">Exact policy iteration.</a></li>
+<li><a href="#approximate-value-functions-methods."
+id="toc-approximate-value-functions-methods.">Approximate value
+function methods.</a>
+<ul>
 <li><a href="#mixture-policies." id="toc-mixture-policies.">Mixture
 policies.</a></li>
 <li><a href="#beyond-mixture-policies---trpo"
 id="toc-beyond-mixture-policies---trpo">Beyond mixture policies -
 TRPO</a></li>
 </ul></li>
+</ul></li>
 <li><a href="#references" id="toc-references">References</a></li>
 </ul>
 </nav>
@@ -289,99 +299,84 @@ <h1 id="policy-optimization-methods.">Policy optimization methods.</h1>
 <li>We want to be able to investigate the asymptotic behavior of the
 method.</li>
 </ol>
-<h2 id="approximate-value-function-methods.">Approximate Value function
-methods.</h2>
+<h2 id="exact-policy-iteration.">Exact policy iteration.</h2>
 <p>The basic idea is the following. If we have access to the exact value
 function, given a policy <span class="math inline">\(\pi\)</span>, we
 can compute <span class="math inline">\(Q_\pi(s, a)\)</span> explicitly
 for each <span class="math inline">\((s,a)\)</span> pair, and create a
 new deterministic policy <span class="math inline">\(\pi&#39;\)</span>,
 such that <span class="math display">\[\pi&#39;(a;s) = \text{$1$ iff $a
 = \arg \max_a Q_\pi(s,a)$, $0$ otherwise}\]</span></p>
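To make the greedy improvement step concrete, here is a minimal tabular sketch (illustrative only, not part of the committed rl.html). It assumes the exact action values are available as an |S| x |A| NumPy array `Q_pi`; all names are made up for the example.

```python
import numpy as np

def greedy_improvement(Q_pi: np.ndarray) -> np.ndarray:
    """Given exact Q_pi(s, a) as an |S| x |A| array, return the deterministic
    policy pi'(a; s) that puts probability 1 on argmax_a Q_pi(s, a)."""
    n_states, n_actions = Q_pi.shape
    pi_new = np.zeros((n_states, n_actions))
    best_actions = Q_pi.argmax(axis=1)              # argmax_a Q_pi(s, a) for every s
    pi_new[np.arange(n_states), best_actions] = 1.0
    return pi_new

# Toy usage: 3 states, 2 actions (ties are resolved by argmax, i.e. the first action).
Q_pi = np.array([[1.0, 0.5],
                 [0.2, 0.9],
                 [0.0, 0.0]])
print(greedy_improvement(Q_pi))
```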
-<p>Instead, we can use an approximation of the the value function to
-perform the same game. These methods are called <strong>Approximate
-Value Function Methods</strong>.</p>
-<h2 id="simplest-policy-gradient">Simplest policy gradient</h2>
-<p>To take the theory out for a spin, consider the following. Let <span
-class="math inline">\(\pi_\theta\)</span> be a parametric family of
-policies. Suppose that the initial state of the universe is determined
-by some initial distribution <span
-class="math inline">\(\rho_0\)</span>. We start by thinking about the
-distribution <span class="math inline">\(P(\tau | \pi_\theta)\)</span>,
-and especially about <span class="math inline">\(\nabla P(\tau |
-\pi_\theta)\)</span>. Intuitively, we have the following:</p>
+<h2 id="approximate-value-functions-methods.">Approximate value
+function methods.</h2>
+<p>One major drawback of Exact Policy Iteration is that it is
+only guaranteed to improve the performance of the policy if the new
+policy has a non-negative advantage in every state. This is very hard to
+guarantee in practice. Thus, instead of trying to optimize <span
+class="math inline">\(\eta(\pi)\)</span> directly, we are going to
+define a <em>surrogate</em> objective function. To do this, let us fix
+some notation first.</p>
+<p>Suppose that we have a fixed initial distribution, say <span
+class="math inline">\(\mu\)</span>. We will implicitly assume that this
+exists, but won’t mention it every time. Then, the
+<em>discounted visitation frequencies</em> are given by: <span
+class="math display">\[\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) +
+\cdots\]</span> And we define the <em>policy advantage</em> <span
+class="math inline">\(\mathbb{A}_{\pi}(\pi&#39;)\)</span> of a policy
+<span class="math inline">\(\pi&#39;\)</span> over another policy <span
+class="math inline">\(\pi\)</span> as <span
+class="math display">\[\mathbb{A}_{\pi}(\pi&#39;) = E_{s \sim d_{\pi,
+\mu}} [E_{a \sim \pi&#39;(a;s)} [A_\pi(s, a)]]\]</span> We claim that
+the function <span class="math inline">\(L_\pi(\pi_\theta) = J(\pi) +
+\mathbb{A}_{\pi} (\pi_\theta)\)</span> (as a function of <span
+class="math inline">\(\theta\)</span>) matches <span
+class="math inline">\(J(\pi_\theta)\)</span> up to first order. One way
+to prove this is to consider the expansion: <span
+class="math display">\[\begin{align*}
+J(\pi_\theta) &amp;= J(\pi) + \sum_t \sum_s P(s_t = s | \pi_\theta)
+\sum_a \pi_\theta(a | s) \gamma^t A_\pi(s, a)\\
+&amp;= \cdots\\
+&amp;= J(\pi) + \sum_s \rho_{\pi_\theta}(s) \sum_a \pi_\theta(a|s)
+A_\pi(s,a)
+\end{align*}\]</span> And then take the derivative and compare it to the
+derivative of <span class="math inline">\(L\)</span>.</p>
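The quantities above are easy to compute exactly in a small tabular MDP. The sketch below is not from the commit: the MDP, names and helper functions are invented for illustration. It builds rho_pi by solving a linear system and evaluates the surrogate in the rho-weighted form of the displayed expansion; the expectation over d_{pi,mu} used in the definition differs from this form by a 1/(1-gamma) normalisation.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

# A small, arbitrary MDP: P[a, s, s'] transition kernel, r[s, a] rewards, mu initial distribution.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(size=(n_states, n_actions))
mu = np.full(n_states, 1.0 / n_states)

def policy_matrices(pi):
    """State transition matrix and reward vector induced by a stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,ast->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    return P_pi, r_pi

def value_and_advantage(pi):
    P_pi, r_pi = policy_matrices(pi)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)   # V_pi
    Q = r + gamma * np.einsum("ast,t->sa", P, V)                  # Q_pi(s, a)
    return V, Q - V[:, None]                                      # A_pi = Q_pi - V_pi

def discounted_visitation(pi):
    """rho_pi(s) = sum_t gamma^t P(s_t = s), starting from mu."""
    P_pi, _ = policy_matrices(pi)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu)

def surrogate(pi, pi_new):
    """L_pi(pi_new) in the rho-weighted form of the expansion above."""
    V, A = value_and_advantage(pi)
    rho = discounted_visitation(pi)
    J = mu @ V
    return J + np.einsum("s,sa,sa->", rho, pi_new, A)

pi = np.full((n_states, n_actions), 0.5)          # uniform policy
pi_prime = np.eye(n_actions)[[0, 1, 0]]           # some deterministic policy
print("L_pi(pi)       =", surrogate(pi, pi))       # equals J(pi): the advantage of pi over itself is 0
print("L_pi(pi_prime) =", surrogate(pi, pi_prime))
```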
+<p><strong>Exercise.</strong> Fill the gaps in the previous
+expansion.</p>
+<p><strong>Exercise.</strong> Using the outline above, or in any other
+way, prove that <span class="math inline">\(L_\pi(\pi_\theta)\)</span>
+agrees with <span class="math inline">\(J(\pi_\theta)\)</span> up to
+first order.</p>
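For the second exercise it may help to spell out what "agrees up to first order" means; the following is the standard statement of the claim, written here for reference rather than quoted from the commit.

```latex
% First-order agreement of the surrogate with the true objective,
% at the parameter value \theta_{\mathrm{old}} with \pi = \pi_{\theta_{\mathrm{old}}}:
\[
  L_{\pi_{\theta_{\mathrm{old}}}}(\pi_{\theta_{\mathrm{old}}}) = J(\pi_{\theta_{\mathrm{old}}}),
  \qquad
  \nabla_\theta L_{\pi_{\theta_{\mathrm{old}}}}(\pi_\theta)\big|_{\theta = \theta_{\mathrm{old}}}
  = \nabla_\theta J(\pi_\theta)\big|_{\theta = \theta_{\mathrm{old}}}.
\]
```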
+<p>There is a natural question to ask now.</p>
 <blockquote>
-<p>If you think about a trajectory, there are two “components”: the
-user, and the universe ones. If we perturb <span
-class="math inline">\(\theta\)</span>, the effect of this should be
-expressible in terms of derivatives of the policy only, since the
-actions of the universe do not depend on <span
-class="math inline">\(\theta\)</span>.</p>
+<p>We could simply optimize <span class="math inline">\(J\)</span> by
+substituting it with <span class="math inline">\(L\)</span>, and using
+some sort of gradient based method. But how long should our step size
+be? Is there any way to use the fact that <span
+class="math inline">\(L\)</span> and <span
+class="math inline">\(J\)</span> are not just arbitrary functions, but
+have a fair amount of structure due to the fact that they arise as RL
+reward functions?</p>
 </blockquote>
-<p>Let us turn this into mathematics now. Note that <span
-class="math display">\[\nabla_\theta P(\tau | \pi_\theta) = P(\tau |
-\pi_\theta) \left( \nabla_\theta \log P(\tau | \pi_{\theta}) \right)
-\]</span> The probability of a trajectory <span
-class="math inline">\(\tau = (s_0, a_0, \ldots)\)</span> given an
-initial distribution <span class="math inline">\(\theta\)</span> is
-given by: <span class="math display">\[P(\tau | \theta) = \rho(s_0)
-\prod_i P(s_{i+1} ; s_i, a_i) \pi_{\theta}(a_i | s_i)\]</span> This way
-<span class="math display">\[\log P(\tau | \theta) = \log \rho (s_0) +
-\sum_i \log P(s_{i+1}; s_i, a_i) + \pi_{\theta}(a_i|s_i)\]</span> So if
-now we take the derivative of this with respect to <span
-class="math inline">\(\theta\)</span>, all the <span
-class="math inline">\(P(s_{i+1}; s_i, a_i)\)</span> terms will vanish,
-since they are not dependent on the parameter <span
-class="math inline">\(\theta\)</span>.</p>
-<h2 id="mixture-policies.">Mixture policies.</h2>
+<h3 id="mixture-policies.">Mixture policies.</h3>
 <p><em>The idea of this section was originally published in <a
 href="https://dl.acm.org/doi/10.5555/645531.656005">here</a></em>.</p>
-<p>The main idea here is that instead of simply thinking about <span
-class="math inline">\(\eta_\rho(\pi)\)</span>, we investigate the
-expression <span class="math display">\[\eta_\mu(\pi) = E_{s \sim \mu}
-[V_\pi(s)]\]</span> for any <em>restart distribution</em> <span
-class="math inline">\(\mu\)</span>. While it is clear, that a policy
-that is optimal for <span class="math inline">\(\rho\)</span> is also
-optimal for any other restart distribution (why?), this statement
-becomes false, if we restrict to a subset of policies (e.g. the codomain
-of some parametric family of distributions).</p>
-<p>We saw before a naive way of updating policies. by simply replacing a
-policy with another one, hopefully constructed in such a way that it
-performs better in every situation. Instead, we can combine two policies
-in a more subtle way: <span
+<p>Let’s try to apply the previous idea in the simplest setting,
+i.e. along a line. <span
 class="math display">\[\pi_{\text{new}}^{\alpha}(a;s) = (1 - \alpha)
 \pi(a;s) + \alpha \pi&#39;(a; s)\]</span> Where <span
-class="math inline">\(0 \leq \alpha \leq 1\)</span>. Our hope is that if
-the policy <span class="math inline">\(\pi&#39;\)</span> is
-<em>sometimes better</em> than <span class="math inline">\(\pi\)</span>,
-then there is an <span class="math inline">\(\alpha\)</span> such that
-the resulting <span class="math inline">\(\pi_{\text{new}}\)</span> is a
-better strategy that <span class="math inline">\(\pi\)</span>.</p>
-<p>To understand this idea more deeply, we define the <strong>policy
-advantage</strong> <span class="math inline">\(\mathbb{A}_{\pi,
-\mu}(\pi&#39;)\)</span> of a policy <span
-class="math inline">\(\pi&#39;\)</span> over another policy <span
-class="math inline">\(\pi\)</span> as <span
-class="math display">\[\mathbb{A}_{\pi, \mu}(\pi&#39;) = E_{s \sim
-d_{\pi, \mu}} [E_{a \sim \pi&#39;(a;s)} [A_\pi(s, a)]]\]</span> This
-way, using the simple gradient result: <span
-class="math display">\[\frac{\partial \eta_\mu}{\partial \alpha}
-\bigg|_{\alpha = 0} = \frac{1}{1-\gamma} \mathbb{A}_{\pi,
-\mu}(\pi&#39;)\]</span> Thus a positive advantage implies the existence
-of a sufficiently small <span class="math inline">\(\alpha\)</span> such
-that the policy <span class="math inline">\(\pi_{\text{new}}\)</span> is
-better than <span class="math inline">\(\pi\)</span>. We summarize this
-in the following statement:</p>
-<p><strong>Theorem.</strong> Given a policies <span
-class="math inline">\(\pi\)</span> and <span
-class="math inline">\(\pi&#39;\)</span> and a starting distribution
-<span class="math inline">\(\mu\)</span>, there is <span
-class="math inline">\(0 &lt; \alpha \leq 1\)</span> such that the policy
-<span class="math inline">\(\pi_{\text{new}}^\alpha\)</span> is better
-than <span class="math inline">\(\pi\)</span> with the starting
-distribution <span class="math inline">\(\mu\)</span> <em>iff</em> <span
-class="math inline">\(\mathbb{A}_{\pi, \mu}(\pi&#39;) &gt;
-0\)</span>.</p>
+class="math inline">\(0 \leq \alpha \leq 1\)</span>. We have:</p>
+<p><span class="math display">\[\begin{align*}
+J(\pi_\text{new}^\alpha) &amp;= L_\pi(\pi_\text{new}^\alpha) + O(\alpha^2)\\
+&amp;= J(\pi) + \mathbb{A}_{\pi}((1- \alpha) \pi + \alpha \pi&#39;) +
+O(\alpha^2)\\
+&amp;= J(\pi) + \alpha \mathbb{A}_{\pi}(\pi&#39;) + O(\alpha^2)
+\end{align*}\]</span> That is: <span
+class="math display">\[\frac{\partial J}{\partial \alpha} \bigg|_{\alpha
+= 0} = \frac{1}{1-\gamma} \mathbb{A}_{\pi}(\pi&#39;)\]</span> Thus
+a positive advantage implies the existence of a sufficiently small <span
+class="math inline">\(\alpha\)</span> such that the policy <span
+class="math inline">\(\pi_{\text{new}}\)</span> is better than <span
+class="math inline">\(\pi\)</span>.</p>
 <p><strong>Theorem.</strong> Let <span
 class="math inline">\(\mathbb{A}\)</span> be the policy advantage of
 <span class="math inline">\(\pi&#39;\)</span> with respect to <span
@@ -393,8 +388,8 @@ <h2 id="mixture-policies.">Mixture policies.</h2>
 class="math display">\[\eta_\mu(\pi_{new}) - \eta_{\mu}(\pi) \geq
 \frac{\alpha}{1 - \gamma} \left( \mathbb{A} - \frac{2 \alpha \gamma
 \varepsilon}{1- \gamma(1 - \alpha)} \right)\]</span></p>
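The bound is explicit enough to suggest a step size: for given values of the policy advantage, epsilon and gamma, one can simply maximize the right-hand side over alpha in [0, 1] numerically. A sketch with made-up numbers (not part of the commit):

```python
import numpy as np

def improvement_bound(alpha, A, eps, gamma):
    """Right-hand side of the theorem: alpha/(1-gamma) * (A - 2*alpha*gamma*eps / (1 - gamma*(1 - alpha)))."""
    return alpha / (1 - gamma) * (A - 2 * alpha * gamma * eps / (1 - gamma * (1 - alpha)))

A, eps, gamma = 0.05, 0.2, 0.99          # made-up policy advantage, epsilon, and discount
alphas = np.linspace(0.0, 1.0, 10001)
values = improvement_bound(alphas, A, eps, gamma)
best = alphas[values.argmax()]
print(f"best alpha = {best:.4f}, guaranteed improvement >= {values.max():.6f}")
```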
-<h2 id="beyond-mixture-policies---trpo">Beyond mixture policies -
-TRPO</h2>
+<h3 id="beyond-mixture-policies---trpo">Beyond mixture policies -
+TRPO</h3>
 <p><em>The idea of this section was originally published in <a
 href="https://arxiv.org/abs/1502.05477">here</a></em>.</p>
 <p>Fundamentally, we are motivated by the following thought, that
@@ -423,7 +418,33 @@ <h2 id="beyond-mixture-policies---trpo">Beyond mixture policies -
 D_{TV}^{\text{max}}(\pi_{\text{old}}, \pi_{\text{new}})\)</span>. Then:
 <span class="math display">\[\eta(\pi_{\text{new}}) \geq
 L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4 \varepsilon \gamma}{(1
-- \gamma)^2}\]</span></p>
+- \gamma)^2} \alpha^2\]</span></p>
+<p>Now, in practice computing the total variation divergence is not
+always feasible. It turns out though that it relates to the
+KL-divergence in a very useful way: <span
+class="math display">\[D_{\text{TV}}(p || q)^2 \leq D_{\text{KL}}(p
+|| q)\]</span> And we set <span
+class="math display">\[D_{\text{KL}}^{\text{max}} (\pi, \tilde{\pi}) =
+\max_s D_{\text{KL}}(\pi(\cdot | s) || \tilde{\pi}(\cdot | s))\]</span>
+Which yields the following form of the above theorem: <span
+class="math display">\[\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) -
+\frac{4 \varepsilon \gamma}{(1 - \gamma)^2} D_{\text{KL}}^{\text{max}}(
+\pi, \tilde{\pi})\]</span></p>
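For tabular policies the KL-penalized bound is straightforward to evaluate: D_KL^max is just a maximum over states of per-state KL divergences. A self-contained sketch (the policies and the value of the constant epsilon are made up for illustration):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def max_kl(pi, pi_tilde):
    """D_KL^max(pi, pi_tilde) = max_s KL(pi(.|s) || pi_tilde(.|s)) for tabular policies pi[s, a]."""
    return max(kl(pi[s], pi_tilde[s]) for s in range(pi.shape[0]))

rng = np.random.default_rng(2)
pi = rng.dirichlet(np.ones(3), size=5)        # pi(a|s), 5 states, 3 actions
pi_tilde = rng.dirichlet(np.ones(3), size=5)

gamma, eps = 0.99, 0.2                         # made-up value for the constant epsilon in the theorem
C = 4 * eps * gamma / (1 - gamma) ** 2         # penalty coefficient from the bound
print("D_KL^max =", max_kl(pi, pi_tilde))
print("penalty  =", C * max_kl(pi, pi_tilde))  # eta(pi_tilde) >= L_pi(pi_tilde) - penalty
```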
+<p>An interesting remark to note here.</p>
+<blockquote>
+<p>We could consider the policy optimization iteration given by the
+pseudocode:</p>
+<pre><code>while (not converged):
+    pi_new = argmax_pi L_{pi_old}(pi) - C * D_KL_max(pi_old, pi)</code></pre>
+<p>This iteration can easily be shown to improve the policy
+monotonically. The interesting thing is to note the similarity between
+this method and <a href="proximal_methods">proximal methods</a> in
+optimization theory. It turns out that if one simply optimizes this
+objective function with the constant <span
+class="math inline">\(C\)</span> as suggested above, the step sizes of
+the method end up being really small. Instead, one can consider a <a
+href="trust_region_type_methods">trust-region type method</a>:</p>
+</blockquote>
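One hedged way the penalized iteration in the blockquote could look for tabular policies is sketched below; a crude random search stands in for the argmax, and the MDP is invented for illustration. This is not the TRPO algorithm from the linked paper, which maximizes the surrogate subject to a KL constraint rather than a penalty.

```python
import numpy as np

# A tiny made-up MDP, in the same spirit as the earlier sketch.
rng = np.random.default_rng(3)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))     # P[a, s, s']
r = rng.uniform(size=(nS, nA))
mu = np.full(nS, 1.0 / nS)

def value_adv_rho(pi):
    P_pi = np.einsum("sa,ast->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum("ast,t->sa", P, V)
    rho = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
    return mu @ V, Q - V[:, None], rho            # J(pi), A_pi, rho_pi

def max_kl(p, q):
    return max(float(np.sum(p[s] * np.log(p[s] / q[s]))) for s in range(nS))

def penalized_objective(pi, pi_cand, C):
    J, A, rho = value_adv_rho(pi)
    L = J + np.einsum("s,sa,sa->", rho, pi_cand, A)   # surrogate L_pi(pi_cand)
    return L - C * max_kl(pi, pi_cand)

# KL-penalized iteration; C is arbitrary here, whereas the value
# 4*eps*gamma/(1-gamma)**2 from the bound above is what guarantees monotone improvement.
C = 1.0
pi = np.full((nS, nA), 1.0 / nA)
for it in range(10):
    best_val, best_pi = penalized_objective(pi, pi, C), pi
    for _ in range(200):                  # sample candidates near pi and keep the best
        alpha = rng.uniform(0, 0.3)
        cand = (1 - alpha) * pi + alpha * rng.dirichlet(np.ones(nA), size=nS)
        val = penalized_objective(pi, cand, C)
        if val > best_val:
            best_val, best_pi = val, cand
    pi = best_pi
    print(f"iteration {it}: J(pi) = {value_adv_rho(pi)[0]:.4f}")
```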
 <h1 id="references">References</h1>
 <p><a
 href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient">OpenAI
