
Commit c317975

[mdp] equation clarifications
1 parent 9caa8a4 commit c317975

1 file changed

Lines changed: 4 additions & 4 deletions

File tree

mdp.qmd

@@ -11,7 +11,7 @@ In this chapter, we'll first study *Markov decision processes* (MDPs), which pro
 Formally, a Markov decision process is $\langle \mathcal S, \mathcal A, \mathrm{T}, \mathrm{R},
 \gamma\rangle$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and:

-- $\mathrm{T} : \mathcal S \times \mathcal A \times \mathcal S \rightarrow \mathbb{R}$ is a *transition model*, where
+- $\mathrm{T} : \mathcal S \times \mathcal A \times \mathcal S \rightarrow [0, 1]$ is a *transition model*, where
 $$
 \mathrm{T}(s, a, s') = Pr(S_t = s'|S_{t - 1} = s,
 A_{t - 1} = a)\;\;,
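
As a quick illustration of the $[0, 1]$ codomain in the changed line: each value of $\mathrm{T}$ is a probability, and for every $(s, a)$ the values sum to one over $s'$. The numpy sketch below uses a made-up 3-state, 2-action MDP purely for illustration.

```python
import numpy as np

# Made-up 3-state, 2-action MDP, for illustration only.
# T[s, a, s2] = Pr(S_t = s2 | S_{t-1} = s, A_{t-1} = a)
T = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]],
    [[0.0, 0.5, 0.5], [0.1, 0.0, 0.9]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])

# Every entry is a probability, hence the [0, 1] codomain ...
assert np.all((0.0 <= T) & (T <= 1.0))
# ... and each (s, a) pair indexes a distribution over next states.
assert np.allclose(T.sum(axis=-1), 1.0)
```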
@@ -31,7 +31,7 @@ The notation $S_t = s'$ uses a capital letter $S$ to stand for

 MDPs also satisfy the Markov property, which means the next-state distribution depends only on the current state and action, not on the past. Formally, the Markov property is expressed as:
 $$
-Pr(S_{t+1} = s_{t+1} | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots, S_0 = s_0, A_0 = a_0) = Pr(S_{t+1} = s_{t+1} | S_t = s_t, A_t = a_t)
+Pr(S_{t} = s_{t} | S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots, S_0 = s_0, A_0 = a_0) = Pr(S_{t} = s_{t} | S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1})
 $$

 In other words, the future is only dependent on the present, and not on the past.
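
Operationally, the Markov property means a simulator needs only the current state and action to sample the next state. A minimal sketch, with a made-up randomly generated transition model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up transition model: normalize random positives so that each
# T[s, a, :] is a probability distribution over next states.
T = rng.random((3, 2, 3))
T /= T.sum(axis=-1, keepdims=True)

def step(s, a):
    # The next state depends only on (s, a); the earlier history
    # (s_0, a_0, ..., s_{t-2}, a_{t-2}) never enters this call.
    return int(rng.choice(T.shape[-1], p=T[s, a]))

s, trajectory = 0, [0]
for _ in range(5):
    s = step(s, int(rng.integers(2)))
    trajectory.append(s)
print(trajectory)
```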
@@ -162,7 +162,7 @@ $${#eq-exp_infinite}
 Note that the $t$ indices here are not the number of steps to go, but actually the number of steps forward from the starting state (there is no sensible notion of "steps to go" in the infinite horizon case).

 :::{.callout-note}
-@eq-exp_finite and @eq-exp_infinite are a conceptual stepping stone. Our main objective is to get to @eq-inf_horiz_value, which can also be viewed as including $\gamma$ in @eq-finite_value, with the appropriate definition of the infinite-horizon value.
+@eq-exp_finite and @eq-exp_infinite are a conceptual stepping stone. @eq-inf_horiz_value is the infinite-horizon analogue of @eq-finite_value.
 :::

 There are two good intuitive motivations for discounting. One is related to economic theory and the present value of money: you'd generally rather have some money today than that same amount of money next week (because you could use it now or invest it). The other is to think of the whole process terminating, with probability $1-\gamma$ on each step of the interaction. (At every
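
A hedged numerical check of the termination intuition (all constants made up): if a constant reward $r$ is collected each step and the process ends with probability $1-\gamma$ after every step, the expected total (undiscounted) reward matches the discounted infinite-horizon value $r/(1-\gamma)$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, r = 0.9, 1.0  # made-up discount factor and per-step reward

# Monte Carlo estimate: each episode collects r per step and terminates
# with probability 1 - gamma after every step.
returns = []
for _ in range(100_000):
    total = 0.0
    while True:
        total += r
        if rng.random() < 1.0 - gamma:
            break
    returns.append(total)

print(np.mean(returns))   # roughly 10, up to sampling noise
print(r / (1.0 - gamma))  # discounted value of a constant reward stream: 10.0
```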
@@ -250,7 +250,7 @@ We can also define the action-value function for a fixed policy $\pi$, denoted b
 Similar to $\mathrm{V}^{\pi}_h(s)$, $\mathrm{Q}^{\pi}_h(s,a)$ satisfies the Bellman recursion/equations introduced earlier. In fact, for a deterministic policy $\pi$:

 $$
-\mathrm{Q}^{\pi}_h(s,\pi(s)) = \mathrm{V}^{\pi}_h(s).
+\mathrm{Q}^{\pi}_h(s,\pi_h(s)) = \mathrm{V}^{\pi}_h(s).
 $$

 However, since our primary goal in dealing with action values is typically to identify an *optimal* policy, we will not dwell extensively on $\mathrm{Q}^{\pi}_h(s,a)$. Instead, we will place more emphasis on the optimal action-value functions $\mathrm{Q}^{*}_h(s,a)$.
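
A small sketch (assuming the finite-horizon, undiscounted convention where $h$ counts steps to go and $\mathrm{V}^{\pi}_0 = 0$) that computes $\mathrm{V}^{\pi}_h$ and $\mathrm{Q}^{\pi}_h$ by separate Bellman recursions on a made-up MDP with a made-up deterministic policy, and confirms the identity from the changed line.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, H = 4, 2, 5  # made-up state/action counts and horizon

# Made-up MDP: T[s, a, :] is a next-state distribution, R[s, a] a reward.
T = rng.random((nS, nA, nS))
T /= T.sum(axis=-1, keepdims=True)
R = rng.random((nS, nA))
pi = rng.integers(nA, size=(H + 1, nS))  # deterministic policy pi_h(s)

V = np.zeros((H + 1, nS))  # V_0 = 0: no steps to go
Q = np.zeros((H + 1, nS, nA))
s_idx = np.arange(nS)
for h in range(1, H + 1):
    # Q_h(s, a) = R(s, a) + sum_{s'} T(s, a, s') V_{h-1}(s')
    Q[h] = R + T @ V[h - 1]
    # V_h(s) = R(s, pi_h(s)) + sum_{s'} T(s, pi_h(s), s') V_{h-1}(s')
    V[h] = R[s_idx, pi[h]] + T[s_idx, pi[h]] @ V[h - 1]

# Q^pi_h(s, pi_h(s)) equals V^pi_h(s) for every h and s.
assert np.allclose(Q[np.arange(H + 1)[:, None], s_idx, pi], V)
print("Q^pi_h(s, pi_h(s)) == V^pi_h(s) verified")
```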

0 commit comments
