mdp.qmd (+4 −4: 4 additions & 4 deletions)
@@ -11,7 +11,7 @@ In this chapter, we'll first study *Markov decision processes* (MDPs), which pro
Formally, a Markov decision process is $\langle \mathcal S, \mathcal A, \mathrm{T}, \mathrm{R},
\gamma\rangle$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and:
-- $\mathrm{T} : \mathcal S \times \mathcal A \times \mathcal S \rightarrow \mathbb{R}$ is a *transition model*, where
+- $\mathrm{T} : \mathcal S \times \mathcal A \times \mathcal S \rightarrow [0, 1]$ is a *transition model*, where
$$
\mathrm{T}(s, a, s') = \Pr(S_t = s' \mid S_{t-1} = s, A_{t-1} = a)\;\;,
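
To make the transition model concrete, here is a minimal sketch of a tabular $\mathrm{T}$ stored as a nested dictionary. The two-state MDP, its action names, and all probabilities below are invented for illustration and do not come from the chapter:

```python
# A toy tabular transition model T(s, a, s') for a two-state MDP.
# States, actions, and probabilities are illustrative only.
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}

# T(s, a, .) is a probability distribution over next states, so for each
# fixed (s, a) the values must sum to 1 -- matching the [0, 1] codomain.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

def transition_prob(s, a, s_next):
    """Return T(s, a, s') = Pr(S_t = s' | S_{t-1} = s, A_{t-1} = a)."""
    return T[(s, a)].get(s_next, 0.0)

print(transition_prob("s0", "go", "s1"))  # 0.8
```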
@@ -31,7 +31,7 @@ The notation $S_t = s'$ uses a capital letter $S$ to stand for
-MDPs also satisfy the Markov property, which means the next-state distribution depends only on the current state and action, not on the past. Formally, the Markov property is expressed as:
+In other words, the future is only dependent on the present, and not on the past.
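
In code, the Markov property amounts to the successor distribution being a function of the current $(s, a)$ pair alone. A minimal sketch (the tiny transition table is invented for illustration):

```python
import random

# The next-state distribution takes only (current state, action) --
# there is no history argument anywhere, which is the Markov property.
T = {("s0", "go"): {"s0": 0.2, "s1": 0.8},
     ("s1", "go"): {"s0": 0.5, "s1": 0.5}}

def sample_next_state(s, a, rng):
    """Draw S_t ~ T(s, a, .); earlier states never enter the computation."""
    dist = T[(s, a)]
    return rng.choices(list(dist), weights=list(dist.values()))[0]

rng = random.Random(0)
s = "s0"
for _ in range(5):
    s = sample_next_state(s, "go", rng)  # depends only on the current s
```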
@@ -162,7 +162,7 @@ $${#eq-exp_infinite}
Note that the $t$ indices here are not the number of steps to go, but actually the number of steps forward from the starting state (there is no sensible notion of "steps to go" in the infinite horizon case).
:::{.callout-note}
-@eq-exp_finite and @eq-exp_infinite are a conceptual stepping stone. Our main objective is to get to @eq-inf_horiz_value, which can also be viewed as including $\gamma$ in @eq-finite_value, with the appropriate definition of the infinite-horizon value.
+@eq-exp_finite and @eq-exp_infinite are a conceptual stepping stone. @eq-inf_horiz_value is the infinite-horizon analogue of @eq-finite_value.
:::
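
Since the referenced equations sit outside this diff, here is a plausible sketch of the relationship the callout draws, under the standard definitions; the chapter's exact statements of @eq-finite_value and @eq-inf_horiz_value may differ:

$$
\mathrm{V}^{\pi}_h(s) = \mathbb{E}\left[\sum_{t=0}^{h-1} \mathrm{R}(s_t, a_t)\right]
\qquad\longrightarrow\qquad
\mathrm{V}^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\,\mathrm{R}(s_t, a_t)\right]\;\;,
$$

that is, the infinite-horizon value inserts $\gamma^t$ and lets $h \to \infty$; the sum converges when $0 \le \gamma < 1$ and rewards are bounded.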
There are two good intuitive motivations for discounting. One is related to economic theory and the present value of money: you'd generally rather have some money today than that same amount of money next week (because you could use it now or invest it). The other is to think of the whole process terminating, with probability $1-\gamma$ on each step of the interaction. (At every
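
The termination view can be checked numerically: an undiscounted process that halts with probability $1-\gamma$ after each step has the same expected return as the $\gamma$-discounted sum. A minimal sketch (the constant reward stream and the value of `gamma` are invented for illustration):

```python
import random

gamma = 0.9
rewards = [1.0] * 50  # illustrative constant reward stream

# Discounted return: sum_t gamma^t * r_t.
discounted = sum(gamma**t * r for t, r in enumerate(rewards))

def run_episode(rng):
    """Undiscounted return of one episode with per-step termination."""
    total = 0.0
    for r in rewards:
        total += r
        if rng.random() > gamma:  # terminate with probability 1 - gamma
            break
    return total

rng = random.Random(0)
n = 100_000
mc_estimate = sum(run_episode(rng) for _ in range(n)) / n

print(f"discounted sum : {discounted:.3f}")   # (1 - gamma**50) / (1 - gamma)
print(f"MC termination : {mc_estimate:.3f}")  # close to the discounted sum
```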
@@ -250,7 +250,7 @@ We can also define the action-value function for a fixed policy $\pi$, denoted b
-Similar to $\mathrm{V}^{\pi}_h(s)$, $\mathrm{Q}^{\pi}_h(s,a)$ satisfies the Bellman recursion/equations introduced earlier. In fact, for a deterministic policy $\pi$:
+However, since our primary goal in dealing with action values is typically to identify an *optimal* policy, we will not dwell extensively on $\mathrm{Q}^{\pi}_h(s,a)$. Instead, we will place more emphasis on the optimal action-value functions $\mathrm{Q}^{*}_h(s,a)$.
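
To ground $\mathrm{Q}^{*}_h$, here is a minimal backward-induction sketch under the standard finite-horizon Bellman optimality recursion, reusing the illustrative toy MDP shape from the earlier sketch (rewards and all numbers are invented). Whether the chapter's finite-horizon recursion includes $\gamma$ is not visible in this diff; set `gamma = 1.0` for the undiscounted convention:

```python
# Finite-horizon optimal action values via backward induction:
#   Q*_0(s, a) = 0
#   Q*_h(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * max_{a'} Q*_{h-1}(s', a')
gamma = 0.9
states = ["s0", "s1"]
actions = ["stay", "go"]
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}

def optimal_q(horizon):
    """Return a dict mapping (s, a) to Q*_horizon(s, a)."""
    Q = {(s, a): 0.0 for s in states for a in actions}  # Q*_0 = 0
    for _ in range(horizon):
        Q = {
            (s, a): R[(s, a)] + gamma * sum(
                p * max(Q[(s2, a2)] for a2 in actions)
                for s2, p in T[(s, a)].items()
            )
            for s in states for a in actions
        }
    return Q

Q10 = optimal_q(10)
best = {s: max(actions, key=lambda a: Q10[(s, a)]) for s in states}
print(best)  # greedy (optimal) action per state at horizon 10
```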