Calculating the Return Value of a Given Policy

According to Theorem 4.3 from the previous lecture, for each stochastic history-dependent policy $ \pi=(d_1, d_2, \ldots) \in
\Pi^{HR}$ there exists a Markovian stochastic policy $\pi^{'}=(d_1^{'}, d_2^{'}, \ldots) \in \Pi^{MR}$ that has the same return, i.e., $ v_\lambda^\pi=v_\lambda^{\pi^{'}}$.

Let $\pi\in\Pi^{MR}$. Then

\begin{eqnarray*}v_\lambda^\pi(s) & = &
E_s^{\pi}[\sum_{t=1}^\infty\lambda^{t-1}r(X_t,Y_t)] =
\sum_{t=1}^{\infty}\lambda^{t-1}P_{\pi}^{t-1}r_{d_{t}}
\end{eqnarray*}

where $P_{\pi}^{t-1}=P_{d_{1}}P_{d_{2}}\cdots P_{d_{t-1}}$ is the $(t-1)$-step transition matrix induced by $\pi$ (with $P_{\pi}^{0}=I$).
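The series above can be evaluated numerically by truncating it at a large horizon. The sketch below assumes a small hypothetical 2-state MDP under a stationary policy (so $P_\pi^{t-1}=P^{t-1}$); the matrix $P$, reward vector $r$, and discount $\lambda$ are illustrative choices, not values from the lecture.

```python
import numpy as np

# Hypothetical induced transition matrix P_d and reward vector r_d
# for a 2-state MDP under a fixed stationary policy (assumed for
# illustration only).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
lambda_ = 0.9  # discount factor

# Truncated version of v = sum_{t>=1} lambda^{t-1} P^{t-1} r.
v = np.zeros(2)
Pt = np.eye(2)          # P^0 = I
for t in range(1, 200):
    v += lambda_ ** (t - 1) * Pt @ r
    Pt = Pt @ P         # advance to P^t
print(v)
```

Since $\lambda<1$, the tail of the series is geometrically small, so a few hundred terms already give the return to high accuracy.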



\begin{eqnarray*}\vec{v}_\lambda^\pi & = & \vec{r}_{d_{1}} + \lambda
P_{d_{1}}[\underbrace{\vec{r}_{d_{2}}+\lambda
P_{d_{2}}\vec{r}_{d_{3}}+\ldots}_{v_\lambda^{\pi^{'}}}]
\end{eqnarray*}


(where $\pi^{'}=(d_2, d_3, \ldots)$ is the policy $\pi$ shifted by one step, i.e., $\pi$ as followed from the second step onward), so

\begin{eqnarray*}\vec{v}_{\lambda}^{\pi} = \vec{r}_{d_{1}}+\lambda
P_{d_{1}}\vec{v}_{\lambda}^{\pi^{'}}
\end{eqnarray*}


If $\pi$ is stationary, i.e., $d_t=d_1$ for all $t$, then $\pi^{'}=\pi$ and

\begin{eqnarray*}\vec{v}_{\lambda}^{\pi} = \vec{r}_{d_{1}}+\lambda
P_{d_{1}}\vec{v}_{\lambda}^{\pi}
\end{eqnarray*}


All the parameters aside from $\vec{v}_\lambda^\pi$ are known; thus we have a system of linear equations of the form $\vec{x}=\vec{r}_{d_{1}}+\lambda P_{d_{1}}\vec{x}$, or equivalently $(I-\lambda P_{d_{1}})\vec{x}=\vec{r}_{d_{1}}$. We will show that this system has a unique solution, which is $\vec{v}_\lambda^\pi$.
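In practice, the linear system can be solved directly rather than by summing the series. The sketch below uses a hypothetical 2-state example (the matrix $P_{d_1}$, reward $\vec{r}_{d_1}$, and discount $\lambda$ are assumed values for illustration) and verifies that the solution satisfies the fixed-point equation.

```python
import numpy as np

# Hypothetical transition matrix P_{d1} and reward vector r_{d1}
# (illustrative values, not from the lecture).
P = np.array([[0.5, 0.5],
              [0.1, 0.9]])
r = np.array([2.0, -1.0])
lambda_ = 0.95

# Solve x = r + lambda * P x, rewritten as (I - lambda * P) x = r.
v = np.linalg.solve(np.eye(2) - lambda_ * P, r)

# The solution is exactly the fixed point v = r + lambda * P v.
assert np.allclose(v, r + lambda_ * P @ v)
print(v)
```

For an $N$-state MDP this is a single $N \times N$ linear solve; invertibility of $I-\lambda P_{d_1}$ for $\lambda<1$ is exactly the uniqueness claim established next.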


 

Yishay Mansour
1999-11-24