A Comparison between VI and PI Algorithms

In this section we compare the convergence rates of the VI and PI algorithms. It will be shown that, assuming the two algorithms begin with the same approximated value, the PI algorithm converges at least as fast to the optimal value.

Theorem 6.6   Let $\{u_{n}\}$ be the sequence of values created by the VI algorithm (where $u_{n+1}=Lu_{n}$) and let $\{v_{n}\}$ be the sequence of values created by the PI algorithm. If $u_{0}=v_{0}$, then $\forall n,\ u_{n}\leq v_{n}\leq v^*_\lambda$.

Proof: We prove the theorem by induction on $n$.
Induction Basis: By assumption $u_{0}=v_{0}$. Since $v_{0}$ is the return value of a specific policy, it is clearly at most the optimal return value. Therefore $u_{0} \leq v_{0} \leq v^*_\lambda$.
Induction Step:

\begin{eqnarray*}
u_{n+1} = Lu_{n} = L_{p_{n}}u_{n} \ \ \ \ \left[\mbox{where } p_{n} \in \mathop{\rm argmax}_{d \in \Pi^{MD}}\{r_{d} + \lambda P_{d}u_{n}\}\right]
\end{eqnarray*}


From the induction hypothesis $u_n \leq v_n$, and since $L_{p_{n}}$ is monotonic, it follows that:

\begin{eqnarray*}
L_{p_{n}}u_{n} & \leq & L_{p_{n}}v_{n}
\end{eqnarray*}


Since $L$ takes the maximum over all policies:

\begin{eqnarray*}
L_{p_{n}}v_{n} & \leq & Lv_{n}
\end{eqnarray*}


We denote the policy determined by the PI algorithm in iteration $n$ by $d_{n}$. Since $d_{n}$ is the greedy policy with respect to $v_{n}$, we have $Lv_{n}=L_{d_{n}}v_{n}$.

From the Optimality Equations, $v_{n} \leq v_{\lambda}^{d_{n}}$; by the monotonicity of $L_{d_{n}}$ we get:

\begin{eqnarray*}
L_{d_{n}}v_{n} & \leq & L_{d_{n}}v_{\lambda}^{d_{n}}
\end{eqnarray*}


From the definition of $v_{n+1}$ in the PI algorithm, $v_{n+1}=v_{\lambda}^{d_{n}}$ is the fixed point of $L_{d_{n}}$, and therefore:

\begin{eqnarray*}
L_{d_{n}}v_{\lambda}^{d_{n}} & = & v_{n+1}
\end{eqnarray*}


And we get $u_{n+1} \leq v_{n+1}$. In Theorem [*] it was proven that $v_{n+1} \leq v^*_\lambda$, and therefore $u_{n+1} \leq v_{n+1} \leq v^*_\lambda$.$\Box$
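
For reference, the induction step chains the displayed relations into a single sequence of (in)equalities:

\begin{eqnarray*}
u_{n+1} \;=\; L_{p_{n}}u_{n} \;\leq\; L_{p_{n}}v_{n} \;\leq\; Lv_{n} \;=\; L_{d_{n}}v_{n} \;\leq\; L_{d_{n}}v_{\lambda}^{d_{n}} \;=\; v_{n+1} \;\leq\; v^*_\lambda
\end{eqnarray*}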

From Theorem [*] it follows that, assuming the same starting point, the PI algorithm requires no more stages than the VI algorithm to converge to the optimal policy. Note, however, that each single stage of PI requires solving a system of linear equations (the policy evaluation stage), and is therefore computationally more expensive than a single stage of the VI algorithm.
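
The theorem can be checked numerically. The sketch below runs VI and PI side by side on a hypothetical two-state, two-action MDP (the transition and reward numbers are illustrative, not from the text), starting both from the value of the same initial policy, and verifies $u_n \leq v_n$ at every iteration:

```python
import numpy as np

# Hypothetical 2-state, 2-action discounted MDP (illustrative numbers).
n_states, lam = 2, 0.9
r = np.array([[1.0, 0.0],                    # r[s, a]
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.2, 0.8]],      # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])

def L(u):
    # Bellman optimality operator: (Lu)(s) = max_a [ r(s,a) + lam * sum_s' P(s'|s,a) u(s') ]
    return (r + lam * P @ u).max(axis=1)

def evaluate(policy):
    # Exact policy evaluation: solve (I - lam * P_d) v = r_d  (the linear system
    # that makes each PI stage more expensive than a VI stage).
    P_d = P[np.arange(n_states), policy]
    r_d = r[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - lam * P_d, r_d)

# u0 = v0: both sequences start from the value of the same initial policy.
policy = np.zeros(n_states, dtype=int)
u = v = evaluate(policy)

for n in range(10):
    u = L(u)                                   # VI step: u_{n+1} = L u_n
    policy = (r + lam * P @ v).argmax(axis=1)  # PI improvement: d_n greedy w.r.t. v_n
    v = evaluate(policy)                       # PI evaluation: v_{n+1} = v_lambda^{d_n}
    assert np.all(u <= v + 1e-9)               # u_n <= v_n, as the theorem states
```

With only four deterministic policies available, PI reaches the optimal policy within a few iterations, while the VI iterates approach $v^*_\lambda$ from below.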


Yishay Mansour
1999-12-18