Convergence of Policy Iteration Algorithm

We shall see that when the number of states |S| and the number of actions |A| are finite, no policy can appear in two different iterations unless it is an optimal policy, at which point the algorithm terminates. Therefore $d_n$ converges to an optimal policy in a finite number of steps. The key to the convergence of $d_n$ is the monotonicity of $\{v_n\}$.

Claim 6.4   Let $v_n$, $v_{n+1}$ be the values of two consecutive iterations of the above algorithm. Then $v_n \leq v_{n+1} \leq v_{\lambda}^*$.

Proof: Let $d_{n+1}$ be the policy chosen in the policy improvement step. Then

\begin{eqnarray*}r_{d_{n+1}} + \lambda{P_{d_{n+1}}}{v_n} &\geq& r_{d_n} + \lambda{P_{d_n}}{v_n}\ \ \ \ \ \ (by\ the\ choice\ of\ d_{n+1})\\
&=& v_n\ \ \ \ \ \ (by\ definition\ of\ v_n)\\
\end{eqnarray*}


Therefore $r_{d_{n+1}} \geq (I - \lambda{P_{d_{n+1}}}){v_n}$. Multiplying both sides by $(I - \lambda{P_{d_{n+1}}})^{-1}$, which is entrywise nonnegative and therefore preserves the inequality, we get:

\begin{eqnarray*}v_{n+1} = (I - \lambda{P_{d_{n+1}}})^{-1}r_{d_{n+1}} \geq v_n
\end{eqnarray*}

For the upper bound, $v_{n+1}$ is the return of the stationary policy $d_{n+1}$, and the return of any policy is at most the optimal return, so $v_{n+1} \leq v_{\lambda}^*$.
$\Box$
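
Two steps in the proof are worth spelling out (a brief elaboration; the argmax form of the improvement step below is the one assumed from the algorithm of the previous section). The first inequality holds because $d_{n+1}$ is chosen greedily with respect to $v_n$, and multiplying by $(I - \lambda{P_{d_{n+1}}})^{-1}$ preserves the inequality because this matrix is entrywise nonnegative:

\begin{eqnarray*}d_{n+1}(s) &\in& \arg\max_{a \in A}\Big\{r(s,a) + \lambda\sum_{s' \in S}P(s'\vert s,a)v_n(s')\Big\}\\
(I - \lambda{P_{d_{n+1}}})^{-1} &=& \sum_{k=0}^{\infty}\lambda^k{P_{d_{n+1}}^k}\ \geq\ 0\ \ \ \ \ \ (entrywise,\ since\ 0 \leq \lambda < 1)
\end{eqnarray*}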

Note: The above claim remains valid even when S or A is infinite.

Theorem 6.5   Let S and A be finite. Then the Policy Iteration algorithm converges to an optimal policy after at most $\vert A\vert^{\vert S\vert}$ iterations.

Proof: Clearly, $\vert A\vert^{\vert S\vert} \geq \vert\Pi^{MD}\vert$, since a Markov deterministic policy selects one of at most $\vert A\vert$ actions in each of the $\vert S\vert$ states. By Claim 6.4, $v_{n+1} \geq v_n$ in every iteration, and the increase is strict in at least one state in every iteration except the last: if $v_{n+1} = v_n$, then $v_n$ satisfies the optimality equation and the algorithm terminates with an optimal policy. Since each policy has a unique value, no policy $\pi \in \Pi^{MD}$ can appear in two different iterations. Hence the number of iterations is at most $\vert\Pi^{MD}\vert \leq \vert A\vert^{\vert S\vert}$. $\Box$
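
To make the bound concrete, here is a minimal runnable sketch of policy iteration on a small randomly generated MDP (the instance, seed, and all variable names are illustrative assumptions, not from the notes). It asserts the monotonicity of Claim 6.4 at every iteration and reports the iteration count against the $\vert A\vert^{\vert S\vert}$ bound of Theorem 6.5:

\begin{verbatim}
import numpy as np

# Hypothetical MDP instance: |S| states, |A| actions, discount lambda.
np.random.seed(0)
S, A, lam = 4, 3, 0.9

P = np.random.dirichlet(np.ones(S), size=(A, S))  # P[a, s, :] sums to 1
r = np.random.rand(S, A)                          # reward r[s, a]

def evaluate(d):
    """Policy evaluation: solve v_d = (I - lam * P_d)^{-1} r_d."""
    P_d = P[d, np.arange(S)]                      # row s is P[d[s], s, :]
    r_d = r[np.arange(S), d]
    return np.linalg.solve(np.eye(S) - lam * P_d, r_d)

d = np.zeros(S, dtype=int)                        # arbitrary initial policy d_0
v_prev = None
iterations = 0
while True:
    iterations += 1
    v = evaluate(d)                               # v_n
    if v_prev is not None:                        # monotonicity (Claim 6.4)
        assert np.all(v >= v_prev - 1e-10)
    v_prev = v
    # Improvement step: greedy policy with respect to v_n.
    q = r + lam * np.einsum('ast,t->sa', P, v)    # q[s, a]
    d_next = q.argmax(axis=1)
    if np.array_equal(d_next, d):                 # d_{n+1} = d_n: optimal
        break
    d = d_next

print(f"converged in {iterations} iterations (bound |A|^|S| = {A ** S})")
\end{verbatim}

In practice the loop typically stops after only a handful of iterations, far below the worst-case bound ($3^4 = 81$ in this sketch).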


  
Figure: Example Diagram

