Another way to look at the MC algorithm

In Lecture 7 we discussed the Monte-Carlo (MC) method for evaluating the reward of a policy.
This method performs a number of experiments and uses their average to estimate the policy's reward.
Another way to express the evaluation is the following:
$V_{n+1}(s) = V_{n}(s) + \alpha(R_{n}(s) - V_{n}(s))$
where $R_{n}(s)$ is the total reward of the $n$-th run, starting from the first visit to $s$ (given that $s$ was visited in that run).
Note that $E[R_{n}(s)] = V^{\pi}(s)$.
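
As a concrete illustration, the following minimal Python sketch performs this incremental first-visit update. The episode format (a list of (state, reward) pairs), the constant step size alpha, and the discount factor lam are assumptions made for the example and are not part of the notes.

def mc_update(V, episode, alpha=0.1, lam=1.0):
    """One incremental first-visit MC update: V(s) <- V(s) + alpha * (R_n(s) - V(s))."""
    # Total (discounted) reward from each time step to the end of the episode.
    G, returns = 0.0, [0.0] * len(episode)
    for t in range(len(episode) - 1, -1, -1):
        _, r = episode[t]
        G = r + lam * G
        returns[t] = G
    seen = set()
    for t, (s, _) in enumerate(episode):
        if s in seen:                  # first-visit: ignore later visits to s
            continue
        seen.add(s)
        V[s] = V[s] + alpha * (returns[t] - V[s])   # returns[t] plays the role of R_n(s)
    return V

# Example: V = {'s0': 0.0, 's1': 0.0}; mc_update(V, [('s0', 1.0), ('s1', 2.0)])
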
We can rewrite this formula as follows:
$V_{n+1}(s) = (1-\alpha)V_{n}(s) + \alpha[V^{\pi}(s) + (R_{n}(s) - V^{\pi}(s))]$
where $HV_{n} = V^{\pi}$ for some (in general nonlinear) operator $H$, and $W_{n} = R_{n} - V^{\pi}$ is "noise" with $E[W_{n}] = 0$.
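To make the rewriting explicit, here is a short derivation (added for clarity, using only the definitions above) that connects the two forms of the update:
$V_{n+1}(s) = V_{n}(s) + \alpha(R_{n}(s) - V_{n}(s)) = (1-\alpha)V_{n}(s) + \alpha R_{n}(s) = (1-\alpha)V_{n}(s) + \alpha[(HV_{n})(s) + W_{n}(s)]$,
since $(HV_{n})(s) = V^{\pi}(s)$ and $W_{n}(s) = R_{n}(s) - V^{\pi}(s)$.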
Recall the operator $L_{\pi}\vec{V} = \vec{R}_{\pi} + \lambda P_{\pi}\vec{V}$, which was introduced to compute the return of a policy. We have already shown that $L_{\pi}$ is a contraction operator.
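
As a small numerical illustration (not from the notes: the 3-state chain, the rewards, and $\lambda = 0.9$ below are arbitrary assumptions), repeatedly applying $L_{\pi}$ converges to its fixed point $V^{\pi} = (I - \lambda P_{\pi})^{-1}\vec{R}_{\pi}$, as the contraction property guarantees:

import numpy as np

R_pi = np.array([1.0, 0.0, 2.0])      # expected one-step rewards under pi
P_pi = np.array([[0.5, 0.5, 0.0],     # transition matrix under pi
                 [0.1, 0.8, 0.1],
                 [0.3, 0.0, 0.7]])
lam = 0.9                             # discount factor

def L(V):
    """Apply L_pi to a value vector V."""
    return R_pi + lam * P_pi @ V

V = np.zeros(3)
for _ in range(200):
    V = L(V)

V_exact = np.linalg.solve(np.eye(3) - lam * P_pi, R_pi)  # fixed point of L_pi
print(np.max(np.abs(V - V_exact)))    # essentially 0: the iterates converge

Each application of $L_{\pi}$ shrinks the max-norm distance to the fixed point by a factor of at most $\lambda$.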


Yishay Mansour
2000-01-06