Another way to look at the MC algorithm

In Lecture 7 we discussed the Monte-Carlo (MC) method for evaluating the reward of a policy.
This method performs a number of experiments and uses their average to estimate the policy's reward.
Another way to express the evaluation is the following:
$V_{n+1}(s) = V_{n}(s) + \alpha(R_{n}(s) - V_{n}(s))$
where $R_{n}(s)$ is the total reward of the $n$-th run, starting from the first visit to $s$ (given that $s$ was visited in that run).
Note that $E[R_{n}(s)] = V^{\pi}(s)$.
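
As a concrete illustration, the following minimal Python sketch performs this incremental first-visit update. The episode format (a list of (state, reward) pairs), the constant step size alpha, and the discount factor lam are assumptions made for the example and are not part of the notes.

def mc_update(V, episode, alpha=0.1, lam=1.0):
    """One incremental first-visit MC update: V(s) <- V(s) + alpha * (R_n(s) - V(s))."""
    # Total (discounted) reward from each time step to the end of the episode.
    G, returns = 0.0, [0.0] * len(episode)
    for t in range(len(episode) - 1, -1, -1):
        _, r = episode[t]
        G = r + lam * G
        returns[t] = G
    seen = set()
    for t, (s, _) in enumerate(episode):
        if s in seen:                  # first-visit: ignore later visits to s
            continue
        seen.add(s)
        V[s] = V[s] + alpha * (returns[t] - V[s])   # returns[t] plays the role of R_n(s)
    return V

# Example: V = {'s0': 0.0, 's1': 0.0}; mc_update(V, [('s0', 1.0), ('s1', 2.0)])
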
We can rewrite this formula as follows:
$V_{n+1}(s) = (1-\alpha)V_{n}(s) + \alpha[V^{\pi}(s) + (R_{n}(s) - V^{\pi}(s))]$
where $HV_{n} = V^{\pi}$ for some (in general nonlinear) operator $H$, and $W_{n} = R_{n} - V^{\pi}$ is "noise" with $E[W_{n}] = 0$.
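To make the rewriting explicit, here is a short derivation (added for clarity, using only the definitions above) that connects the two forms of the update:
$V_{n+1}(s) = V_{n}(s) + \alpha(R_{n}(s) - V_{n}(s)) = (1-\alpha)V_{n}(s) + \alpha R_{n}(s) = (1-\alpha)V_{n}(s) + \alpha[(HV_{n})(s) + W_{n}(s)]$,
since $(HV_{n})(s) = V^{\pi}(s)$ and $W_{n}(s) = R_{n}(s) - V^{\pi}(s)$.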
Recall the operator $L_{\pi}\vec{V} = \vec{R}_{\pi} + \lambda P_{\pi}\vec{V}$, which was introduced to compute the return of a policy. We have already shown that $L_{\pi}$ is a contraction operator.
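
As a small numerical illustration (not from the notes: the 3-state chain, the rewards, and $\lambda = 0.9$ below are arbitrary assumptions), repeatedly applying $L_{\pi}$ converges to its fixed point $V^{\pi} = (I - \lambda P_{\pi})^{-1}\vec{R}_{\pi}$, as the contraction property guarantees:

import numpy as np

R_pi = np.array([1.0, 0.0, 2.0])      # expected one-step rewards under pi
P_pi = np.array([[0.5, 0.5, 0.0],     # transition matrix under pi
                 [0.1, 0.8, 0.1],
                 [0.3, 0.0, 0.7]])
lam = 0.9                             # discount factor

def L(V):
    """Apply L_pi to a value vector V."""
    return R_pi + lam * P_pi @ V

V = np.zeros(3)
for _ in range(200):
    V = L(V)

V_exact = np.linalg.solve(np.eye(3) - lam * P_pi, R_pi)  # fixed point of L_pi
print(np.max(np.abs(V - V_exact)))    # essentially 0: the iterates converge

Each application of $L_{\pi}$ shrinks the max-norm distance to the fixed point by a factor of at most $\lambda$.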


Yishay Mansour
2000-01-06