Temporal Difference and TD(0) algorithm

We can write down an algorithm according to the operator defined earlier:

$\ V_{n+1}(s) \leftarrow (1-\alpha_{n}) V_{n}(s)+
\alpha_{n}[r_{\pi}(s)+ \lambda E_{s'}[V_{n}(s')]] $
Instead of computing the expectation $\ E_{s'} $ over the next state, we sample the next state $\ s'$ and derive

$\ V_{n+1}(s) \leftarrow (1-\alpha_{n}) V_{n}(s)+
\alpha_{n}[r_{\pi}(s)+ \lambda V_{n}(s')] $ or equivalently,
$\ V_{n+1}(s) \leftarrow V_{n}(s)+ \alpha_{n}[r_{\pi}(s)+ \lambda
V_{n}(s') - V_{n}(s)] $
We call $\ r_{\pi}(s)+ \lambda V_{n}(s') - V_{n}(s)$ the temporal difference (TD), because it is the difference between the return observed in the current step, $\ r_{\pi}(s)+ \lambda
V_{n}(s') $, and the current estimate $\ V_{n}(s) $.
This algorithm is called TD(0).
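As a concrete illustration, the following is a minimal tabular TD(0) sketch in Python. The environment interface (a reward(s) function returning $\ r_{\pi}(s) $ and a sample_next_state(s) function drawing $\ s' $ under the fixed policy) and the small two-state chain at the bottom are hypothetical, invented only to make the update rule above executable.

    import random

    def td0(states, reward, sample_next_state, lam=0.9, num_steps=200_000):
        """Tabular TD(0): V(s) <- V(s) + alpha * (r_pi(s) + lam * V(s') - V(s))."""
        V = {s: 0.0 for s in states}
        visits = {s: 0 for s in states}
        s = random.choice(states)
        for _ in range(num_steps):
            visits[s] += 1
            alpha = 1.0 / visits[s]                   # per-state 1/n step size: sum = inf, sum of squares < inf
            s_next = sample_next_state(s)             # sample s' instead of computing E_{s'}[V(s')]
            td = reward(s) + lam * V[s_next] - V[s]   # the temporal difference
            V[s] += alpha * td
            s = s_next
        return V

    # Hypothetical two-state chain used only to exercise the sketch.
    P = {0: [(0, 0.2), (1, 0.8)], 1: [(0, 0.6), (1, 0.4)]}   # P(s' | s) under the fixed policy
    r = {0: 1.0, 1: 0.0}                                      # r_pi(s)

    def sample_next_state(s):
        next_states, probs = zip(*P[s])
        return random.choices(next_states, probs)[0]

    print(td0([0, 1], lambda s: r[s], sample_next_state))

The 1/n step size per state satisfies condition 2 of the theorem below, provided every state is visited infinitely often.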
It is easy to see that $\ E_{s'}[r_{\pi}(s)+ \lambda V_{n}(s')] = r_{\pi}(s)+ \lambda E_{s'}[V_{n}(s')] $, i.e., the sampled target is an unbiased estimate of the exact backup, so the noise introduced by sampling has zero conditional mean; in particular, once $\ V_{n} = V^{\pi} $ we also have $\ E(TD) = 0 $.
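As a quick numerical illustration (a sketch using the same hypothetical two-state chain as above and an arbitrary estimate $\ V_{n} $), the sampled target averages out to the exact backup:

    import random

    lam = 0.9
    P = {0: [(0, 0.2), (1, 0.8)], 1: [(0, 0.6), (1, 0.4)]}   # hypothetical P(s' | s) under the policy
    r = {0: 1.0, 1: 0.0}                                      # r_pi(s)
    V = {0: 3.0, 1: 1.5}                                      # arbitrary current estimate V_n
    s = 0

    exact = r[s] + lam * sum(p * V[ns] for ns, p in P[s])    # r_pi(s) + lam * E_{s'}[V_n(s')]
    next_states, probs = zip(*P[s])
    sampled = [r[s] + lam * V[random.choices(next_states, probs)[0]]
               for _ in range(100_000)]
    print(exact, sum(sampled) / len(sampled))                # the two values should nearly agree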
The following theorem is used to show convergence of the algorithm:

Theorem 8.1   Consider the iterative algorithm $\ r_{t+1}(i) \leftarrow
(1-\alpha_{t}(i)) r_{t}(i)+ \alpha_{t}(i)[(Hr_{t})(i)+ w_{t}(i)]
$ such that
1) $\ E(w_{t}(i)\vert F_{t}) = 0 $ and $\
E(w^2_{t}(i)\vert F_{t}) \leq A + B\vert\vert r_{t}\vert\vert^2 $ for some constants A and B;
2) the step sizes $\ \alpha_{t}(i) $ satisfy $\ \sum_{t}\alpha_{t}(i) = \infty $ and $\ \sum_{t}\alpha^2_{t}(i) < \infty; $
3) H is a contraction mapping.
Then $\ r_{t} $ converges to $\ r^{*} $ with probability one, where $\ r^{*}
= H r^{*} $.

Recall that convergence with probability one means that, with probability one, the sequence $\ A_{n} $ satisfies $\lim_{N\rightarrow\infty}\sup_{n
\geq N}\vert A_{n} - A\vert = 0 $.
The above theorem implies that TD(0) converges with probability one to the correct value function $\ V^{\pi} $ (the same limit as the MC algorithm).


Yishay Mansour
2000-01-06