TD-Gammon

Let V*(s,0) be the probability that white wins from state s when it is white's turn to move, and let V*(s,1) be the probability that white wins from state s when it is black's turn (in both cases assuming white and black play optimally). We estimate V*(s,l) using a neural network which computes $\hat{V}(s,l,r)$, where r is the network's parameter (weight) vector.
The Neural Network: initialized with (small) random weights.
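The notes do not fix the architecture; as a concrete illustration, the following Python/numpy sketch shows a one-hidden-layer sigmoid network for $\hat{V}$ in the spirit of TD-Gammon. The class name ValueNet, the layer sizes, and the assumption that $(s,l)$ has already been encoded as a numeric feature vector are illustrative choices, not part of the notes.

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ValueNet:
    """Sketch of the estimator V-hat(s, l, r): one hidden sigmoid layer whose
    scalar output estimates the probability that white wins.  The parameter
    vector r corresponds to the weight matrices (W1, W2); biases are omitted
    to keep the sketch short."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # "Network initialization: (small) random weights."
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_inputs))
        self.W2 = 0.1 * rng.standard_normal((1, n_hidden))

    def value(self, x):
        """x is a numeric feature encoding of (s, l); returns V-hat in (0, 1)."""
        h = sigmoid(self.W1 @ x)
        return float(sigmoid(self.W2 @ h)[0])
\end{verbatim}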
Training Method: The program plays both sides. At each point in time we have a state st, a parameter vector rt, and a turn lt. For every state s' reachable from st (given the dice roll), we compute $\hat{V}(s',l_{t},r_{t})$ and move to the best state: on white's turn the state with the maximal value, on black's turn the state with the minimal value.
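As an illustration of this greedy choice, here is a sketch that assumes the states s' reachable with the current dice have already been enumerated and encoded as feature vectors (that enumeration is backgammon-specific and not shown):

\begin{verbatim}
def choose_move(net, successor_features, turn):
    """Greedy move selection used during self-play.  `successor_features` is a
    list of feature encodings, one per state s' reachable from s_t with the
    current dice (enumeration/encoding not shown).  White (turn == 0) takes the
    successor with the maximal V-hat, black (turn == 1) the minimal one."""
    values = [net.value(x) for x in successor_features]
    pick = max if turn == 0 else min
    best_index = pick(range(len(values)), key=lambda i: values[i])
    return best_index, values[best_index]
\end{verbatim}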
Updating Parameters: At the end of each turn, we compute:

\begin{displaymath}
d_{t}=\hat{V}(s_{t+1},l_{t+1},r_{t}) - \hat{V}(s_{t},l_{t},r_{t})\ ,
\end{displaymath}

which is the temporal difference (TD). (At the final state, $\hat{V}(s_{t+1},l_{t+1},r_{t})$ is replaced by the actual game outcome.) In addition, we update the parameter vector $\overrightarrow{r_{t}}$:

\begin{displaymath}
\overrightarrow{r_{t+1}}\leftarrow
\overrightarrow{r_{t}}+\alpha\, d_{t}\underbrace{\sum_{k=1}^{t}\gamma^{t-k}\,\nabla_{r_{k}}
\hat{V}(s_{k},l_{k},r_{k})}_{\overrightarrow{e_{t}}}\ .
\end{displaymath}
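The underbraced sum is an eligibility trace; it can be maintained incrementally, since $\overrightarrow{e_{t}}=\gamma\,\overrightarrow{e_{t-1}}+\nabla_{r_{t}}\hat{V}(s_{t},l_{t},r_{t})$. Below is a minimal sketch of one update step, treating the parameters and the gradient as flat numpy arrays (computing the gradient itself is the subject of the next section):

\begin{verbatim}
import numpy as np

def td_step(r, grad, d_t, e, alpha, gamma):
    """One TD parameter update.  `r` is the (flattened) parameter vector,
    `grad` the gradient of V-hat at (s_t, l_t, r_t), and `e` the eligibility
    trace e_{t-1}; returns (r_{t+1}, e_t).  At the start of a game the trace
    would be initialized to np.zeros_like(r)."""
    e = gamma * e + grad        # e_t = sum_k gamma^(t-k) * grad_k, kept incrementally
    r = r + alpha * d_t * e     # r_{t+1} <- r_t + alpha * d_t * e_t
    return r, e
\end{verbatim}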

At the end of each game a new game is started, with $\overrightarrow{r_{0}}$ set to the final parameter vector of the previous game.
1. $\alpha$ is set to a constant (determined by experiments).
2. $\gamma$ does not affect the results significantly in this case.
At the end of the training phase, we get a function $\hat{V}(s,l,r)$ (with r fixed) which we can use to play backgammon.
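As a quick check that the sketches above fit together, one can run the move selection on random stand-in feature vectors; the sizes below are illustrative and no real board encoding is implied:

\begin{verbatim}
import numpy as np

if __name__ == "__main__":
    net = ValueNet(n_inputs=198, n_hidden=40)        # sizes chosen for illustration
    rng = np.random.default_rng(1)
    successor_features = [rng.random(198) for _ in range(5)]  # stand-ins for encoded states s'
    idx, v = choose_move(net, successor_features, turn=0)
    print(f"white would move to successor {idx}; estimated win probability {v:.3f}")
\end{verbatim}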
Improvements:
Comments:
Yishay Mansour
2000-01-17