
## Q-learning

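Let us consider the Value Iteration (VI) algorithm from lecture 6. It was described through the nonlinear operator $L$. In every iteration of the algorithm we apply $L$: $V_{n+1} = L V_n$, or explicitly:

$$V_{n+1}(s) = \max_{a} \Big\{ r(s,a) + \lambda \sum_{s'} P(s'|s,a)\, V_n(s') \Big\},$$

where $r(s,a)$ is the immediate reward, $\lambda$ is the discount factor, and $P(s'|s,a)$ is the transition probability.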
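The step $V_{n+1} = L V_n$ can be sketched in a few lines of Python. This is only an illustration: the two-state, two-action MDP below (`P`, `R`), the discount value `LAM = 0.9`, and the stopping tolerance are all made up for the example and are not taken from the lecture.

```python
# Hypothetical 2-state, 2-action MDP, invented for illustration only.
STATES, ACTIONS = (0, 1), (0, 1)
# P[s, a][s2] = probability of moving to s2 after taking a in s (assumed values).
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
# R[s, a] = immediate reward r(s, a) (assumed values).
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}
LAM = 0.9  # discount factor (assumed value)


def apply_L(V):
    """One application of the operator L, i.e. V_{n+1} = L V_n."""
    return [
        max(R[s, a] + LAM * sum(P[s, a][s2] * V[s2] for s2 in STATES)
            for a in ACTIONS)
        for s in STATES
    ]


def value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate V <- L V from V_0 = 0 until successive iterates differ by < tol."""
    V = [0.0 for _ in STATES]
    for _ in range(max_iter):
        V_next = apply_L(V)
        if max(abs(x - y) for x, y in zip(V_next, V)) < tol:
            return V_next
        V = V_next
    return V


V_star = value_iteration()
print(V_star)
```

The returned vector is an approximate fixed point of `apply_L`, which is the sense in which VI "converges" here.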

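Let us refine the equation somewhat. We define a new function $Q$ for VI:

$$Q_{n+1}(s,a) = r(s,a) + \lambda \sum_{s'} P(s'|s,a)\, V_n(s'),$$

with immediate reward $r(s,a)$ and discount factor $\lambda$.

Now the iterations of VI are $V_{n+1}(s) = \max_{a} Q_{n+1}(s,a)$. Expressed in terms of the $Q$ function only, we have:

$$Q_{n+1}(s,a) = r(s,a) + \lambda \sum_{s'} P(s'|s,a) \max_{a'} Q_n(s',a').$$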
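As a small check of the claim that the $Q$ form is just VI rewritten, the sketch below runs both iterations side by side and compares $V_n(s)$ with $\max_a Q_n(s,a)$. The MDP (`P`, `R`), the discount value `LAM`, and the helper names are hypothetical, chosen only for this example.

```python
# Hypothetical 2-state, 2-action MDP, invented for illustration only.
STATES, ACTIONS = (0, 1), (0, 1)
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}
LAM = 0.9  # discount factor (assumed value)


def v_step(V):
    """V_{n+1}(s) = max_a { r(s,a) + LAM * sum_{s'} P(s'|s,a) V_n(s') }."""
    return {
        s: max(R[s, a] + LAM * sum(P[s, a][t] * V[t] for t in STATES)
               for a in ACTIONS)
        for s in STATES
    }


def q_step(Q):
    """Q_{n+1}(s,a) = r(s,a) + LAM * sum_{s'} P(s'|s,a) max_{a'} Q_n(s',a')."""
    return {
        (s, a): R[s, a] + LAM * sum(
            P[s, a][t] * max(Q[t, b] for b in ACTIONS) for t in STATES)
        for s in STATES for a in ACTIONS
    }


def greedy_values(Q):
    """Recover V(s) = max_a Q(s,a) from a Q table."""
    return {s: max(Q[s, a] for a in ACTIONS) for s in STATES}


# Start both iterations from zero, so that V_0(s) = max_a Q_0(s,a).
V = {s: 0.0 for s in STATES}
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(50):
    V = v_step(V)
    Q = q_step(Q)

# Largest difference between V_n and max_a Q_n after 50 iterations.
max_gap = max(abs(V[s] - greedy_values(Q)[s]) for s in STATES)
print(max_gap)
```

Both iterations need the transition probabilities `P`; the `Q` table has one entry per state-action pair rather than one per state.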

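We write the iteration with a step size $\alpha_n(s,a)$:

$$Q_{n+1}(s,a) = (1-\alpha_n(s,a))\, Q_n(s,a) + \alpha_n(s,a) \Big[ r(s,a) + \lambda \sum_{s'} P(s'|s,a) \max_{a'} Q_n(s',a') \Big].$$

(In lecture 7 we learned that it converges to the right value.)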

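Until now the iterations are equivalent to VI. Instead of taking the expectation of the value of the next step, we take a sample of the next step. Assume that we are in state $s$ and take action $a$; the next state $s'$ is distributed according to $P(s'|s,a)$. Finally we get

$$Q_{n+1}(s,a) = (1-\alpha_n(s,a))\, Q_n(s,a) + \alpha_n(s,a) \Big[ r(s,a) + \lambda \max_{a'} Q_n(s',a') \Big],$$

where $s'$ is the sampled next state, $\alpha_n(s,a)$ is the step size, and $\lambda$ is the discount factor.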
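The sampled update can be sketched as a tabular loop. Everything concrete here is an assumption made for the example rather than something the lecture specifies: the toy MDP (`P`, `R`), the discount value `LAM`, the uniformly random choice of actions, and the step size $\alpha_n(s,a) = 1/(\text{number of visits to } (s,a))$. The learner never reads `P` directly; it only sees next states drawn from it.

```python
import random

# Hypothetical 2-state, 2-action MDP, invented for illustration only.
STATES, ACTIONS = (0, 1), (0, 1)
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}
LAM = 0.9  # discount factor (assumed value)


def q_update(Q, s, a, r, s_next, alpha):
    """Apply one sampled update to Q[s, a] in place and return the new value."""
    target = r + LAM * max(Q[s_next, b] for b in ACTIONS)
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q[s, a]


def q_learning(num_steps=20_000, seed=0):
    """Run the sampled update along one trajectory of the toy MDP."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    visits = {(s, a): 0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(num_steps):
        a = rng.choice(ACTIONS)  # uniform exploration (an assumption)
        # The environment draws s' ~ P(.|s, a); the learner only sees the sample.
        s_next = rng.choices(STATES, weights=P[s, a])[0]
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]  # assumed step-size schedule
        q_update(Q, s, a, R[s, a], s_next, alpha)
        s = s_next
    return Q


Q = q_learning()
print(Q)
```

Because each update is a convex combination of the old value and the sampled target, the entries of `Q` stay between 0 and the largest reward divided by $1-\lambda$ for this toy problem.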

Yishay Mansour
2000-01-07