regarding Q at the moment. Thus we get Sarsa.
- If we instead bootstrap with the maximizing action, treating Q as if it were already the optimal Q = Q*, then we get Q-Learning.
- Q-Learning is an off-policy algorithm, since the policy being learned need not be the policy that
performs the actions. In general, an off-policy algorithm does not control the actions taken while collecting data.
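The difference between the two update rules can be sketched in tabular form. This is a minimal illustration, not the notes' own code; the Q table, states, and constants are hypothetical:

```python
# Hypothetical tabular setup: states/actions are small integers and
# Q is a dict mapping (state, action) -> value.
ALPHA, GAMMA = 0.1, 0.99
ACTIONS = [0, 1]

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap with the action a_next the policy actually takes next.
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: bootstrap with the greedy (maximizing) action,
    # regardless of which action the behaviour policy takes next.
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

Q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
Q[(1, 0)], Q[(1, 1)] = 0.5, 2.0
sarsa_update(Q, 0, 0, 1.0, 1, 0)    # bootstraps on Q[(1, 0)] = 0.5
q_learning_update(Q, 0, 1, 1.0, 1)  # bootstraps on max -> Q[(1, 1)] = 2.0
```

Both rules nudge Q(s, a) toward a one-step target; only the choice of bootstrap action differs.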
- For on-policy methods we can give every action a small probability (e.g. epsilon-greedy) so that we reach all states of the MDP.
With such exploration, on-policy methods can hope to achieve rewards that get closer and closer to optimal.
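The exploration scheme above can be sketched as an epsilon-greedy action selector; this is an illustrative sketch (the Q table and state numbering are assumptions, not from the notes):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    # With probability eps pick a uniformly random action, otherwise the
    # greedy one. Every action keeps probability >= eps / len(actions),
    # so all reachable states of the MDP are eventually visited.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Toy Q table for a single state 0 with three actions; action 1 is greedy.
Q = {(0, a): v for a, v in enumerate([0.0, 1.0, 0.5])}
random.seed(0)
picks = [epsilon_greedy(Q, 0, [0, 1, 2], eps=0.2) for _ in range(1000)]
```

Over many draws, the greedy action dominates but the other actions still occur, which is what guarantees continued exploration.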