Return Estimation of a Given Policy

Let $\pi\in\Pi^{HR}$ be a given policy. We would like to calculate its value $V^{\pi}_{N}(s)$. An inefficient way of calculating $V^{\pi}_{N}(s)$ is to enumerate all possible histories, compute the return of each one, and take the expectation, i.e., the sum of the returns weighted by the probability of each history. An alternative method is to compute the return of the last step first and then recurse backwards through the previous steps until reaching the first one. Let us define a set of variables $U^{\pi}_{t}(h_{t})$ representing the expected return from time $t$ until time $N$, where $h_t$ denotes the history (previous states and actions) up to time $t$. We can write $U^{\pi}_t$ as:

\(U^{\pi}_t(h_t) = E^{\pi}_{h_t}[\sum_{n=t}^{N-1} r_n(X_n,Y_n)+r_N(X_N)]\), where \(h_t = (h_{t-1}, a_{t-1}, s_t)\).

We can define $U^{\pi}_t$ as a function of $U^{\pi}_{t+1}$ as follows:

\(U^{\pi}_t(h_t) = E^{\pi}_{h_t}[r_t(X_t,Y_t)+U^{\pi}_{t+1}(h_t,Y_t,X_{t+1})]\)
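The recursion is anchored at the horizon by the boundary condition implied by the definition above (the sum is empty for $t=N$): \(U^{\pi}_N(h_N) = r_N(X_N)\).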

Note that $X_t$ is known from the history and $Y_t$ is known if $\pi$ is deterministic. $X_{t+1}$ is unknown for any policy, since the environment is stochastic.

The idea: calculate the values of $U^{\pi}_t$ backwards, from $t=N$ down to $t=1$, using a dynamic programming algorithm.
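To make the backward recursion concrete, here is a minimal Python sketch, under the simplifying assumption that $\pi$ is a Markov deterministic policy (so $U^{\pi}_t$ depends on the history only through the current state $s_t$); the names P, r, r_final and pi below are hypothetical and not part of the text.

# Minimal sketch: evaluate a Markov deterministic policy by backward recursion.
# P[s][a][s2]  -- transition probability from s to s2 under action a (assumed)
# r[t][s][a]   -- immediate reward r_t(s, a) for t = 1, ..., N-1 (assumed)
# r_final[s]   -- terminal reward r_N(s) (assumed)
# pi[t][s]     -- action chosen by the policy at time t in state s (assumed)
def evaluate_policy(P, r, r_final, pi, N):
    states = list(r_final.keys())
    # Boundary condition: U_N(s) = r_N(s)
    U = {s: r_final[s] for s in states}
    # Backward recursion: U_t(s) = r_t(s, pi_t(s)) + E[U_{t+1}(X_{t+1})]
    for t in range(N - 1, 0, -1):
        U = {s: r[t][s][pi[t][s]]
                + sum(P[s][pi[t][s]][s2] * U[s2] for s2 in states)
             for s in states}
    # After the loop, U[s] = U_1(s) = V^pi_N(s)
    return U

For a general history-dependent policy the same recursion applies, but the table must be indexed by entire histories $h_t$, whose number grows exponentially with $t$.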



 
Yishay Mansour
1999-11-15