The Bellman expectation equation for MDPs provides a way to calculate the policy-indexed state-value function $v_\pi(s)$. We can compute $v_\pi(s)$ by doing a one-step lookahead on the action values $q_\pi(s, a)$ of all possible actions $a$, weighted by the probability $\pi(a \mid s)$ that the policy selects each action at the current state. The weighted sum obtained summarizes how advantageous it is to be in the state: if all actions have high action values at this state, then the state itself must also be advantageous to be in. We can then expand the definition of $q_\pi(s, a)$ to define $v_\pi(s)$ recursively.
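As a sketch in standard MDP notation (assuming transition probabilities $\mathcal{P}^a_{ss'}$, expected rewards $\mathcal{R}^a_s$, and discount factor $\gamma$, none of which are named explicitly above), the one-step lookahead and its recursive expansion are:

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)
$$

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s') \right)
$$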

We can also compute the policy-indexed action-value function $q_\pi(s, a)$ by doing a one-step lookahead, this time using the expected immediate reward plus the discounted state values of all possible successor states. Expanding the definition of $v_\pi(s')$ produces a recursively defined $q_\pi(s, a)$.
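Under the same notational assumptions, the corresponding equations are:

$$
q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s')
$$

$$
q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')
$$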

If we decide to calculate $v_\pi$ exactly, we can write the value function in matrix form, with $\mathcal{R}^\pi$ and $\mathcal{P}^\pi$ denoting the reward vector and transition matrix averaged over the policy's action choices, and turn it into a closed-form solution:

$$
v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi
\quad\Longrightarrow\quad
v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi
$$
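As a concrete illustration (a minimal sketch, assuming a hypothetical 2-state MDP whose dynamics have already been averaged over the policy; the names `P_pi`, `R_pi`, and `evaluate_policy` are placeholders, not from the original), the closed form amounts to solving one linear system:

```python
import numpy as np

def evaluate_policy(P_pi: np.ndarray, R_pi: np.ndarray, gamma: float) -> np.ndarray:
    """Solve v = R_pi + gamma * P_pi @ v in closed form for the state values."""
    n = P_pi.shape[0]
    # Rearranged as (I - gamma * P_pi) v = R_pi and solved directly.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Hypothetical 2-state MDP, already marginalized over the policy's action choices.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R_pi = np.array([1.0, -1.0])
print(evaluate_policy(P_pi, R_pi, gamma=0.9))  # state values under the fixed policy
```

Using `np.linalg.solve` rather than forming the explicit inverse is the usual numerical choice; the result is the same $v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$ given above.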