The Bellman expectation equation for MDPs provides a way to calculate the policy-indexed state-value function $v_\pi(s)$. We can compute $v_\pi(s)$ by doing a one-step lookahead over all possible actions, weighting each action value $q_\pi(s, a)$ by the probability $\pi(a \mid s)$ that the policy takes that action in the current state. The resulting sum summarizes how advantageous it is to be in that state: if all actions have high action values at this state, then the state itself must also be advantageous to be in. We can then expand the definition of $q_\pi(s, a)$ to define $v_\pi(s)$ recursively.
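Written out explicitly (using $\mathcal{P}^a_{ss'}$ for transition probabilities and $\mathcal{R}^a_s$ for expected immediate rewards, a notational convention assumed here), the one-step lookahead and its recursive expansion are:

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)
$$

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s') \right)
$$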
We can also compute the policy-indexed action-value function $q_\pi(s, a)$ by doing a one-step lookahead, this time summing the immediate reward and the discounted state values $v_\pi(s')$ of all possible successor states, weighted by their transition probabilities. Expanding the definition of $v_\pi(s')$ produces a recursively defined $q_\pi(s, a)$.
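In the same notation, the lookahead and its recursive expansion for the action-value function are:

$$
q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s')
$$

$$
q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')
$$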
Say we decide to calculate $v_\pi$ directly: we can write the Bellman expectation equation in matrix form and turn it into a closed-form solution:
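$$
v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi
\quad\Longrightarrow\quad
v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi
$$

where $\mathcal{R}^\pi$ and $\mathcal{P}^\pi$ are the reward vector and transition matrix induced by following $\pi$. As a quick illustration, the sketch below solves this linear system for a small MDP with NumPy; the arrays `P`, `R`, and `policy` are made-up placeholders rather than values from the text.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9):
    """Closed-form policy evaluation: v_pi = (I - gamma * P_pi)^(-1) R_pi.

    P      : (A, S, S) array, P[a, s, s2]  = probability of s -> s2 under action a
    R      : (A, S) array,    R[a, s]      = expected immediate reward for (s, a)
    policy : (S, A) array,    policy[s, a] = pi(a | s)
    """
    n_states = P.shape[1]
    # Transition matrix and reward vector induced by following the policy.
    P_pi = np.einsum('sa,ast->st', policy, P)  # P_pi[s, s2] = sum_a pi(a|s) P[a, s, s2]
    R_pi = np.einsum('sa,as->s', policy, R)    # R_pi[s]     = sum_a pi(a|s) R[a, s]
    # Solve (I - gamma * P_pi) v = R_pi rather than inverting the matrix explicitly.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Hypothetical 2-state, 2-action MDP and a uniform random policy.
P = np.array([[[0.8, 0.2],   # action 0 transitions
               [0.1, 0.9]],
              [[0.5, 0.5],   # action 1 transitions
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],    # rewards for action 0 in states 0, 1
              [0.5, 2.0]])   # rewards for action 1 in states 0, 1
policy = np.full((2, 2), 0.5)

print(evaluate_policy(P, R, policy, gamma=0.9))  # state values v_pi(s)
```

Solving the linear system directly costs $O(|\mathcal{S}|^3)$, so this closed form is only practical for small state spaces; larger problems fall back on iterative methods.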