A Markov reward process is defined by the tuple $(\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma)$, where:
- $\mathcal{S}$ is the state space
- $\mathcal{P}$ is the state transition probability matrix, with $\mathcal{P}_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s]$
- $\mathcal{R}$ is the reward function, where $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$.
- $\gamma \in [0, 1]$ is the discount factor for future rewards (a small code sketch of these components follows the list).
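
As a concrete illustration, the four components can be written down as plain arrays. This is a minimal sketch; the states, transition probabilities, and rewards below are made-up values, not taken from any particular example:

```python
import numpy as np

# States of an illustrative MRP (assumed for this sketch).
states = ["A", "B", "C"]

# Transition probability matrix P: P[s, s'] = Pr[S_{t+1} = s' | S_t = s].
# Each row sums to 1; state C is absorbing.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
])

# Expected immediate reward R_s for each state.
R = np.array([-1.0, -2.0, 0.0])

# Discount factor for future rewards.
gamma = 0.9
```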
MRP vs Markov chain
The only difference between an MRP and a Markov chain is that a reward is now associated with each state. There are still no actions to choose: the process simply accumulates reward as it transitions from state to state, receiving $R_{t+1}$ for the state occupied at time $t$.
The return for an MRP is the total discounted reward from time-step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
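
For a finite sampled trajectory of rewards, the return can be computed directly. A small sketch, where the reward sequence is just an illustrative example:

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t for a finite list of sampled rewards
    R_{t+1}, R_{t+2}, ... observed along one trajectory."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: rewards -1, -2, 0 observed after time t, with gamma = 0.9
print(discounted_return([-1.0, -2.0, 0.0], 0.9))  # -2.8
```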
The state-value function $v(s)$ for an MRP is defined as the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$
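
Since $v(s)$ is an expectation over returns, it can be estimated by averaging sampled returns. A Monte Carlo sketch, assuming the `P`, `R`, and `gamma` arrays from the earlier snippet:

```python
import numpy as np

def mc_value_estimate(P, R, gamma, s0, n_episodes=1000, max_steps=50, seed=0):
    """Monte Carlo estimate of v(s0) = E[G_t | S_t = s0]:
    average the discounted return over sampled trajectories."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(max_steps):
            g += discount * R[s]            # reward R_{t+1} received for leaving state s
            discount *= gamma
            s = rng.choice(len(R), p=P[s])  # sample the next state from row s of P
        total += g
    return total / n_episodes

# e.g. estimate the value of state "A" (index 0) in the illustrative MRP:
# print(mc_value_estimate(P, R, gamma, s0=0))
```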
Example: A student's MRP
Consider a student MRP with the value function $v(s)$ shown at each state.
Because an MRP has no actions, there is nothing to maximise over, and the Bellman equation simply says that the value of a state is its immediate reward plus the discounted value of its successor states:

$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$$
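
In matrix form this reads $v = \mathcal{R} + \gamma \mathcal{P} v$, so for a small MRP the values can be obtained exactly by solving a linear system. A sketch, reusing the illustrative `P`, `R`, and `gamma` from above:

```python
import numpy as np

def solve_mrp(P, R, gamma):
    """Exact state values from the Bellman equation in matrix form:
    v = R + gamma * P @ v  =>  (I - gamma * P) v = R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# v = solve_mrp(P, R, gamma)
# print(dict(zip(states, v)))  # value of each state in the illustrative MRP
```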