This note is a part of my Zettelkasten. What is below might not be complete or accurate. It
is also likely to change often.
This is a review of my notes on MDPs from CS6300 Artificial Intelligence.
Markov Decision Processes are a family of non-deterministic search problems. They are characterized by the following properties: a set of states s, a set of actions a, a stochastic transition model T(s, a, s') giving the probability that taking action a in state s leads to state s', a reward function R(s, a, s'), and possibly a start state and terminal states. The "Markov" part means the transition probabilities depend only on the current state and action, not on the history.
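To make this concrete, here is one way such an MDP might be written down as plain Python data. The racing-car style states, transitions, and rewards below are a made-up illustration (not from the course), and the dict layout is just one possible representation; the later sketches in this note reuse this T/R/gamma layout.

```python
# Hypothetical example MDP, written as plain Python data.
from typing import Dict, List, Tuple

State, Action = str, str

# T[s][a] is a list of (next_state, probability) pairs.
T: Dict[State, Dict[Action, List[Tuple[State, float]]]] = {
    "cool": {"slow": [("cool", 1.0)],
             "fast": [("cool", 0.5), ("warm", 0.5)]},
    "warm": {"slow": [("cool", 0.5), ("warm", 0.5)],
             "fast": [("overheated", 1.0)]},
    "overheated": {},          # terminal state: no actions available
}

# R[(s, a, s')] is the reward for that transition (made-up numbers).
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}

gamma = 0.9  # discount factor
```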
The solution to an MDP is usually a policy, which specifies an action for every state. The optimal policy is the one that gives the maximum expected utility.
If the environment changes with time, there will not be a single policy that stays optimal; the optimal policy will have to change with time too.
For purposes of simplicity (in this course), we assume that the environment is static.
Utility is the sum of the rewards of all states/transitions from the start to the end. Expected utility is the expectation of this sum (since the environment is stochastic, the exact utility cannot be calculated in advance).
If there is a static environment and additive rewards over an infinite horizon, the expected utility could add up to infinity, and all decision making would break down. To prevent this, there are a few possible solutions: use a finite horizon (stop after a fixed number of steps), use discounted rewards (weight the reward at time t by γ^t), or ensure there is a terminal/absorbing state that the agent is guaranteed to eventually reach.
Using discounted utility, the expected utility becomes E[ Σ_t γ^t · r_t ]. The optimal policy then picks, in each state, the action with the maximum expected utility: π(s) = argmax_a Q*(s, a). If 0 < γ < 1, the discounted sum stays bounded as t → ∞, so utilities remain finite and the optimal policy is stationary (it depends only on the current state, not on the time step or the starting state).
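As a small sanity check of the discounted sum, here is a sketch (with made-up reward numbers) of computing Σ_t γ^t · r_t for one sampled trajectory of rewards:

```python
# Discounted return of one sampled reward sequence (hypothetical numbers).
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([2.0, 2.0, -10.0]))  # 2 + 0.9*2 + 0.81*(-10) = -4.3
```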
Bellman's equations are a set of recursive equations that can be solved iteratively for the expected utility (value) of each state:
Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
where
V*(s) = max_a Q*(s, a)
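A direct translation of these two equations into Python might look like the sketch below, assuming the dict-based T, R, and gamma from the earlier snippet (that interface is my own assumption, not course code):

```python
# Bellman backup for Q*(s, a), assuming the dict-based T, R, gamma sketched
# earlier (hypothetical interface): T[s][a] -> list of (s', prob), R[(s,a,s')].
def q_value(s, a, V, T, R, gamma):
    # Q*(s, a) = sum over s' of T(s,a,s') * ( R(s,a,s') + gamma * V*(s') )
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, p in T[s][a])

def state_value(s, V, T, R, gamma):
    # V*(s) = max over a of Q*(s, a); terminal states (no actions) get 0.
    if not T[s]:
        return 0.0
    return max(q_value(s, a, V, T, R, gamma) for a in T[s])
```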
Given a Markov Decision Process, and using Bellman's equations, there are two primary ways to figure out the optimal policy: value iteration and policy iteration.
The value iteration algorithm, in simple words, finds the converged (or close to converged) utility of each state by iterating with Bellman's equations.
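A minimal sketch of value iteration over the same assumed dict-based MDP: V starts at zero for every state and is repeatedly backed up until the values stop changing.

```python
# Value iteration over the assumed dict-based MDP (names are my assumptions).
def value_iteration(T, R, gamma, iterations=100, tol=1e-6):
    V = {s: 0.0 for s in T}                      # V_0(s) = 0 for all states
    for _ in range(iterations):
        new_V = {}
        for s in T:
            if not T[s]:                         # terminal state
                new_V[s] = 0.0
                continue
            # V_{k+1}(s) = max_a sum_s' T(s,a,s') (R(s,a,s') + gamma V_k(s'))
            new_V[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T[s][a])
                for a in T[s])
        if max(abs(new_V[s] - V[s]) for s in T) < tol:
            return new_V                         # (close to) converged
        V = new_V
    return V
```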
Policy extraction is basically:
policy(s) = argmax_a Q(s, a)
using the V* from the last iteration to compute the Q-values.
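Here is a sketch of that extraction step over the same assumed dict-based MDP, acting greedily with respect to the values V produced by value iteration:

```python
# Greedy policy extraction from converged values V (assumed MDP layout).
def extract_policy(V, T, R, gamma):
    policy = {}
    for s in T:
        if not T[s]:                             # terminal state: no action
            policy[s] = None
            continue
        # policy(s) = argmax_a Q(s, a), with Q computed from V
        policy[s] = max(
            T[s],
            key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                              for s2, p in T[s][a]))
    return policy
```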
Policy iteration is a modification of value iteration that typically converges much faster.
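A sketch of policy iteration under the same assumptions: evaluate the current policy (here by simple iterative updates), improve it greedily against those values, and stop once the policy no longer changes.

```python
# Policy iteration over the assumed dict-based MDP (my own sketch).
def policy_iteration(T, R, gamma, eval_iterations=50):
    # Start from an arbitrary policy: first available action in each state.
    policy = {s: (next(iter(T[s])) if T[s] else None) for s in T}
    while True:
        # Policy evaluation: V(s) = sum_s' T(s,pi(s),s')(R + gamma * V(s'))
        V = {s: 0.0 for s in T}
        for _ in range(eval_iterations):
            V = {s: 0.0 if policy[s] is None else
                    sum(p * (R.get((s, policy[s], s2), 0.0) + gamma * V[s2])
                        for s2, p in T[s][policy[s]])
                 for s in T}
        # Policy improvement: act greedily with respect to the evaluated V.
        new_policy = {}
        for s in T:
            if not T[s]:
                new_policy[s] = None
                continue
            new_policy[s] = max(
                T[s],
                key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                  for s2, p in T[s][a]))
        if new_policy == policy:                 # policy is stable: done
            return policy, V
        policy = new_policy
```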
If the environment is fully known, the situation is simple, as we saw above. If the environment (its transition and reward functions) is not known, we need to turn to Reinforcement Learning.