This note is a part of my Zettelkasten. What is below might not be complete or accurate. It
is also likely to change often.

8th June, 2021
# Markov Decision Processes

### Environment: Stationary or Dynamic?

### Rewards and Utility

## Bellman's Equations and Policy Derivation

### Value Iteration and Policy Extraction

### Policy Iteration

This is a review of my notes on MDPs from CS6300 Artificial Intelligence.

Markov Decision Processes are a family of non-deterministic search problems. They are characterized by the following properties:

- **Fully observable** - the agent can observe the full state of the environment; there is no uncertainty about the current state.
- **Stochastic environment** - the agent's actions do not solely determine the outcome; transitions are probabilistic.
- **Markovian transition model** - the outcome of an action at a state depends only on the current state and the action, not on past states.
- **Additive rewards** - the total utility is the sum of the rewards collected along the way.

The solution to an MDP is usually a policy - a function that specifies an action for any state. The optimal policy is the one that gives the maximum expected utility.

If the environment changes with time, no single policy stays optimal. The optimal policy has to change with time too.

For simplicity (in this course), we assume that the environment is stationary.

Utility is the sum of rewards over all states/transitions from the start to the end. Expected utility is the expectation of this sum (since the environment is stochastic, the exact utility cannot be known in advance).

With an infinite horizon and additive rewards, the expected utility could add up to infinity. In that case, all decision making breaks down, since every policy looks equally good. To prevent this, we have some possible solutions:

- Finite horizons - terminate the search after n steps. However, this makes the optimal policy non-stationary, since the time remaining becomes a factor in the utility. Still usable in certain situations.
- Inevitable terminal states - there is some guarantee that a terminal state will be reached within some finite number of steps. All roads lead to termination.
- Discounted utility - U(r_{0}, r_{1}, r_{2}, ...) = r_{0} + γ.r_{1} + γ^{2}.r_{2} + ... where 0 < γ < 1. This ensures that as t increases to infinity, the additional utility decreases to 0.
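The discounting idea above can be sketched in a few lines of Python. The reward sequence and γ below are made-up illustrative values; the point is that with γ < 1 even an infinite stream of constant rewards stays bounded (at r/(1-γ)).

```python
# Discounted utility: U(r_0, r_1, ...) = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_utility(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A long stream of reward 1.0 with gamma = 0.9 stays bounded,
# approaching 1 / (1 - 0.9) = 10.
print(discounted_utility([1.0] * 1000, 0.9))  # close to 10.0
```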

Using discounted utility, the expected optimal utility turns out to be E(Sum_{t}( γ^{t} . r_{t} )). The optimal policy picks, in each state, the action with maximum expected utility. If 0 < γ < 1, this policy is independent of the starting time (it is stationary), since the discounted sum converges as t goes to infinity.

Bellman's equations are a set of recursive equations that can be solved iteratively for the expected utilities.

Q^{*}(s, a) = Sum_{s'}[ T(s,a,s') . (R(s,a,s') + γ.V^{*}(s')) ]

where

- T(s,a,s') is the probability of transitioning from s to s' on action a
- R(s,a,s') is the reward of transitioning from s to s' using action a
- V^{*}(s') is the best known expected utility at state s'
- γ is the discounting factor

V^{*}(s) = max_{a}(Q^{*}(s, a))
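A single Bellman backup can be made concrete on a tiny, made-up two-state MDP. The transition probabilities, rewards, and current V estimates below are illustrative assumptions, not values from the notes.

```python
gamma = 0.9
# T maps (s, a) to a list of (next_state, probability) pairs
T = {("s0", "go"): [("s0", 0.2), ("s1", 0.8)]}
R = {("s0", "go", "s0"): 0.0, ("s0", "go", "s1"): 1.0}
V = {"s0": 0.0, "s1": 2.0}  # current estimates of V*(s')

def q_value(s, a):
    # Q(s, a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V(s'))
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])

# 0.2 * (0 + 0.9*0) + 0.8 * (1 + 0.9*2) = 2.24
print(q_value("s0", "go"))
```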

Given a Markov Decision Process, Bellman's equations give us two primary ways to find the optimal policy: value iteration and policy iteration.

The value iteration algorithm, simply put, finds the converged (or close to converged) utility of each state by iterating with Bellman's equations:

- Initialize the expected value of each state: V_{s} = 0
- In each iteration, use the values from the previous iteration to recalculate V for every state
- Continue iterating until the values converge (the difference between successive iterations is less than a threshold)
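The steps above can be sketched on a made-up three-state chain MDP ("a" → "b" → "end"). All transition probabilities and rewards here are illustrative assumptions, not values from the course.

```python
gamma = 0.9
states = ["a", "b", "end"]
actions = ["move", "stay"]
# T[(s, a)] is a list of (next_state, probability) pairs
T = {
    ("a", "move"): [("b", 0.9), ("a", 0.1)],
    ("a", "stay"): [("a", 1.0)],
    ("b", "move"): [("end", 0.9), ("b", 0.1)],
    ("b", "stay"): [("b", 1.0)],
    ("end", "move"): [("end", 1.0)],
    ("end", "stay"): [("end", 1.0)],
}

def reward(s, a, s2):
    # +10 for entering the terminal state, 0 otherwise (an assumption)
    return 10.0 if s2 == "end" and s != "end" else 0.0

def value_iteration(threshold=1e-6):
    V = {s: 0.0 for s in states}  # step 1: initialize all values to 0
    while True:
        # step 2: one Bellman backup per state, V(s) = max_a Q(s, a)
        V_new = {s: max(sum(p * (reward(s, a, s2) + gamma * V[s2])
                            for s2, p in T[(s, a)])
                        for a in actions)
                 for s in states}
        # step 3: stop once successive iterations barely differ
        if max(abs(V_new[s] - V[s]) for s in states) < threshold:
            return V_new
        V = V_new

V_star = value_iteration()
print(V_star)
```

States closer to the reward end up with higher converged values ("b" above "a"), and the terminal state keeps value 0 since no further reward is reachable from it.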

Policy extraction is basically:

policy(s) = argmax_{a}(Q(s,a))

using the V^{*} values from the last iteration.
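Policy extraction is a one-liner given a value table. The tiny MDP below (states, transitions, rewards, and the V values) is entirely made up for illustration, with V standing in for the output of value iteration.

```python
gamma = 0.9
T = {("s", "left"): [("bad", 1.0)], ("s", "right"): [("good", 1.0)]}
R = {("s", "left", "bad"): -1.0, ("s", "right", "good"): 1.0}
V = {"bad": 0.0, "good": 5.0}  # pretend these came from value iteration

def extract_policy(s, actions):
    # policy(s) = argmax over a of Q(s, a)
    return max(actions, key=lambda a: sum(
        p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)]))

print(extract_policy("s", ["left", "right"]))  # -> right
```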

Policy iteration is a modification of value iteration which usually converges much faster:

- Initialize to a random policy pi
- Calculate the converged utility of every state under pi (using Bellman's equations)
- For each state, check whether there is a non-policy action that leads to a higher utility
- If yes, change policy pi to use the new action
- If policy pi was updated in the previous steps, rerun from step 2
- If policy pi is unchanged, that is our solution
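The loop above can be sketched on the same kind of made-up chain MDP. All numbers are illustrative assumptions, and the policy-evaluation step is done by repeated sweeps here rather than by solving the linear system exactly.

```python
gamma = 0.9
states = ["a", "b", "end"]
actions = ["move", "stay"]
T = {
    ("a", "move"): [("b", 0.9), ("a", 0.1)],
    ("a", "stay"): [("a", 1.0)],
    ("b", "move"): [("end", 0.9), ("b", 0.1)],
    ("b", "stay"): [("b", 1.0)],
    ("end", "move"): [("end", 1.0)],
    ("end", "stay"): [("end", 1.0)],
}

def reward(s, a, s2):
    return 10.0 if s2 == "end" and s != "end" else 0.0

def q_value(s, a, V):
    return sum(p * (reward(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])

def evaluate(policy, sweeps=200):
    # step 2: approximate the converged utility under the fixed policy
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: q_value(s, policy[s], V) for s in states}
    return V

def policy_iteration():
    policy = {s: "stay" for s in states}  # step 1: an arbitrary policy
    while True:
        V = evaluate(policy)
        # steps 3-4: greedily switch to any action with higher utility
        improved = {s: max(actions, key=lambda a: q_value(s, a, V))
                    for s in states}
        if improved == policy:  # step 6: unchanged policy is the solution
            return policy, V
        policy = improved      # step 5: rerun evaluation

pi_star, V_pi = policy_iteration()
print(pi_star)
```

Each outer iteration does a full policy evaluation, but the number of outer iterations is typically small, which is where the speedup over plain value iteration comes from.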

If the environment is fully known, the situation is as simple as we see above. When the environment (its transition model and rewards) is not known, we need to turn to Reinforcement Learning.