This note is a part of my Zettelkasten. What is below might not be complete or accurate. It
is also likely to change often.
This is a review of my notes on Reinforcement Learning from CS6300 Artificial Intelligence.
The goal of Reinforcement Learning is to use observations to learn an optimal policy in a stationary, Markovian environment where the agent doesn't know the transition model or the results of its actions. The agent faces a fixed but unknown MDP.
Reinforcement Learning can be one of two types: model-based or model-free. In model-based learning, we try to estimate the transition probabilities and the reward function, and then solve the resulting MDP as usual. In model-free learning, we estimate the utilities of states and the policy directly, without building a model.
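As a rough sketch of the model-based approach, the transition and reward models can be estimated by simple counting over the observed transitions. The `transitions` list and its (s, a, s', r) tuples below are assumptions about how the observations are stored, not anything prescribed by the notes:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s, a, s') and R(s, a, s') from observed (s, a, s_next, r)
    tuples by maximum-likelihood counting."""
    counts = defaultdict(int)         # N(s, a, s')
    totals = defaultdict(int)         # N(s, a)
    reward_sums = defaultdict(float)  # sum of rewards seen for (s, a, s')

    for s, a, s_next, r in transitions:
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        reward_sums[(s, a, s_next)] += r

    T = {k: counts[k] / totals[(k[0], k[1])] for k in counts}
    R = {k: reward_sums[k] / counts[k] for k in counts}
    return T, R  # these can then be fed to value or policy iteration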
There are two modes of reinforcement learning: passive and active.
The goal of passive reinforcement learning is to evaluate a fixed policy: the agent follows the same policy throughout the learning process, and we learn the utilities of the states (or the model) purely from its observations.
Direct utility estimation is a type of model-free learning where we estimate the utility of a state as the average, over all observations, of the sum of rewards from that state up to the end of the episode.
V(s) = (sum of observed rewards-to-go from s) / (number of times s was visited)
This converges to the right answer in the limit, but it is very wasteful: it ignores the fact that the utilities of neighbouring states are linked through the Bellman equations, so each observation is used inefficiently and convergence is slow.
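A minimal sketch of direct utility estimation, assuming each episode is stored as a list of (state, reward) pairs ending at a terminal state (the variable names are illustrative):

```python
from collections import defaultdict

def direct_utility_estimation(episodes, gamma=1.0):
    """Average the observed reward-to-go (return) over every visit to a state."""
    return_sums = defaultdict(float)
    visit_counts = defaultdict(int)

    for episode in episodes:          # episode: [(s0, r0), (s1, r1), ...]
        G = 0.0
        # Walk backwards so G is the discounted return from each state onward.
        for s, r in reversed(episode):
            G = r + gamma * G
            return_sums[s] += G
            visit_counts[s] += 1

    return {s: return_sums[s] / visit_counts[s] for s in return_sums}
```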
TD Learning is a model-free learning technique where we update V(s) from every observed transition (s, a, s', r). The key insight is that the implicit transition probabilities will weight the contribution from each s' proportional to how likely that outcome is.
For a given policy pi, repeat for every observed transition (s, a, s', r):
V(s) = (1 - alpha).V(s) + alpha.(r + gamma.V(s'))
where alpha is the learning rate (which can decay over time so that the values converge) and gamma is the standard discount factor from the Bellman equations.
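A sketch of the TD(0) update for evaluating a fixed policy. The env.reset()/env.step() interface returning (next_state, reward, done) is a hypothetical stand-in for however transitions are actually sampled:

```python
from collections import defaultdict

def td_policy_evaluation(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0): after each observed transition (s, a, s', r), nudge V(s)
    toward the sampled target r + gamma * V(s')."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] = (1 - alpha) * V[s] + alpha * target
            s = s_next
    return V
```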
The goal of active reinforcement learning is to learn an optimal policy while the agent is actively choosing its own actions, so learning and acting happen at the same time.
An agent must choose between exploration (trying actions to gather more information) and exploitation (acting on its current best estimates). There is a tradeoff between the two.
One way to resolve the exploration vs. exploitation dilemma is to force exploration through occasional random actions until we have an optimal policy.
One problem with this is that random actions explore indiscriminately. We should instead choose actions that give us the most information, which can be done with an exploration function.
The exploration function should prioritize states that haven't been visited much, and it should converge to the actual utility function as time passes.
Exploration function f(u,n) = u + k/n
where u is the best utility estimate, k is a constant, and n is the number of times a state has been visited.
Q-Learning is similar to TD learning, but it learns Q(s,a) values instead of V(s).
Q(s,a) = (1 - alpha).Q(s,a) + alpha.[R(s,a,s') + gamma.max_a' Q(s',a')]
Adding on the exploration function:
Q(s,a) = (1 - alpha).Q(s,a) + alpha.[R(s,a,s') + gamma.max_a' f(Q(s',a'), n)]
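A sketch of Q-learning with the exploration function folded into both action selection and the update target. The env interface, the explicit `actions` list, and keeping the visit counts per state-action pair (rather than per state) are all assumptions made for the example:

```python
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, k=1.0):
    """Q-learning with an optimistic exploration function f(u, n) = u + k / n."""
    Q = defaultdict(float)   # Q[(s, a)]
    N = defaultdict(int)     # visit counts N[(s, a)]

    def f(u, n):
        return u + k / (n + 1)   # +1 avoids division by zero for unvisited pairs

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Act greedily with respect to the exploration-boosted values.
            a = max(actions, key=lambda act: f(Q[(s, act)], N[(s, act)]))
            s_next, r, done = env.step(a)
            N[(s, a)] += 1
            best_next = 0.0 if done else max(f(Q[(s_next, a2)], N[(s_next, a2)])
                                             for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next
    return Q
```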
SARSA is very similar to Q-learning, except that it uses the action a' the agent actually takes in the next state to compute the target, instead of taking max_a'.
Q(s,a) = (1 - alpha).Q(s,a) + alpha.[R(s,a,s') + gamma.Q(s',a')]
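A minimal sketch of this on-policy update, reusing the hypothetical env interface from the earlier sketches and using epsilon-greedy random-action exploration for action selection:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """SARSA: the target uses the action a' actually taken next, not max over a'."""
    Q = defaultdict(float)

    def choose(s):
        if random.random() < epsilon:        # occasional forced random action
            return random.choice(actions)
        return max(actions, key=lambda act: Q[(s, act)])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = choose(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s, a = s_next, a_next
    return Q
```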
In most real-world situations, it is impossible to compute and store values for every possible state. It is often easier to model the world using a set of features of the state.
V(s) = w0 + w1.f1(s) + w2.f2(s) + ...
Q(s,a) = w0 + w1.f1(s,a) + w2.f2(s,a) + ...
If the features are chosen well, experience can be summarized very succinctly, which allows good generalization from visited states to states we have never seen.
Use function approximation to estimate the Q(s,a) values while learning. This scales much better when we have good features.
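A sketch of approximate Q-learning with a linear combination of features. The `features(s, a)` function returning a fixed-length list of numbers (a constant 1 feature can play the role of w0) and the env interface are assumptions for the example:

```python
import random

def approx_q_learning(env, actions, features, num_features,
                      episodes=1000, alpha=0.01, gamma=0.9, epsilon=0.1):
    """Approximate Q-learning: Q(s, a) = sum_i w_i * f_i(s, a); the weights
    are nudged in the direction that reduces the TD error."""
    w = [0.0] * num_features

    def q_value(s, a):
        return sum(wi * fi for wi, fi in zip(w, features(s, a)))

    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda act: q_value(s, act))

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = choose(s)
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(q_value(s_next, a2) for a2 in actions)
            diff = (r + gamma * best_next) - q_value(s, a)   # TD error
            for i, fi in enumerate(features(s, a)):
                w[i] += alpha * diff * fi                    # per-feature weight update
            s = s_next
    return w
```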