# Some Reinforcement Learning: Using Policy & Value Iteration and Q-learning for a Markov Decision Process in Python and R

The following problems appeared as a project in the edX course ColumbiaX: CSMM.102x Machine Learning. Here is the description of the problem:

Let’s consider the following 3-state MDP for a robot trying to walk, the three states being ‘Fallen’, ‘Standing’ and ‘Moving’, as shown in the following figure.

(1) First use the policy iteration method to find the optimal policy for the discount factor γ=0.1. Try this method with a different discount factor, for example a much larger one like 0.9 or 0.99, or a much smaller one like 0.01. Does the optimal policy change?

(2) Now try the value iteration method instead. How does the number of iterations change? What about the complexity of implementing each iteration?

(3) Implement the Q-learning method (without exploration) for this simple MDP. Run this script several times. Does the algorithm always converge?

(4) Now try adding simple ϵ-greedy exploration and rerun the Q-learning method.

The following theory is used for the implementation of the algorithms:

(1) The policy iteration method was implemented in Python, starting from the policy {Slow, Slow, Slow} and applying the policy evaluation and policy improvement steps iteratively.
As can be seen from the following results, a change in the optimal policy happened at a higher value of the discount factor γ. The following figures (R visualizations) show how the optimal value vector changes with different values of γ.
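The evaluation/improvement loop can be sketched as below. Note that the transition probabilities and rewards here are stand-in values (the actual numbers come from the course figure, which is not reproduced in this post), so only the structure of the algorithm carries over, not the exact results.

```python
import numpy as np

# Stand-in dynamics for the 3-state MDP (assumed values; the true
# probabilities and rewards are given in the course figure).
# States: 0=Fallen, 1=Standing, 2=Moving; actions: 0=Slow, 1=Fast.
# P[a, s, s'] = transition probability, R[a, s] = expected immediate reward.
P = np.array([
    [[0.6, 0.4, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # Slow
    [[1.0, 0.0, 0.0], [0.4, 0.0, 0.6], [0.2, 0.0, 0.8]],   # Fast
])
R = np.array([[-0.6, 1.0, 1.0],    # Slow
              [-1.0, 0.8, 1.4]])   # Fast

def policy_evaluation(pi, gamma):
    """Solve V = R_pi + gamma * P_pi V exactly as a linear system."""
    idx = np.arange(3)
    P_pi, R_pi = P[pi, idx], R[pi, idx]
    return np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

def policy_iteration(gamma):
    """Alternate exact evaluation and greedy improvement until stable."""
    pi = np.zeros(3, dtype=int)          # start from {Slow, Slow, Slow}
    while True:
        V = policy_evaluation(pi, gamma)
        Q = R + gamma * P @ V            # Q[a, s] for all state-action pairs
        new_pi = Q.argmax(axis=0)        # greedy improvement step
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```

Running `policy_iteration(0.1)` and `policy_iteration(0.9)` on the real dynamics is what reveals whether the optimal policy changes with γ.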

#### Policy Iteration: changes in the value vector corresponding to the optimal policy

(2) The value iteration method was implemented similarly. It took a much higher number of iterations to converge than the policy iteration method, although its individual iterations were much less complex.
As can be seen from the following results, a change in the optimal policy happened at a higher value of the discount factor γ. The following figures show how the optimal value vector changes with different values of γ.
Also, as can be seen, an increase in γ (which makes the expected future rewards more important) resulted in a corresponding increase in the number of iterations needed for the value iteration algorithm to converge (e.g., for γ=0.99 the method took more than 1000 iterations to converge).
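A corresponding sketch of value iteration, again on stand-in dynamics (assumed values, not the course's actual figure): each sweep is a single cheap Bellman backup, but many more sweeps are needed as γ grows.

```python
import numpy as np

# Stand-in dynamics (assumed values; the true numbers are in the course figure).
# States: 0=Fallen, 1=Standing, 2=Moving; actions: 0=Slow, 1=Fast.
P = np.array([
    [[0.6, 0.4, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # Slow
    [[1.0, 0.0, 0.0], [0.4, 0.0, 0.6], [0.2, 0.0, 0.8]],   # Fast
])
R = np.array([[-0.6, 1.0, 1.0],    # Slow
              [-1.0, 0.8, 1.4]])   # Fast

def value_iteration(gamma, tol=1e-8, max_iter=100000):
    """Repeat the Bellman optimality backup until the sup-norm change < tol."""
    V = np.zeros(3)
    for it in range(1, max_iter + 1):
        V_new = (R + gamma * P @ V).max(axis=0)   # one cheap backup per sweep
        if np.abs(V_new - V).max() < tol:
            pi = (R + gamma * P @ V_new).argmax(axis=0)
            return V_new, pi, it
        V = V_new
    raise RuntimeError("value iteration did not converge")
```

Since the error contracts by roughly a factor of γ per sweep, the iteration count grows like log(tol)/log(γ), which matches the observation that γ=0.99 needs over 1000 sweeps while γ=0.1 needs only a handful.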
(3) Q-learning without exploration: the algorithm was implemented using weighted random sampling from the set of states, with weights proportional to the corresponding transition probabilities for each (state, action) pair at every time step, starting from the ‘Fallen’ state at time step 0. For convergence, the algorithm was run until the difference in the norms of the learnt Q matrices in two successive iterations became < 1e-6, and for at least 20 time steps.

As expected, without exploration the ‘Fast’ action in the ‘Moving’ state (although it actually has a higher expected reward) never gets a chance to be explored, and the algorithm always converges to the ‘Slow’ action as the optimal policy in all the states. With a higher γ, the number of iterations needed to converge increases, and the learnt value for (Moving, Slow) in the Q matrix becomes higher at convergence.
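The sampling scheme and stopping rule described above can be sketched as follows. The dynamics and the expected-reward table are stand-in assumptions (the real numbers are in the course figure), so only the mechanics of the greedy Q-update are illustrative.

```python
import numpy as np

# Stand-in dynamics (assumed values; the true numbers are in the course figure).
# States: 0=Fallen, 1=Standing, 2=Moving; actions: 0=Slow, 1=Fast.
P = np.array([
    [[0.6, 0.4, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # Slow
    [[1.0, 0.0, 0.0], [0.4, 0.0, 0.6], [0.2, 0.0, 0.8]],   # Fast
])
R = np.array([[-0.6, 1.0, 1.0],    # Slow
              [-1.0, 0.8, 1.4]])   # Fast

def q_learning(gamma, alpha=0.1, tol=1e-6, max_steps=100000, seed=0):
    """Greedy (exploration-free) Q-learning from the 'Fallen' start state."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((3, 2))                     # Q[state, action]
    s = 0                                    # start in 'Fallen' at t = 0
    for t in range(1, max_steps + 1):
        Q_old = Q.copy()
        a = int(Q[s].argmax())               # always greedy: no exploration
        s_next = rng.choice(3, p=P[a, s])    # weighted sampling of next state
        td_target = R[a, s] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
        # stop once the Q matrix barely changes (and at least 20 steps ran)
        if t >= 20 and np.linalg.norm(Q - Q_old) < tol:
            break
    return Q
```

With these (assumed) numbers the greedy agent gets stuck on ‘Slow’ at the ‘Moving’ state: once Q[Moving, Slow] turns positive, Q[Moving, Fast] is never updated again, mirroring the behaviour described above.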

The following figures and animations show the Q matrix learnt at convergence for different values of γ.

(4) ϵ-greedy Q-learning with exploration: with probability ϵ, a random action was chosen instead of the greedy one. As expected, with exploration the ‘Fast’ action at the ‘Moving’ state also gets explored sometimes (the frequency of exploration depends on the value of ϵ chosen), and the algorithm sometimes converges to the expected optimal policy: the ‘Slow’ action for the ‘Fallen’ and ‘Standing’ states and the ‘Fast’ action for the ‘Moving’ state.
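Adding exploration only changes the action-selection line; a sketch on the same kind of stand-in dynamics (assumed values, not the course figure):

```python
import numpy as np

# Stand-in dynamics (assumed values; the true numbers are in the course figure).
# States: 0=Fallen, 1=Standing, 2=Moving; actions: 0=Slow, 1=Fast.
P = np.array([
    [[0.6, 0.4, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # Slow
    [[1.0, 0.0, 0.0], [0.4, 0.0, 0.6], [0.2, 0.0, 0.8]],   # Fast
])
R = np.array([[-0.6, 1.0, 1.0],    # Slow
              [-1.0, 0.8, 1.4]])   # Fast

def q_learning_eps(gamma, eps, alpha=0.1, steps=20000, seed=0):
    """epsilon-greedy Q-learning: take a random action with probability eps."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((3, 2))                     # Q[state, action]
    s = 0                                    # start in 'Fallen'
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(2))         # explore: uniformly random action
        else:
            a = int(Q[s].argmax())           # exploit: greedy action
        s_next = rng.choice(3, p=P[a, s])    # weighted sampling of next state
        Q[s, a] += alpha * (R[a, s] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```

Because exploration is random, different seeds can end up with different greedy policies, which is exactly the run-to-run variation discussed below.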

The following figures and animations show two sample runs of ϵ-greedy Q-learning with ϵ=0.1, each converging to a different optimal policy.

#### Results with ϵ=0.1

Since different runs converged to different policies with ϵ=0.1, the algorithm was run 100 times and the frequencies of the unique Q matrices learnt (with values rounded to 1 decimal place) were noted, as shown in the following figure.

1. For 4 out of 100 cases the optimal policy learnt was the ‘Slow’ action for the ‘Fallen’ and ‘Standing’ states and the ‘Fast’ action for the ‘Moving’ state.
2. For 92 out of 100 cases the optimal policy learnt was the ‘Slow’ action for all the states.
3. For 4 out of 100 cases the optimal policy learnt was the ‘Slow’ action for the ‘Fallen’ and ‘Moving’ states and the ‘Fast’ action for the ‘Standing’ state.
4. For 45 out of 100 cases the ‘Fast’ action at the ‘Moving’ state remained unexplored.

The above results suggest that ϵ=0.1 probably does not explore enough, and its value should be increased.
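The repeated-runs experiment itself can be reproduced with a sketch like the one below, again on assumed stand-in dynamics (so the frequencies will not match the reported ones): run ϵ-greedy Q-learning many times with different seeds and count the distinct greedy policies induced by the learnt Q matrices.

```python
import numpy as np
from collections import Counter

# Stand-in dynamics (assumed values; the true numbers are in the course figure).
# States: 0=Fallen, 1=Standing, 2=Moving; actions: 0=Slow, 1=Fast.
P = np.array([
    [[0.6, 0.4, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # Slow
    [[1.0, 0.0, 0.0], [0.4, 0.0, 0.6], [0.2, 0.0, 0.8]],   # Fast
])
R = np.array([[-0.6, 1.0, 1.0], [-1.0, 0.8, 1.4]])          # Slow / Fast

def q_learning_eps(gamma, eps, alpha=0.1, steps=5000, seed=0):
    """One epsilon-greedy Q-learning run from the 'Fallen' start state."""
    rng = np.random.default_rng(seed)
    Q, s = np.zeros((3, 2)), 0
    for _ in range(steps):
        a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
        s_next = rng.choice(3, p=P[a, s])
        Q[s, a] += alpha * (R[a, s] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

def policy_frequencies(gamma, eps, runs=100):
    """Count how often each greedy policy is learnt over independent runs."""
    names = ('Slow', 'Fast')
    counts = Counter()
    for seed in range(runs):
        Q = q_learning_eps(gamma, eps, seed=seed)
        policy = tuple(names[int(a)] for a in Q.argmax(axis=1))
        counts[policy] += 1
    return counts
```

`policy_frequencies(0.9, 0.1)` versus `policy_frequencies(0.9, 0.5)` then shows how the spread of learnt policies shifts as ϵ grows.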

#### Results with ϵ=0.5

Again the algorithm was run 100 times, and the frequencies of the unique Q matrices learnt (with values rounded to 1 decimal place) were noted, as shown in the following figure.

1. For 33 out of 100 cases the optimal policy learnt was the ‘Slow’ action for the ‘Fallen’ and ‘Standing’ states and the ‘Fast’ action for the ‘Moving’ state.
2. For 37 out of 100 cases the optimal policy learnt was the ‘Slow’ action for all the states.
3. For 15 out of 100 cases the optimal policy learnt was the ‘Slow’ action for the ‘Fallen’ and ‘Moving’ states and the ‘Fast’ action for the ‘Standing’ state.
4. For 15 out of 100 cases the optimal policy learnt was the ‘Fast’ action for the ‘Standing’ and ‘Moving’ states and the ‘Slow’ action for the ‘Fallen’ state.