The following problems appeared as a project in the *edX course ColumbiaX: CSMM.102x Machine Learning*. Here is the description of the problem:

Let’s consider the following **3-state MDP** for a robot trying to walk, the three states being ‘**Fallen**‘, ‘**Standing**‘ and ‘**Moving**‘, as shown in the following figure.

**policy iteration**method to find optimal policy for discount factor γ=0.1. Try using this method with a different discount factor, for example a much larger discount factor like 0.9 or 0.99, or a much smaller one like 0.01. Does the optimal policy change?

(2) Now, try using **value iteration** method instead. How does the number of iterations change? What about the complexity of implementing each iteration?

(3) Implement **Q-learning** method (without exploration) for this simple **MDP**. Run this script several times. Does the algorithm always converge?

(4) Now, try adding a simple **epsilon greedy exploration** and rerun the **q-learning** method.

The following theory is going to be used for implementation of the algorithms:

**policy iteration**method was implemented in

*python*, where starting from the policy {Slow, Slow, Slow}, policy evaluation and improvement steps were applied iteratively.

**optimal policy**happened at higher value of the discount factor γγ. The following figures (

*R*visualizations) show how the

**optimal value vector**changes with different values of γ .

#### Policy Iteration: changes in the value vector corresponding to the optimal policy

**value iteration**method was implemented similarly, it took much higher number of iterations to converge than the

**policy iteration**method, although it individual iterations were much less complex.

**optimal policy**happened at higher value of the discount factor γ . The following figures show how the

**optimal value vector**changes with different values of γ.

**Q-learning**

**without**

**exploration**: The algorithm was implemented by using weighted random sampling from the set of states, with the weights proportional to the corresponding probabilities, for any

*(state, action)*at each time step, starting from the ‘

**Fallen**‘ state at time step 0.For

**convergence**, the algorithm was run until the difference in the

**norm**of the learnt

**Q matrices**in two successive iterations became < 1e−6 and for at least 20 time steps.

As expected, without exploration, the ‘**Fast**‘ action in the ‘**Moving**‘ state (although actually with higher expected reward) does not get a chance to be explored ever and it always converges with the **optimal policy** as ‘**Slow**‘ action for all the states. With higher γγ, the number of iterations to converge becomes higher and the value for **(Moving, Slow)** in the **Q** matrix learnt becomes higher at convergence.

The following figures and animations show the **Q matrix** learnt at convergence for different values of γ.

(4) **ϵ ****greedy** **Q-learning ****with exploration**: with probability equal to ϵ, a random state was explored next, along with the greedy transition. As expected, with exploration, the ‘**Fast**‘ action at the **Moving** state also gets explored sometimes (the frequency of exploration depends on the value of ϵϵ chosen) and sometimes it converges with the **optimal policy** as ‘**Slow**‘ action for the states ‘**Fallen**‘ & ‘**Standing**‘ and ‘**Fast**‘ action for the ‘**Moving**‘ state, as expected.

The following figures and animations show two sample runs of the **epslion greedy** **Q-learning** with ϵ=0.1, each one converging at a different **optimal policy**.

#### Results with ϵ=0.1

Since different runs converged at different policies with ϵ=0.1, the algorithm was run 100 times and the frequencies of the unique **Q matrix** learnt (with values rounded to 1 decimal place) was noted, as shown in the following figure.

- For 4 out of 100 cases the
**optimal policy**learnt by the Q matrix was ‘**Slow**‘ action for the states ‘**Fallen**‘ & ‘**Standing**‘ and ‘**Fast**‘ action for the ‘**Moving**‘ state. - For 92 out of 100 cases the
**optimal policy**learnt by the Q matrix was ‘**Slow**‘ action for all the states. - For 4 out of 100 cases the
**optimal policy**learnt by the Q matrix was ‘**Slow**‘ action for the states ‘**Fallen**‘ & ‘**Moving**‘ and ‘**Fast**‘ action for the ‘**Standing**‘ state. - For 45 out of 100 cases the ‘
**Fast**‘ action at the ‘**Moving**‘ state is still unexplored.

The above results suggest that ϵ=0.1 is probably not exploring enough and its value should be increased.

#### Results with ϵ=0.5

Again the algorithm was run 100 times and the frequencies of the unique Q matrix learnt (with values rounded to 1 decimal place) was noted, as shown in the following figure.

- For 33 out of 100 cases the optimal policy learnt by the Q matrix was ‘
**Slow**‘ action for the states ‘**Fallen**‘ & ‘**Standing**‘ and ‘**Fast**‘ action for the ‘**Moving**‘ state. - For 37 out of 100 cases the optimal policy learnt by the Q matrix was ‘
**Slow**‘ action for all the states. - For 15 out of 100 cases the optimal policy learnt by the Q matrix was ‘
**Slow**‘ action for the states ‘**Fallen**‘ & ‘**Moving**‘ and ‘**Fast**‘ action for the ‘**Standing**‘ state. - For 15 out of 100 cases the optimal policy learnt by the Q matrix was ‘
**Fast**‘ action for the states ‘**Standing**‘ & ‘**Moving**‘ and ‘**Slow**‘ action for the ‘**Fallen**‘ state.