# Solving 4-puzzle with Reinforcement Learning (Q-learning) in Python

This article describes a simple approach to solve the 4-puzzle problem with Reinforcement-learning (using Q-learning). Not all instances of 4-puzzle problem are solvable by only shifting the space (represented by 0). Let’s aim at solving those problem instances only with model-free-methods.

#### The Markov Decision Process for 4-puzzle problem

Consider the following 24-state MDP for the 4-puzzle problem, each state being encoded as a flattened string, with the goal state being ‘0123‘.

For each state except the goal state, we have exactly two valid actions from the set of actions {‘Up‘, ‘Down‘, ‘Left‘, ‘Right‘}. The goal state is the exception which has an additional actionSTAY‘ that allows self transition.

The only actions that have a positive reward of +100 are (‘2103‘, ‘UP‘), (‘1023‘, ‘LEFT‘) and (‘1023‘, ‘STAY‘), as shown in the following figure: all the single-step actions that can transition to the goal state from any state. All other actions get a 0 reward from all other states.

Also notice that there are two connected components in the graph with the MDP states as shown in the following figure, there is a component from which the goal state is not reachable. Let’s consider running Q-learning starting from any state belonging to the other component, one that contains the goal state.

The following Epsilon-Greedy Q-learning algorithm is used with exploration, with ϵ=0.9 (and γ=0.8) to learn the Q matrix, which will later be used to solve a given puzzle instance.

The following animations and the figures show the Q matrix learnt after 815 episodes.

The following figure shows the scaled final Q matrix where the maximum value is kept at 100.

The following figure shows how the difference between the norms of the Q matrices learnt at successive episodes change over episodes and becomes almost zero at convergence for a long enough sequence of episodes.
Solving a puzzle instance

The following figure shows the values from the Q matrix learnt by the Q-learning algorithm. As expected, the values learnt increase as the states come near to the goal state and decrease as the states go far away from the goal state. This will lead the agent choose the right path towards the goal state.

Now, after the Q-matrix is learnt, now let’s solve a 4-puzzle instance ‘0312′ using the Q matrix learnt. We have to just follow the steps below to solve the puzzle using the Q matrix as a look-up table:
1. Set current state = ‘0312’
2. Find the best action at the current state from the Q matrix (the action with the highest value).
3. Use the action to transition to the next state from the current state.
4. If next state == goal state then stop else go to step 1.

The next animation shows the solution using the Q matrix learnt.