This article describes a simple approach to solve the **4-puzzle** problem with **Reinforcement-learning** (using** Q-learning**). Not all instances of 4-puzzle problem are solvable by only shifting the space (represented by **0**). Let’s aim at solving those problem instances only with **model-free-methods**.

#### The Markov Decision Process for 4-puzzle problem

Consider the following **24-state MDP** for the 4-puzzle problem, each state being encoded as a flattened string, with the **goal state** being ‘**0123**‘.

For each state except the **goal state**, we have exactly two valid **actions** from the set of actions {‘**Up**‘, ‘**Down**‘, ‘**Left**‘, ‘**Right**‘}. The goal state is the exception which has an additional **action** ‘**STAY**‘ that allows self transition.

The only actions that have a positive **reward** of **+100** are (‘**2103**‘, ‘**UP**‘), (‘**1023**‘, ‘**LEFT**‘) and (‘**1023**‘, ‘**STAY**‘), as shown in the following figure: all the single-step actions that can transition to the **goal state** from any state. All other actions get a **0** reward from all other states.

Also notice that there are two connected components in the graph with the **MDP** states as shown in the following figure, there is a component from which the goal state is not reachable. Let’s consider running **Q-learning** starting from any state belonging to the other component, one that contains the **goal state**.

The following **Epsilon-Greedy Q-learning** algorithm is used **with exploration**, with ϵ=0.9 (and γ=0.8) to learn the **Q matrix**, which will later be used to solve a given puzzle instance.

The following animations and the figures show the **Q matrix** learnt after **815** episodes.

The following figure shows the scaled final **Q matrix** where the maximum value is kept at **100**.

**norms**of the

**Q matrices**learnt at successive

**episodes**change over episodes and becomes almost zero at

**convergence**for a long enough sequence of episodes.

**Solving a puzzle instance**

The following figure shows the values from the **Q matrix** learnt by the **Q-learning** algorithm. As expected, the values learnt increase as the states come near to the **goal state** and decrease as the states go far away from the **goal state**. This will lead the agent choose the right path towards the goal state.

**4-puzzle**instance ‘

**0312′**using the

**Q matrix**learnt. We have to just follow the steps below to solve the puzzle using the

**Q matrix**as a look-up table:

**current state = ‘0312’**

**best action**at the

**current state**from the

**Q matrix**(the action with the highest value).

**action**to

**transition**to the

**next state**from the

**current state.**

**next state == goal state**then

**stop**else

**go to step 1**.

The next animation shows the solution using the **Q matrix **learnt.