This article describes a simple approach to solve the 4-puzzle problem with Reinforcement-learning (using Q-learning). Not all instances of 4-puzzle problem are solvable by only shifting the space (represented by 0). Let’s aim at solving those problem instances only with model-free-methods.
The Markov Decision Process for 4-puzzle problem
Consider the following 24-state MDP for the 4-puzzle problem, each state being encoded as a flattened string, with the goal state being ‘0123‘.
For each state except the goal state, we have exactly two valid actions from the set of actions {‘Up‘, ‘Down‘, ‘Left‘, ‘Right‘}. The goal state is the exception which has an additional action ‘STAY‘ that allows self transition.
The only actions that have a positive reward of +100 are (‘2103‘, ‘UP‘), (‘1023‘, ‘LEFT‘) and (‘1023‘, ‘STAY‘), as shown in the following figure: all the single-step actions that can transition to the goal state from any state. All other actions get a 0 reward from all other states.
Also notice that there are two connected components in the graph with the MDP states as shown in the following figure, there is a component from which the goal state is not reachable. Let’s consider running Q-learning starting from any state belonging to the other component, one that contains the goal state.
The following Epsilon-Greedy Q-learning algorithm is used with exploration, with ϵ=0.9 (and γ=0.8) to learn the Q matrix, which will later be used to solve a given puzzle instance.
The following animations and the figures show the Q matrix learnt after 815 episodes.
The following figure shows the scaled final Q matrix where the maximum value is kept at 100.
The following figure shows the values from the Q matrix learnt by the Q-learning algorithm. As expected, the values learnt increase as the states come near to the goal state and decrease as the states go far away from the goal state. This will lead the agent choose the right path towards the goal state.
The next animation shows the solution using the Q matrix learnt.
Reblogged this on ANONYMOUS WORLDWIDE UNITED.
LikeLike