Using Monte-Carlo Simulation to play TicTacToe with Machine Players in Python

The following problem appeared in one of the mini-projects in the coursera course Principles of Computing 1 which is a part of the Specialization Fundamentals of Computing, taught by RICE university. In this article the problem is slightly modified as follows (the most part of the problem description is taken from the project description from the course itself, except the modification part):

TicTacToe (Monte Carlo)


TicTacToe is a simple children’s game played on a grid. Players alternate turns placing an “X” or an “O” on an empty grid square. The first player to get three in a row wins. If a player knows the appropriate strategy and the opponent does not, he cannot lose the game. Further, if both players understand the appropriate strategy the game will always end in a tie. An interesting variant of the game is “reverse” TicTacToe in which a player loses if he get three in arow. The game is also more interesting if it’s played on larger square grids.

In this article, a couple of machine players for TicTacToe will be implemented. Specifically, each of the machine players will use a Monte Carlo simulation to decide its next move. Although the game is played on a 3×3 grid, the implemented version should be able to handle any square grid.

Machine Player Strategy

Each of the machine players should use a Monte Carlo simulation to choose the next move from a given TicTacToe board position. The general idea is to play a collection of games with random moves starting from the position, and then use the results of these games to compute a good move. When a paritular machine player wins one of these random games, it wants to favor the squares in which it played (in hope of choosing a winning move) and avoid the squares in which the opponent played. Conversely, when it loses one of these random games, it wants to favor the squares in which the opponent played (to block its opponent) and avoid the squares in which it played. In short, squares in which the winning player played in these random games should be favored over squares in which the losing player played. Both the players in this case will be the machine players.

Here is an outline of this simulation:

  1. Start with the current board given (let’s call it board).
  2. Repeat for the desired number of trials:
    A. Set the current board to be board.
    B. Play an entire game on this board by just randomly choosing a move for each player.
    C. Score the resulting board.
    D. Add the scores to a running total across all trials.
  3. To select a move, randomly choose one of the empty squares on the board that has the maximum score.

These steps should be relatively straightforward except for step 2C, scoring the board. Scores are kept for each square on the board. The way to assign a score to a square depends on who won the game. If the game was a tie, then all squares should receive a score of 0, since that game will not help a player determine a winning strategy. If the current player (the player for which the code is currently selecting a move) won the game, each square that matches the current player should get a positive score and each square that matches the other player should get a negative score. Conversely, if the current player lost the game, each square that matches the current player should get a negative score and and each square that matches the other player should get a positive score. All empty squares should get a score of 0.

Note that it’s needed to select a final move from the total scores across all trials. So, in step 2D, all the scores from 2C are added to the running total of all scores. Higher scores indicate squares that are more likely to be played by the current player in winning games and lower scores indicate squares that are more likely to be played in losing games.


The following animations show how each of the machine players learns to play the game just by using Monte-Carlo Simulation with 10 trials (a small number of trials) at every state of the board, the scores shown at the right bottom of each grid square are used by each of the players at their corresponding turns, to choose its next move (the brighter cells represent better moves for the current player, as per the simulation results). Three different games are shown with 3 possible results.


The following game results in a tie.
As it’s intuitively clear, there is a trade-off between the learning accuracy (learn as many configurations as possible, so that it never loses, in turn none of the opponent machine players should lose, so every game should end in a tie in the perfect case) vs. learning efficiency (time to learn).

Ideally a machine player should play a large number of random games starting from the current state of the board to be more confident about the winning moves it needs to take, but it needs to use more and more trials for Monte-Carlo Simulation, which will increase its response time and the overall time to finish the game.

For each of the following different number of MC trials (from low = 5 to high = 5000), 100 different TicTacToe games are played in between the 2 machine players. The results obtained are shown below, they show the trade-off clearly. As the number of trials in Monte-Carlo Simulation increases,

  1. Number of ties increases.
  2. Total time to finish a game increases on average.

With a large number of trials, as large as 5000 each step, all the 100 games end in ties (both the machine players know enough from their experiences in the simulation, so they are highly likely not to lose ever), but it needs total time ~ 3-4 seconds on average to play each game.












Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s