Estimating the value of the Percolation threshold via Monte Carlo simulation in R

This problem appeared as an exercise in the Coursera course Algorithms, Part I (by Prof. Robert Sedgewick, Princeton) as an application of union-find.


(as defined in

Given a composite system composed of randomly distributed insulating and metallic materials, what fraction of the materials needs to be metallic so that the composite system is an electrical conductor? Given a porous landscape with water on the surface (or oil below), under what conditions will the water be able to drain through to the bottom (or the oil to gush through to the surface)? Scientists have defined an abstract process known as percolation to model such situations.

The model

A percolation system is modeled using an n-by-n grid of sites. Each site is either open or blocked. A full site is an open site that can be connected to an open site in the top row via a chain of neighboring (left, right, up, down) open sites. The system is said to percolate if there is a full site in the bottom row. In other words, a system percolates if we fill all open sites connected to the top row and that process fills some open site on the bottom row. (For the insulating/metallic materials example, the open sites correspond to metallic materials, so that a system that percolates has a metallic path from top to bottom, with full sites conducting. For the porous substance example, the open sites correspond to empty space through which water might flow, so that a system that percolates lets water fill open sites, flowing from top to bottom.)
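The definition of a full site translates directly into a flood fill from the top row. The following is a minimal Python sketch of that check (the article's own code was written in R and is not shown); the function name `percolates` and the list-of-lists grid representation are assumptions made for illustration.

```python
from collections import deque


def percolates(grid):
    """Return True if some open site in the bottom row is full.

    grid is a list of n lists of n booleans; True means the site is open.
    """
    n = len(grid)
    full = [[False] * n for _ in range(n)]
    queue = deque()
    # Every open site in the top row is full by definition.
    for col in range(n):
        if grid[0][col]:
            full[0][col] = True
            queue.append((0, col))
    # Breadth-first fill through left/right/up/down open neighbors.
    while queue:
        row, col = queue.popleft()
        if row == n - 1:
            return True  # a full site reached the bottom row
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = row + dr, col + dc
            if 0 <= r < n and 0 <= c < n and grid[r][c] and not full[r][c]:
                full[r][c] = True
                queue.append((r, c))
    return False
```

For example, `percolates([[True, False], [True, False]])` is true because the left column is open from top to bottom, while a grid whose only open sites lie on a diagonal does not percolate, since diagonal sites are not neighbors.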

The problem

In a famous scientific problem, researchers are interested in the following question: if sites are independently set to be open with probability p (and therefore blocked with probability 1 – p), what is the probability that the system percolates? When p equals 0, the system does not percolate; when p equals 1, the system percolates. The plots below show the site vacancy probability p versus the percolation probability for a 20-by-20 random grid (left) and a 100-by-100 random grid (right).


When n is sufficiently large, there is a threshold value p* such that when p < p* a random n-by-n grid almost never percolates, and when p > p*, a random n-by-n grid almost always percolates. No mathematical solution for determining the percolation threshold p* has yet been derived. We shall estimate p* with Monte Carlo simulation.

Monte Carlo simulation

To estimate the percolation threshold, let’s consider the following computational experiment:

  1. Initialize all sites to be blocked.
  2. Repeat the following until the system percolates:
       a. Choose a site uniformly at random among all blocked sites.
       b. Open the site.
  3. The fraction of sites that are opened when the system percolates provides an estimate of the percolation threshold.
  • By repeating this computational experiment T times and averaging the results, we obtain a more accurate estimate of the percolation threshold. Let x_t be the fraction of open sites in computational experiment t. The sample mean x̄ = (x_1 + x_2 + ... + x_T) / T provides an estimate of the percolation threshold; the sample standard deviation s, with s² = ((x_1 − x̄)² + ... + (x_T − x̄)²) / (T − 1), measures the sharpness of the threshold.
  • Assuming T is sufficiently large (say, at least 30, so that the central limit theorem applies), a 95% confidence interval for the percolation threshold is [x̄ − 1.96 s / √T, x̄ + 1.96 s / √T].
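The statistics above can be computed in a few lines. This is a small Python sketch, not the article's R code; the helper name `threshold_stats` is hypothetical, and the factor 1.96 assumes the usual normal approximation for a 95% interval.

```python
import math
import statistics


def threshold_stats(fractions):
    """Return (mean, stddev, (ci_low, ci_high)) for the list of x_t values."""
    T = len(fractions)
    mean = statistics.mean(fractions)
    # statistics.stdev is the sample standard deviation (divides by T - 1).
    stddev = statistics.stdev(fractions)
    # Assumption: normal approximation, so 1.96 standard errors on each side.
    half_width = 1.96 * stddev / math.sqrt(T)
    return mean, stddev, (mean - half_width, mean + half_width)
```

The interval width shrinks like 1/√T, so quadrupling the number of experiments roughly halves it.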


Union Find

This problem can be solved efficiently with the weighted quick-union variant of union-find. When a new site is opened, union joins it to each adjacent site that is already open. After each opening we check whether the top row is connected to the bottom row; as soon as it is, the system percolates and the fraction of open sites gives one estimate of the percolation threshold p*. The figure below explains the data structures and algorithms used.
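One way to sketch a single run of the experiment with weighted quick-union is shown below, in Python rather than the article's R. Several details are assumptions of this sketch and not taken from the article: the class and function names, the two virtual sites standing in for the top and bottom rows, and the path-halving step inside `find`.

```python
import random


class WeightedQuickUnion:
    def __init__(self, count):
        self.parent = list(range(count))
        self.size = [1] * count

    def find(self, i):
        while self.parent[i] != i:
            # Path halving (an optional extra, not required by weighted quick-union).
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Weighted rule: attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]

    def connected(self, a, b):
        return self.find(a) == self.find(b)


def percolation_trial(n, rng=random):
    """Open random blocked sites until the grid percolates; return the open fraction."""
    top, bottom = n * n, n * n + 1  # virtual sites (assumption of this sketch)
    uf = WeightedQuickUnion(n * n + 2)
    is_open = [False] * (n * n)
    order = list(range(n * n))
    rng.shuffle(order)  # a random order gives a uniform choice among blocked sites
    for opened, site in enumerate(order, start=1):
        is_open[site] = True
        row, col = divmod(site, n)
        if row == 0:
            uf.union(site, top)
        if row == n - 1:
            uf.union(site, bottom)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = row + dr, col + dc
            if 0 <= r < n and 0 <= c < n and is_open[r * n + c]:
                uf.union(site, r * n + c)
        if uf.connected(top, bottom):
            return opened / (n * n)
    return 1.0
```

Calling `percolation_trial(n)` T times yields the values x_1, ..., x_T used for the estimate. With the virtual sites, the percolation check is a single `connected` call instead of a comparison of every top-row site against every bottom-row site.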


The following figure shows the connected sites at the percolation threshold for different grid sizes.

The following animation shows how the system evolves until the percolation threshold is reached.


Finally, the following figures show how the average percolation threshold and the corresponding 95% confidence interval vary with n, using the Monte Carlo simulation with T=50. As can be seen, the larger n is, the narrower the confidence interval and the more certain we are about the value of the percolation threshold.



Erdos-Renyi Random Graph, the Giant Component and Connectivity

  • The following figure shows similar results for Erdos-Renyi (E-R) random graphs G(n,p). As the edge probability p increases past a threshold of around 1/n, a giant component appears in the graph. In addition, an E-R random graph becomes almost surely connected once p exceeds the threshold ln(n)/n.


  • The following animation shows how the giant component appears for an E-R random graph with n=100 as the probability p increases.


  • The following figures show the results of a Monte Carlo simulation (with T=100) for n=100, also plotted on a log scale: there is a narrow region around the connectivity threshold, below which G(n,p) is almost surely (a.s.) disconnected and above which G(n,p) is a.s. connected.
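The giant-component and connectivity experiments can be sketched with the same union-find idea. The Python code below is an illustration only (the article's R code is not shown); the function names and the choice to track component sizes with a small inline union-find are assumptions of this sketch.

```python
import random


def largest_component_fraction(n, p, rng=random):
    """Sample G(n, p) and return (size of the largest component) / n."""
    parent = list(range(n))
    size = [1] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # each possible edge is present independently
                ru, rv = find(u), find(v)
                if ru != rv:
                    if size[ru] < size[rv]:
                        ru, rv = rv, ru
                    parent[rv] = ru
                    size[ru] += size[rv]
    return max(size[find(i)] for i in range(n)) / n


def connected_probability(n, p, trials=100, rng=random):
    """Monte Carlo estimate of the probability that G(n, p) is connected."""
    hits = sum(largest_component_fraction(n, p, rng) == 1.0 for _ in range(trials))
    return hits / trials
```

Sweeping p over a range that includes 1/n and ln(n)/n and plotting both functions should reproduce the qualitative shape of the curves described above.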

More theory on random graphs can be found here.

Google Page Rank and the impact of the Second EigenValue of the Google Matrix on Power Iteration Convergence in R

(Sandipan Dey, 2 Jan 2017)

  • In this article, the basic concepts of the Google Page Rank Algorithm will be discussed along with a few examples.
  • The web can be represented as a directed graph, with nodes representing web pages and edges representing the links between them.
  • Probabilistic point of view: since the importance of a web page is measured by its popularity (how many incoming links it has), the importance of page i can be viewed as the probability that a random surfer, who opens a browser at an arbitrary page and starts following hyperlinks, visits page i.
  • The process can be modeled as a random walk on the web graph.
  • For a small but positive fraction of the time, the surfer abandons the current page, chooses a different page from the web arbitrarily and “teleports” there. The damping factor p is the probability of such a teleport.
  • The PageRank vector for a web graph with transition matrix A, and damping factor p, is the unique probabilistic eigenvector of the matrix M (as shown in the following figure), corresponding to the dominant eigenvalue 1.
  • By the Perron-Frobenius theorem, if the underlying web graph is strongly connected and aperiodic, then the power-iteration algorithm used to compute the page ranks is guaranteed to converge to a steady-state vector, which is precisely the vector with the page ranks of all the nodes.
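A minimal power-iteration sketch in Python (not the article's R code) is given below. It assumes the common form M = (1 − p)·A + p·B, where A is the column-stochastic link matrix and B has all entries equal to 1/n; the figure defining M is not reproduced here, so this form, the function name `pagerank`, the L1 stopping rule and the uniform treatment of pages without outgoing links are all assumptions of the sketch.

```python
def pagerank(out_links, p=0.2, tol=1e-5, max_iter=1000):
    """out_links[i] lists the pages that page i links to.

    Returns (ranks, iterations). p is the teleport probability.
    """
    n = len(out_links)
    ranks = [1.0 / n] * n  # start from the uniform distribution
    for iteration in range(1, max_iter + 1):
        new_ranks = [p / n] * n  # teleport share received by every page
        for page, targets in enumerate(out_links):
            if targets:
                share = (1.0 - p) * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                share = (1.0 - p) * ranks[page] / n
                for target in range(n):
                    new_ranks[target] += share
        error = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if error < tol:
            return ranks, iteration
    return ranks, max_iter
```

For the three-page graph `[[1, 2], [2], [0]]`, page 2 collects links from both other pages and ends up with the highest rank. Setting `p=0` corresponds to running the iteration without damping, in which case convergence relies on the graph itself being strongly connected and aperiodic.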


  • The next figure and animations show an example problem for computing page ranks that appeared in the final exam of the same edX (CornellX) course Networks, Crowds and Markets.


  • The following animation shows the pageranks computed using the power-iteration algorithm without a damping factor. As can be seen, the nodes with larger sizes have higher pageranks.


  • The next animation shows the pageranks computed using the power-iteration algorithm with a damping factor of 0.2.


  • The next animations show the pageranks computed for an Erdos-Renyi random graph with n=100 nodes, with probability p=1/n using the power-iteration algorithm and also directly as a dominant eigenvector of the matrix M. As can be seen, the nodes with ids in larger fonts have higher pageranks.


  • The next animation shows how fast the page ranks computed by the power-iteration algorithm converge to the true page ranks (computed directly as the normalized dominant eigenvector) for the same E-R random graph.


  • The convergence rate of the power-iteration algorithm is governed by the ratio λ2/λ1 of the two largest eigenvalues (in magnitude) of the matrix M: the error shrinks by roughly this factor in each iteration. Since λ1=1, the page rank computation converges faster if the second eigenvalue is smaller.
  • The next figures show the result of an experiment that starts with an Erdos-Renyi random graph with n=100 nodes and edge probability p=1/2, then repeatedly removes 10 randomly chosen edges, reruns the power-iteration algorithm, and records the number of iterations needed to converge with an error of at most 1e-5, along with the ratio λ2/λ1 each time.
  • As can be seen, the ratio of the dominant eigenvalues (or simply λ2, since λ1=1) decreases as edges are removed and, as expected, the number of iterations needed to converge (with the same accuracy threshold) decreases as well: a smaller second eigenvalue means faster convergence.
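The link between the second eigenvalue and the iteration count can be made explicit with a short derivation. It is a sketch under a simplifying assumption that the article does not state: M is diagonalizable, with eigenpairs (λ_i, v_i) ordered so that 1 = λ1 > |λ2| ≥ ... ≥ |λn|. The coefficients c_i and the tolerance ε are notation introduced here.

```latex
% Assumption: M is diagonalizable, 1 = \lambda_1 > |\lambda_2| \ge \dots \ge |\lambda_n|.
\begin{align*}
x^{(0)} &= c_1 v_1 + c_2 v_2 + \dots + c_n v_n, \\
x^{(k)} = M^k x^{(0)} &= c_1 \lambda_1^k v_1 + c_2 \lambda_2^k v_2 + \dots + c_n \lambda_n^k v_n, \\
\frac{\lVert x^{(k)} - c_1 \lambda_1^k v_1 \rVert}{\lambda_1^k}
  &= \Bigl\lVert \sum_{i \ge 2} c_i \Bigl( \frac{\lambda_i}{\lambda_1} \Bigr)^{k} v_i \Bigr\rVert
   = O\!\left( \Bigl\lvert \frac{\lambda_2}{\lambda_1} \Bigr\rvert^{k} \right), \\
k_\varepsilon &\approx \frac{\ln \varepsilon}{\ln \lvert \lambda_2 / \lambda_1 \rvert}.
\end{align*}
```

As a rough illustration with ε = 1e-5, and ignoring the constants hidden in the c_i, a ratio of 0.9 gives about 109 iterations while a ratio of 0.5 gives about 17.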