The following problem appeared as a project in the *coursera course Algorithmic Thinking (by RICE university)*, a part of Fundamentals *of Computing* specialization. The following description of the problem is taken directly from the project description.

In this problem, the mathematical analysis to analyze a real-world problem: “How do scientific papers get cited” will be combined with simulation. The goal is to provide a more realistic simulation of how the concepts are actually used in practice.

### Citation graphs

Our task for this application is to analyze the structure of graphs generated by citation patterns from scientific papers. Each scientific paper cites many other papers, say *20-40*, and sometimes (e.g., review papers) hundreds of other papers. But, let’s face it: It is often the case that the authors of a paper are superficially familiar with some (many?) of the papers they cite. So, the question is: Are the cited papers chosen randomly (from within the domain of the paper) or is there some “hidden pattern”?

Given that “paper *i* cites paper *j*” relationships are going to be examined, it makes sense to represent the citation data as a directed graph (*a citation graph*) in which the nodes correspond to papers, and there is an edge from node i to node j if the paper corresponding to node i cites the paper corresponding to node j. Since we’re interested in understanding how papers get cited, we shall analyze the *in-degree distribution* of a specific graph, and contrast it to those of graphs generated by two different *random processes*.

For this question, the task is to load a provided citation graph for *27,770 high energy physics theory* papers. This graph has *352,807* edges. Then we need to compute the *in-degree distribution* for this *citation graph*. Once this distribution have been computed, we need to *normalize* the distribution and then compute a log/log plot of the points in this normalized distribution.

A very small part of the *citation graph* is shown below:

The following *Algorithm ER* shows an algorithm for generating random graphs:

Consider the simple modification of the algorithm to generate random directed graphs: For every pair of nodes *i* and *j*, the modified algorithm considers the pair twice, once to add an edge *(i,j)* with probability *p*, and another to add an edge *(j,i)* with probability *p*.

Now, let’s consider the shape of the *in-degree distribution* for an *ER graph* and compare its shape to that of the *physics citation graph*. To determine the shape of the *in-degree distribution* for an *ER graph*, we can compute several examples of in-degree distributions (with simulation) or determine the shape mathematically.

The *expected in-degree* is the same for *every node* in an *ER graph*. Since for any given node, incoming edges to it, from each of the *n-1* remaining nodes, are chosen by a tossing a coin with the probability of success *p*, the random variable *in-degree* follows a *Binomial distribution* with parameters *(n-1,p)* and the expected in-degree *= (n-1)p*. Each node is equivalent in this sense and will have the *same expected in-degree*.

As can be seen from the next figure (a log-log plot for the *in-degree distribution* of the *ER graph* for *n = 1000*, *p = 0.1*). It’s centred approximately around *in-degree 100*. All the nodes have *in-degree* approximately in between *66* and *132*, with steady and sharp increase in probability from in-degree 66 to 100, having the peak probability at around in degree 100, then the probability falling steadily and sharply in between in-degrees 100 and 132.

Also, it can be seen that the shape of the *in-degree distribution* of the *citation graph* is *different* from that of the *ER graph*. We observe a steady decrease in probability when moving from lower to higher in-degrees in case of citation graph, with pick probabilities (high frequency of nodes) at small values of in degrees.But in case of ER graph there is first a steady increase and then a steady decrease happens, pick probability is attained (high frequency of nodes) at higher in-degree values.