The following problem appeared as a project in the coursera course Algorithmic Thinking (by RICE university), a part of Fundamentals of Computing specialization. The following description of the problem is taken directly from the project description.
In this problem, the mathematical analysis to analyze a real-world problem: “How do scientific papers get cited” will be combined with simulation. The goal is to provide a more realistic simulation of how the concepts are actually used in practice.
Our task for this application is to analyze the structure of graphs generated by citation patterns from scientific papers. Each scientific paper cites many other papers, say 20-40, and sometimes (e.g., review papers) hundreds of other papers. But, let’s face it: It is often the case that the authors of a paper are superficially familiar with some (many?) of the papers they cite. So, the question is: Are the cited papers chosen randomly (from within the domain of the paper) or is there some “hidden pattern”?
Given that “paper i cites paper j” relationships are going to be examined, it makes sense to represent the citation data as a directed graph (a citation graph) in which the nodes correspond to papers, and there is an edge from node i to node j if the paper corresponding to node i cites the paper corresponding to node j. Since we’re interested in understanding how papers get cited, we shall analyze the in-degree distribution of a specific graph, and contrast it to those of graphs generated by two different random processes.
For this question, the task is to load a provided citation graph for 27,770 high energy physics theory papers. This graph has 352,807 edges. Then we need to compute the in-degree distribution for this citation graph. Once this distribution have been computed, we need to normalize the distribution and then compute a log/log plot of the points in this normalized distribution.
A very small part of the citation graph is shown below:
Loaded graph with 27770 nodes
The following Algorithm ER shows an algorithm for generating random graphs:
Consider the simple modification of the algorithm to generate random directed graphs: For every pair of nodes i and j, the modified algorithm considers the pair twice, once to add an edge (i,j) with probability p, and another to add an edge (j,i) with probability p.
Now, let’s consider the shape of the in-degree distribution for an ER graph and compare its shape to that of the physics citation graph. To determine the shape of the in-degree distribution for an ER graph, we can compute several examples of in-degree distributions (with simulation) or determine the shape mathematically.
The expected in-degree is the same for every node in an ER graph. Since for any given node, incoming edges to it, from each of the n-1 remaining nodes, are chosen by a tossing a coin with the probability of success p, the random variable in-degree follows a Binomial distribution with parameters (n-1,p) and the expected in-degree = (n-1)p. Each node is equivalent in this sense and will have the same expected in-degree.
As can be seen from the next figure (a log-log plot for the in-degree distribution of the ER graph for n = 1000, p = 0.1). It’s centred approximately around in-degree 100. All the nodes have in-degree approximately in between 66 and 132, with steady and sharp increase in probability from in-degree 66 to 100, having the peak probability at around in degree 100, then the probability falling steadily and sharply in between in-degrees 100 and 132.
Also, it can be seen that the shape of the in-degree distribution of the citation graph is different from that of the ER graph. We observe a steady decrease in probability when moving from lower to higher in-degrees in case of citation graph, with pick probabilities (high frequency of nodes) at small values of in degrees.But in case of ER graph there is first a steady increase and then a steady decrease happens, pick probability is attained (high frequency of nodes) at higher in-degree values.
Generating ER random graph with 1000 nodes
Conclusion: Citation graphs are not generated by a purely random process. If they were, we would expect the in-degree distribution for the citation graph to be similar to the in-degree distribution for the ER graphs. However, the distributions for ER graphs are binomial (bump-shaped) while the distribution for the citation graph is almost linear.
We next consider a different process for generating synthetic directed graphs. In this process, a random directed graph is generated iteratively, where in each iteration a new node is created, added to the graph, and connected to a subset of the existing nodes. This subset is chosen based on the in-degrees of the existing nodes. More formally, to generate a random directed graph in this process, the user must specify two parameters: n, which is the final number of nodes, and m (where), which is the number of existing nodes to which a new node is connected during each iteration. Notice that m is fixed throughout the procedure.
The algorithm starts by creating a complete directed graph on m nodes. Then, the algorithm grows the graph by adding n-m nodes, where each new node is connected to nodes randomly chosen from the set of existing nodes. As an existing node may be chosen more than once in an iteration, we eliminate duplicates (to avoid parallel edges). Hence, the new node may be connected to fewer than m existing nodes upon its addition.
Notice that this algorithm is more complex than the ER algorithm. As a result, reasoning about the properties of the graphs that it generates analytically is not as simple. When such a scenario arises, we can implement the algorithm, run it, produce graphs, and visually inspect their in-degree distributions. In general, this is a powerful technique: When analytical solutions to systems are very hard to derive, we can simulate the systems and generate data that can be analyzed to understand the properties of the systems.
For the construction of the DPA graph, let’s use the following parameters:
n = number of nodes (vertices) in the citation graph = 27770
m = integer part of the average out-degree of the citation graph.
Next the DPA algorithm is implemented and a DPA graph is computed using the above values of the parameters, and then the in-degree distribution for this DPA graph is plotted.
Generating DPA random graph with 27770 nodes
As can be seen from above, both the in-degree distribution for the DPA graph and the citation graph follow the similar patterns. Both of them show high probabilities when in-degree is small and a steady decrease in probability with higher in-degrees with similar negative slopes (although the DPA graph is more steeper and linear in nature, than the citation graph which shows slightly non-linear fall in slopes in the logscale as the in-degree increases, but the trend is somewhat similar). Both the graphs show that fewer nodes have higher in-degrees.
The phenomenon “rich gets richer” mimics the behaviour of the DPA process. When more nodes join the graph network, with higher probability they select among a few of nodes as the neighbours that are already rich (i.e., they already have high in-degrees), thereby increasing their in-degrees again. The majority of the nodes that have low in-degrees they tend not to get selected as neighbours with high probability in this process, hence their in-degrees do not increase. This process modeled by the algorithm DPA mimics the rich gets richer model, but is also used to explain the six degrees of separation phenomenon.
Same phenomenon “rich gets richer” can explain the citation graph network structure as well. Few of the papers that are heavily cited (rich, with high in-degrees) becomes an important paper and tend to get cited again with high probability, when a new paper is written it cites those few heavily cited papers again thereby increasing the in-degree of them (making them richer) and it cites the majority of papers with low citations (low importance) with low probability, so they are likely to remain not cited again.