Some more Social Network Analysis with Python: Centrality, PageRank/HITS, Random Network Generation Models, Link Prediction

In this article, some more social network analysis concepts will be illustrated with a few problems. The problems appeared in the programming assignments of the Coursera course Applied Social Network Analysis in Python. The descriptions of the problems are taken from the assignments. The analysis is done using NetworkX.

The following theory is going to be used to solve the assignment problems.

 

f01.png

1. Measures of Centrality

In this assignment, we explore measures of centrality on the following two networks:

  1. A friendship network
  2. A political blog network.

 

  • First, let’s do some analysis with different centrality measures using the friendship network, which is a network of friendships at a university department. Each node corresponds to a person, and an edge indicates friendship. The following figure visualizes the network, with the size of each node proportional to its degree.

 

f33.png

Degree Centrality

The next figure shows the distribution of the degree centrality of the nodes in the friendship network graph.

f34.png
The following figure visualizes the network, with the size of the nodes again
proportional to the degree centrality of the nodes.
f35.png

Closeness Centrality

The next figure shows the distribution of the closeness centrality of the nodes in the friendship network graph.

f36.png

 

The following figure again visualizes the network, with the size of the nodes being
proportional to the closeness centrality of the nodes.

f37.png

 

 

Betweenness Centrality

The next figure shows the distribution of the betweenness centrality of the nodes in the friendship network graph.

 

f38.png

 

The following figure again visualizes the network, with the size of the nodes being
proportional to the betweenness centrality of the nodes.

 

f39.png
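The three centrality measures above can be computed directly with NetworkX. Below is a minimal sketch, assuming the friendship network is stored as a (hypothetical) edge-list file friendships.txt; the file name and the node-size scaling factor are assumptions:

```python
# A minimal sketch, assuming the friendship network is stored as a
# (hypothetical) edge-list file friendships.txt with one "u v" pair per line.
import matplotlib.pyplot as plt
import networkx as nx

G = nx.read_edgelist('friendships.txt')

deg = nx.degree_centrality(G)       # fraction of other nodes each node touches
clo = nx.closeness_centrality(G)    # inverse average shortest-path distance
bet = nx.betweenness_centrality(G)  # fraction of shortest paths through a node

# size the nodes by a chosen centrality measure, as in the figures above
sizes = [5000 * deg[v] for v in G.nodes()]
nx.draw_networkx(G, node_size=sizes, with_labels=False)
plt.show()
```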

 

  • Now, let’s consider another network: a directed network of political blogs, where nodes correspond to blogs and edges correspond to links between blogs.
  • This network will be used to compute the following:
    • PageRank of the nodes using random walks with a damping factor.
    • Authority and hub scores of the nodes using HITS.
  • The blog network data looks like the following (each row lists the blog URL, the source from which the blog was catalogued, and its political-leaning value):
    blog source value
    tsrightdominion.blogspot.com Blogarama 1
    rightrainbow.com Blogarama 1
    gregpalast.com LabeledManually 0
    younglibs.com Blogarama 0
    blotts.org/polilog Blogarama 0
    marylandpolitics.blogspot.com BlogCatalog 1
    blogitics.com eTalkingHead 0
    thesakeofargument.com Blogarama 1
    joebrent.blogspot.com Blogarama 0
    thesiliconmind.blogspot.com Blogarama 0
    40ozblog.blogspot.com Blogarama,BlogCatalog 0
    progressivetrail.org/blog.shtml Blogarama 0
    randomjottings.net eTalkingHead 1
    sonsoftherepublic.com Blogarama 1
    rightvoices.com CampaignLine 1
    84rules.blog-city.com eTalkingHead 1
    blogs.salon.com/0002874 LeftyDirectory 0
    alvintostig.typepad.com eTalkingHead 0
    notgeniuses.com CampaignLine 0
    democratreport.blogspot.com BlogCatalog 0
  • First, let’s visualize the network; the next figure shows the visualization. The network has nodes (blogs) from 47 different unique sources, and each node is colored by its source. The gray lines denote the edges (links) between the nodes (blogs).

f40.png

  • The next figure shows the same network graph, but without the node labels (blog urls).

f41.png

PageRank and HITS

  • Now, let’s apply the Scaled PageRank algorithm to this network, with damping factor 0.85. The following figure visualizes the graph with the node size proportional to the PageRank of the node.

f42.png
  • The next animations show how the PageRank of the nodes in the network changes over the first 25 iterations of the power-iteration algorithm.

 

anipr
anipr1

  • The top 5 nodes with the highest PageRank values after 25 iterations of the power-iteration PageRank algorithm are shown below, along with their rank scores.
    • (u'dailykos.com', 0.020416972184975967)
    • (u'atrios.blogspot.com', 0.019232277371918939)
    • (u'instapundit.com', 0.015715941717833914)
    • (u'talkingpointsmemo.com', 0.0152621016868163)
    • (u'washingtonmonthly.com', 0.013848910355057181)

 

  • The top 10 nodes with the highest PageRank values after the convergence of the PageRank algorithm are shown below, along with their rank scores.
    • (u'dailykos.com', 0.01790144388519838)
    • (u'atrios.blogspot.com', 0.015178631721614688)
    • (u'instapundit.com', 0.01262709066072975)
    • (u'blogsforbush.com', 0.012508582138399093)
    • (u'talkingpointsmemo.com', 0.012393033204751035)
    • (u'michellemalkin.com', 0.010918873519905312)
    • (u'drudgereport.com', 0.010712415519898428)
    • (u'washingtonmonthly.com', 0.010512012551452737)
    • (u'powerlineblog.com', 0.008939228332543162)
    • (u'andrewsullivan.com', 0.00860773822610682)

 

  • The following figure shows the distribution of the PageRanks of the nodes after the convergence of the algorithm.

f43.png
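Below is a minimal sketch of the PageRank computation with NetworkX, assuming the directed blog graph has already been loaded into G (the loading step is omitted here):

```python
# A minimal sketch, assuming the directed blog graph is already loaded in G.
import networkx as nx

pr = nx.pagerank(G, alpha=0.85)  # scaled PageRank, damping factor 0.85

# top 5 blogs by PageRank score
top5 = sorted(pr.items(), key=lambda kv: kv[1], reverse=True)[:5]
for blog, score in top5:
    print(blog, score)
```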

 

  • Next, let’s apply the HITS algorithm to the network to find the hub and authority scores of the nodes.
  • The following couple of figures visualize the network graph, with the node size proportional to the hub score and the authority score of each node respectively, once the HITS algorithm converges.

f44 f45

  • The next couple of figures show the distribution of the hub scores and the authority scores of the nodes once HITS converges.

 

f46 f47
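The hub and authority scores come from nx.hits; a minimal sketch, again assuming the blog graph G is already loaded:

```python
# A minimal sketch, assuming the directed blog graph is already loaded in G.
import networkx as nx

hubs, authorities = nx.hits(G)  # two dicts keyed by node, each normalized to sum to 1

top_hubs = sorted(hubs, key=hubs.get, reverse=True)[:5]
top_auth = sorted(authorities, key=authorities.get, reverse=True)[:5]
print(top_hubs, top_auth)
```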

 

 

2. Random Graph Identification

 

Given a list containing 5 NetworkX graphs, each generated by one of three possible algorithms:

  • Preferential Attachment ('PA')
  • Small World with low probability of rewiring ('SW_L')
  • Small World with high probability of rewiring ('SW_H')

Analyze each of the 5 graphs and determine which of the three algorithms generated it.

The following figures visualize all the graphs along with their degree distributions. Since the Preferential Attachment model generates graphs whose node degrees follow a power-law distribution (the rich get richer), the graphs with this heavy-tailed pattern in their degree distributions are most likely generated by this model.

On the other hand, the Small World model generates graphs whose node degrees do not follow a power-law distribution; the degrees are instead concentrated around the average degree. If this pattern is seen, we can identify the network as having been generated by this model.
f23 f24 f25 f26 f27 f28 f29 f30 f31 f32
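For reference, the three generative models can be produced with NetworkX as below; this is a sketch with illustrative parameters, not the (unknown) parameters used to create the assignment graphs:

```python
# A sketch with illustrative parameters (n, m, k, p are assumptions).
import networkx as nx

G_pa   = nx.barabasi_albert_graph(n=1000, m=4)         # Preferential Attachment ('PA')
G_sw_l = nx.watts_strogatz_graph(n=1000, k=6, p=0.05)  # Small World, low rewiring ('SW_L')
G_sw_h = nx.watts_strogatz_graph(n=1000, k=6, p=0.50)  # Small World, high rewiring ('SW_H')

def degree_distribution(G):
    """Degree -> fraction of nodes, for eyeballing power-law vs. peaked shapes."""
    degrees = [d for _, d in G.degree()]
    return {d: degrees.count(d) / len(degrees) for d in sorted(set(degrees))}
```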

 

3. Prediction using Machine Learning models with the graph-centrality and local clustering features

 

For this assignment we need to work with a company’s email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes Department and ManagementSalary.

Department indicates the department in the company to which the person belongs, and ManagementSalary indicates whether that person is receiving a management position salary. The email-network graph has

  • Number of nodes: 1005
  • Number of edges: 16706
  • Average degree: 33.2458

The following figure visualizes the email network graph.

f48.png

Salary Prediction

Using network G, identify the people in the network with missing values for the node attribute ManagementSalary and predict whether or not these individuals are receiving a management position salary.

To accomplish this, we shall need to

  • Create a matrix of node features,
  • Train a (sklearn) classifier on nodes that have ManagementSalary data, and
  • Predict the probability of receiving a management salary for nodes where ManagementSalary is missing.

The predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

A model needs to achieve an AUC of 0.75 on the test dataset.

Using the trained classifier, return a series with the data being the probability of receiving a management salary, and the index being the node id (from the test dataset).

The next table shows the first few rows of the dataset with the degree and clustering features computed. A few ManagementSalary values are missing (NaN); the corresponding rows form the test dataset, for which we need to predict the missing ManagementSalary values. The rest form the training dataset.

Now, a few classifiers will be trained on the training dataset; they are:

  • RandomForest
  • SVM
  • GradientBoosting

with 3 input features for each node:

  • Department
  • Clustering (local clustering coefficient for the node)
  • Degree

in order to predict the output variable: the indicator of whether the person receives a management salary. (A minimal training sketch is given after the next table.)

 

 Index Department ManagementSalary clustering degree
0 1 0.0 0.276423 44
1 1 NaN 0.265306 52
2 21 NaN 0.297803 95
3 21 1.0 0.384910 71
4 21 1.0 0.318691 96
5 25 NaN 0.107002 171
6 25 1.0 0.155183 115
7 14 0.0 0.287785 72
8 14 NaN 0.447059 37
9 14 0.0 0.425320 40
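Here is a minimal sketch of how such a feature matrix and the classifiers can be set up, assuming the email graph G is loaded with the node attributes described above; hyper-parameters are left at sklearn defaults and feature scaling is omitted:

```python
# A minimal sketch, assuming the email graph G carries the node attributes
# 'Department' and 'ManagementSalary'; sklearn defaults are used throughout.
import pandas as pd
import networkx as nx
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.DataFrame(index=G.nodes())
df['Department'] = pd.Series(nx.get_node_attributes(G, 'Department'))
df['ManagementSalary'] = pd.Series(nx.get_node_attributes(G, 'ManagementSalary'))
df['clustering'] = pd.Series(nx.clustering(G))        # local clustering coefficient
df['degree'] = pd.Series(dict(G.degree()))

features = ['Department', 'clustering', 'degree']
test = df[df['ManagementSalary'].isnull()]            # rows with missing labels to predict
known = df[df['ManagementSalary'].notnull()]

# typical 70-30 train / validation split for model selection
X_tr, X_val, y_tr, y_val = train_test_split(
    known[features], known['ManagementSalary'], test_size=0.3, random_state=0)

for clf in (RandomForestClassifier(), SVC(probability=True), GradientBoostingClassifier()):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))

# final predictions: probability of receiving a management salary, indexed by node id
best = GradientBoostingClassifier().fit(known[features], known['ManagementSalary'])
predictions = pd.Series(best.predict_proba(test[features])[:, 1], index=test.index)
```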

 

Typical 70-30 validation is used for model selection. The next 3 tables show the first few rows of the train, validation and the test datasets respectively.

 

 Index Department clustering degree ManagementSalary
421 14 0.227755 52 0
972 15 0.000000 2 0
322 17 0.578462 28 0
431 37 0.426877 25 0
506 14 0.282514 63 0
634 21 0.000000 3 0
130 0 0.342857 37 0
140 17 0.394062 41 0
338 13 0.350820 63 0
117 6 0.274510 20 0
114 10 0.279137 142 1
869 7 0.733333 6 0
593 5 0.368177 42 0
925 10 0.794118 19 1
566 14 0.450216 22 0
572 4 0.391304 26 0
16 34 0.284709 74 0
58 21 0.294256 126 1
57 21 0.415385 67 1
207 4 0.505291 30 0

 

 Index Department clustering degree ManagementSalary
952 15 0.533333 8 0
859 32 0.388106 74 1
490 6 0.451220 43 0
98 16 0.525692 25 0
273 17 0.502463 31 0
784 28 0.000000 3 0
750 20 0.000000 1 0
328 8 0.432749 21 0
411 28 0.208364 106 1
908 5 0.566154 28 0
679 29 0.424837 20 0
905 1 0.821429 10 0
453 6 0.427419 34 1
956 14 0.485714 15 0
816 13 0.476190 23 0
127 6 0.341270 28 0
699 14 0.452899 26 0
711 21 0.000000 2 0
123 13 0.365419 36 0
243 19 0.334118 53 0
 Index Department clustering degree
1 1 0.265306 52
2 21 0.297803 95
5 25 0.107002 171
8 14 0.447059 37
14 4 0.215784 80
18 1 0.301188 56
27 11 0.368852 63
30 11 0.402797 68
31 11 0.412234 50
34 11 0.637931 31

The following table shows the first few predicted class probabilities by the RandomForest classifier on the test dataset (column 0: probability of not receiving a management salary; column 1: probability of receiving one).

 Index 0 1
0 1.0 0.0
1 0.2 0.8
2 0.0 1.0
3 1.0 0.0
4 0.5 0.5
5 1.0 0.0
6 0.7 0.3
7 0.7 0.3
8 0.3 0.7
9 0.7 0.3
10 0.9 0.1
11 0.8 0.2
12 1.0 0.0
13 0.6 0.4
14 0.7 0.3
15 0.5 0.5
16 0.0 1.0
17 0.2 0.8
18 1.0 0.0
19 1.0 0.0

 

The next figure shows the ROC curve to compare the performances (AUC) of the classifiers on the validation dataset.

As can be seen, the GradientBoosting classifier performs the best (has the highest AUC on the validation dataset).

 

f49.png

 

The following figure shows the Decision Surface for the Salary Prediction learnt by the RandomForest Classifier.

f50.png

 

4. New Connections Prediction (Link Prediction with ML models)

For the last part of this assignment, we shall predict future connections between employees of the network. The future connections information has been loaded into the variable future_connections. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the Future Connection column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection. The next table shows the first few rows of the dataset.

 Index Future Connection
(6, 840) 0.0
(4, 197) 0.0
(620, 979) 0.0
(519, 872) 0.0
(382, 423) 0.0
(97, 226) 1.0
(349, 905) 0.0
(429, 860) 0.0
(309, 989) 0.0
(468, 880) 0.0

 

Using network G and future_connections, identify the edges in future_connections with missing values and predict whether or not these edges will have a future connection.

To accomplish this, we shall need to

  1. Create a matrix of features for the edges found in future_connections
  2. Train a (sklearn) classifier on those edges in future_connections that have Future Connection data
  3. Predict a probability of the edge being a future connection for those edges in future_connections where Future Connection is missing.

The predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric for this assignment is again the Area Under the ROC Curve (AUC).

Using the trained classifier, return a series with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

Now, a couple of classifiers will be trained on the training dataset; they are:

  • RandomForest
  • GradientBoosting

with 2 input features for each edge:

  • Preferential attachment
  • Common Neighbors

in order to predict the output variable Future Connection.

The next table shows the first few rows of the dataset with the preferential attachment and Common Neighbors features computed. (A sketch of this feature computation is given after the table.)

 Index Future Connection preferential attachment Common Neighbors
(6, 840) 0.0 2070 9
(4, 197) 0.0 3552 2
(620, 979) 0.0 28 0
(519, 872) 0.0 299 2
(382, 423) 0.0 205 0
(97, 226) 1.0 1575 4
(349, 905) 0.0 240 0
(429, 860) 0.0 816 0
(309, 989) 0.0 184 0
(468, 880) 0.0 672 1
(228, 715) 0.0 110 0
(397, 488) 0.0 418 0
(253, 570) 0.0 882 0
(435, 791) 0.0 96 1
(711, 737) 0.0 6 0
(263, 884) 0.0 192 0
(342, 473) 1.0 8140 12
(523, 941) 0.0 19 0
(157, 189) 0.0 6004 5
(542, 757) 0.0 90 0
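A minimal sketch of the edge-feature computation, assuming future_connections is the DataFrame above (indexed by node pairs) and G is the undirected email graph:

```python
# A minimal sketch, assuming future_connections is indexed by (u, v) node pairs
# and G is the undirected email graph.
import networkx as nx

edges = list(future_connections.index)

# preferential attachment score: degree(u) * degree(v)
future_connections['preferential attachment'] = [
    p for _, _, p in nx.preferential_attachment(G, edges)]

# number of common neighbors of u and v
future_connections['Common Neighbors'] = [
    len(list(nx.common_neighbors(G, u, v))) for u, v in edges]
```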

 

Again, typical 70-30 validation is used for model selection. The next 3 tables show the first few rows of the train, validation and the test datasets respectively.

 

 Index preferential attachment Common Neighbors Future Connection
(7, 793) 360 0 0
(171, 311) 1620 1 0
(548, 651) 684 2 0
(364, 988) 18 0 0
(217, 981) 648 0 0
(73, 398) 124 0 0
(284, 837) 132 1 0
(748, 771) 272 4 0
(79, 838) 88 0 0
(207, 716) 90 1 1
(270, 928) 15 0 0
(201, 762) 57 0 0
(593, 620) 168 1 0
(494, 533) 18212 40 1
(70, 995) 18 0 0
(867, 997) 12 0 0
(437, 752) 205 0 0
(442, 650) 28 0 0
(341, 900) 351 0 0
(471, 684) 28 0 0

 

 Index preferential attachment Common Neighbors Future Connection
(225, 382) 150 0 0
(219, 444) 594 0 0
(911, 999) 3 0 0
(363, 668) 57 0 0
(161, 612) 2408 4 0
(98, 519) 575 0 0
(59, 623) 636 0 0
(373, 408) 2576 6 0
(948, 981) 27 0 0
(690, 759) 44 0 0
(54, 618) 765 0 0
(149, 865) 330 0 0
(562, 1001) 320 1 1
(84, 195) 4884 10 1
(421, 766) 260 0 0
(599, 632) 70 0 0
(814, 893) 10 0 0
(386, 704) 24 0 0
(294, 709) 75 0 0
(164, 840) 864 3 0

 

 Index preferential attachment Common Neighbors
(107, 348) 884 2
(542, 751) 126 0
(20, 426) 4440 10
(50, 989) 68 0
(942, 986) 6 0
(324, 857) 76 0
(13, 710) 3600 6
(19, 271) 5040 6
(319, 878) 48 0
(659, 707) 120 0

The next figure shows the ROC curve to compare the performances (AUC) of the classifiers on the validation dataset.

As can be seen, the GradientBoosting classifier again performs the best (has the highest AUC on the validation dataset).

f51.png

 

The following figure shows the Decision Boundary for the Link Prediction learnt by the RandomForest Classifier.

f52.png


Some Social Network Analysis with Python

The following problems appeared in the programming assignments of the Coursera course Applied Social Network Analysis in Python. The descriptions of the problems are taken from the assignments. The analysis is done using NetworkX.

The following theory is going to be used to solve the assignment problems.

f0.png

1. Creating and Manipulating Graphs

  • Eight employees at a small company were asked to choose 3 movies that they would most enjoy watching at the upcoming company movie night. These choices are stored in a text file Employee_Movie_Choices; the following table shows the content of the file.
     Index Employee Movie
    0 Andy Anaconda
    1 Andy Mean Girls
    2 Andy The Matrix
    3 Claude Anaconda
    4 Claude Monty Python and the Holy Grail
    5 Claude Snakes on a Plane
    6 Frida The Matrix
    7 Frida The Shawshank Redemption
    8 Frida The Social Network
    9 Georgia Anaconda
    10 Georgia Monty Python and the Holy Grail
    11 Georgia Snakes on a Plane
    12 Joan Forrest Gump
    13 Joan Kung Fu Panda
    14 Joan Mean Girls
    15 Lee Forrest Gump
    16 Lee Kung Fu Panda
    17 Lee Mean Girls
    18 Pablo The Dark Knight
    19 Pablo The Matrix
    20 Pablo The Shawshank Redemption
    21 Vincent The Godfather
    22 Vincent The Shawshank Redemption
    23 Vincent The Social Network
  • A second text file, Employee_Relationships, has data on the relationships between different coworkers. The relationship score ranges from -100 (enemies) to +100 (best friends). A value of zero means the two employees haven’t interacted or are indifferent. The next table shows the content of this file.

 

 Index Employee1 Employee2 Score
0 Andy Claude 0
1 Andy Frida 20
2 Andy Georgia -10
3 Andy Joan 30
4 Andy Lee -10
5 Andy Pablo -10
6 Andy Vincent 20
7 Claude Frida 0
8 Claude Georgia 90
9 Claude Joan 0
10 Claude Lee 0
11 Claude Pablo 10
12 Claude Vincent 0
13 Frida Georgia 0
14 Frida Joan 0
15 Frida Lee 0
16 Frida Pablo 50
17 Frida Vincent 60
18 Georgia Joan 0
19 Georgia Lee 10
20 Georgia Pablo 0
21 Georgia Vincent 0
22 Joan Lee 70
23 Joan Pablo 0
24 Joan Vincent 10
25 Lee Pablo 0
26 Lee Vincent 0
27 Pablo Vincent -20
0 Claude Andy 0
1 Frida Andy 20
2 Georgia Andy -10
3 Joan Andy 30
4 Lee Andy -10
5 Pablo Andy -10
6 Vincent Andy 20
7 Frida Claude 0
8 Georgia Claude 90
9 Joan Claude 0
10 Lee Claude 0
11 Pablo Claude 10
12 Vincent Claude 0
13 Georgia Frida 0
14 Joan Frida 0
15 Lee Frida 0
16 Pablo Frida 50
17 Vincent Frida 60
18 Joan Georgia 0
19 Lee Georgia 10
20 Pablo Georgia 0
21 Vincent Georgia 0
22 Lee Joan 70
23 Pablo Joan 0
24 Vincent Joan 10
25 Pablo Lee 0
26 Vincent Lee 0
27 Vincent Pablo -20
  • First, let’s load the bipartite graph from the Employee_Movie_Choices file; the following figure visualizes the graph. The blue nodes represent the employees and the red nodes represent the movies.

f1.png

 

  • The following figure shows yet another visualization of the same graph, this time with a different layout.

f2.png

  • Now, let’s find a weighted projection of the graph, which tells us how many movies different pairs of employees have in common. We need to compute an L-bipartite projection for this; the projected graph is shown in the next figure.

f3.png

  • The following figure shows the same projected graph with another layout and with weights. For example, Lee and Joan have weight 3 on their connecting edge, since they share 3 movies in common as their movie choices. (A minimal sketch of the projection code is given below.)

f4.png
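A minimal sketch of loading the bipartite graph and computing the weighted L-bipartite projection; the tab-separated layout and the column names Employee / Movie are assumptions about the file:

```python
# A minimal sketch; the file layout and column names are assumptions.
import pandas as pd
import networkx as nx
from networkx.algorithms import bipartite

df = pd.read_csv('Employee_Movie_Choices.txt', sep='\t')
B = nx.from_pandas_edgelist(df, 'Employee', 'Movie')   # bipartite employee-movie graph

employees = set(df['Employee'])
P = bipartite.weighted_projected_graph(B, employees)   # edge weight = #shared movies
print(P['Lee']['Joan']['weight'])                      # 3, as in the figure above
```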

 

  • Next, let’s load the graph from the Employee_Relationships file; the following figure visualizes the graph. The nodes represent the employees, and the edge colors and widths (weights) represent the relationships. The green edges denote friendship, the red edges enmity, and the blue edges neutral relations. Also, the thicker an edge, the stronger the positive or negative relation.

f5.png

  • Suppose we’d like to find out whether people who have a high relationship score also like the same types of movies.
  • Let’s find the Pearson correlation between employee relationship scores and the number of movies they have in common. If two employees have no movies in common, it should be treated as a 0, not a missing value, and should be included in the correlation calculation.
  • The following data frame is created from the graphs and will be used to compute the correlation.
     Index Employee1 Employee2 Relationship Score Weight in Projected Graph
    0 Andy Claude 0 1.0
    1 Andy Frida 20 1.0
    2 Andy Georgia -10 1.0
    3 Andy Joan 30 1.0
    4 Andy Lee -10 1.0
    5 Andy Pablo -10 1.0
    6 Andy Vincent 20 0.0
    7 Claude Frida 0 0.0
    8 Claude Georgia 90 3.0
    9 Claude Joan 0 0.0
    10 Claude Lee 0 0.0
    11 Claude Pablo 10 0.0
    12 Claude Vincent 0 0.0
    13 Frida Georgia 0 0.0
    14 Frida Joan 0 0.0
    15 Frida Lee 0 0.0
    16 Frida Pablo 50 2.0
    17 Frida Vincent 60 2.0
    18 Georgia Joan 0 0.0
    19 Georgia Lee 10 0.0
    20 Georgia Pablo 0 0.0
    21 Georgia Vincent 0 0.0
    22 Joan Lee 70 3.0
    23 Joan Pablo 0 0.0
    24 Joan Vincent 10 0.0
    25 Lee Pablo 0 0.0
    26 Lee Vincent 0 0.0
    27 Pablo Vincent -20 1.0
    28 Claude Andy 0 1.0
    29 Frida Andy 20 1.0
    30 Georgia Andy -10 1.0
    31 Joan Andy 30 1.0
    32 Lee Andy -10 1.0
    33 Pablo Andy -10 1.0
    34 Vincent Andy 20 0.0
    35 Frida Claude 0 0.0
    36 Georgia Claude 90 3.0
    37 Joan Claude 0 0.0
    38 Lee Claude 0 0.0
    39 Pablo Claude 10 0.0
    40 Vincent Claude 0 0.0
    41 Georgia Frida 0 0.0
    42 Joan Frida 0 0.0
    43 Lee Frida 0 0.0
    44 Pablo Frida 50 2.0
    45 Vincent Frida 60 2.0
    46 Joan Georgia 0 0.0
    47 Lee Georgia 10 0.0
    48 Pablo Georgia 0 0.0
    49 Vincent Georgia 0 0.0
    50 Lee Joan 70 3.0
    51 Pablo Joan 0 0.0
    52 Vincent Joan 10 0.0
    53 Pablo Lee 0 0.0
    54 Vincent Lee 0 0.0
    55 Vincent Pablo -20 1.0
  • The correlation score is 0.788396222173, which is a pretty strong positive correlation. The following figure shows the association between the two variables with a fitted regression line. (A one-line sketch of the correlation computation is given after the figure.)

    f6.png 
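Given the merged data frame df above, the correlation itself is a one-liner (column names as shown in the table):

```python
# Pearson correlation between relationship scores and shared-movie counts,
# assuming df holds the merged table shown above.
corr = df['Relationship Score'].corr(df['Weight in Projected Graph'])
print(corr)  # ~0.788
```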

 

2. Network Connectivity

 

  • In this assignment we shall go through the process of importing and analyzing an internal email communication network between employees of a mid-sized manufacturing company.

    Each node represents an employee and each directed edge between two nodes represents an individual email. The left node represents the sender and the right node represents the recipient, as shown in the next figure.

    f7.png

  • First, let’s load the email network as a directed multigraph and visualize it in the next figure. The graph contains 167 nodes (employees) and 82927 edges (emails sent). The size of a node in the figure is proportional to the out-degree of the node.

f8.png

 

  • The next couple of figures visualize the same network with different layouts.

f9.png

    f10.png

     

  • Assume that information in this company can only be exchanged through email. When an employee sends an email to another employee, a communication channel has been created, allowing the sender to provide information to the receiver, but not vice versa. Based on the emails sent in the data, is it possible for information to go from every employee to every other employee?

    This will only be possible if the graph is strongly connected, but it’s not. The following figure shows the 42 strongly-connected components (SCCs) of the graph. Each SCC is shown in a distinct color.

     
    f11.png

    As can be seen from the following histogram, only one SCC contains 126 nodes; each of the remaining 41 SCCs contains exactly one node.

    f12.png
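The connectivity questions above map directly onto NetworkX calls; a minimal sketch, assuming the email network is loaded as a directed graph G:

```python
# A minimal sketch, assuming the email network is loaded as a directed graph G.
import networkx as nx

print(nx.is_strongly_connected(G))   # False: information cannot flow both ways everywhere
print(nx.is_weakly_connected(G))     # True: ignoring direction, the graph is one piece

sccs = list(nx.strongly_connected_components(G))
largest = max(sccs, key=len)
print(len(sccs), len(largest))       # 42 components; the largest has 126 nodes
```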

  • Now assume that a communication channel established by an email allows information to be exchanged both ways. Based on the emails sent in the data, is it possible for information to go from every employee to every other employee? This is possible, since the graph is weakly connected.
  • The following figure shows the subgraph induced by the largest SCC, with 126 nodes.

f13.png

  • The next figure shows the distribution of the (shortest-path) distances between the node pairs in the largest SCC.

f14.png

    As can be seen from above, inside the largest SCC all the nodes are reachable from one another within at most 3 hops; the average distance between node pairs belonging to the SCC is 1.6461587301587302.

  • Diameter of the largest SCC: The largest possible distance between two employees, which is 3.
  • Find the set of nodes in the subgraph with eccentricity equal to the diameter: these are exactly the nodes on the periphery. As can be seen from the next figure (the largest component is shown along with a few other nodes from some other components in the graph; not all the nodes and edges in the graph are shown, to avoid over-crowding), there are exactly 3 nodes on the periphery of the SCC, namely nodes 97, 129 and 134. Each of the following 3 shortest paths shown in the next figure
    • 97->14->113->133
    • 129->1->38->132 and
    • 134->1->38->132

    has length equal to the diameter of this SCC.

    f15.png

 

  • Center of the largest SCC: The set of node(s) in the subgraph induced by the largest SCC with eccentricity equal to the radius (which is 1). There is exactly one such node (namely, node 38); as shown in the next figure, all the nodes belonging to the largest SCC are reachable within distance 1 from the center node 38 (again, the largest component is shown along with a few other nodes from some other components in the graph; not all the nodes and edges are shown, to avoid over-crowding).

f16.png

  • The following figure shows the distribution of eccentricity in the largest SCC.

f17.png
  • Which node in the largest SCC has the most shortest paths to other nodes whose distance equals the diameter? How many of these paths are there? As can be seen from the following figure, the desired node is 97, and there are 63 such shortest paths with length equal to the diameter of the SCC; 5 of these paths (each of length 3) are shown in the next figure. They are:
    • 97->14->113->133
    • 97->14->113->130
    • 97->14->113->136
    • 97->14->45->131
    • 97->14->45->132
f18.png
  • Suppose we want to prevent communication from flowing to node 97 (the node that has the most shortest paths to other nodes whose distance equals the diameter) from any node in the center of the largest component. What is the smallest number of nodes we would need to remove from the graph (we’re not allowed to remove node 97 or the center node(s))?

    Clearly, the minimum number of nodes required to be removed equals the size of the minimum node cut from the source node (center node 38) to the target node (node 97), shown in red in the next figure. The size of the min-cut is 5, and hence 5 nodes (shown in pale-goldenrod color) need to be removed; the corresponding node numbers are 14, 32, 37, 45 and 46. As in the earlier figures, not all the nodes and edges in the email-net graph are shown, to avoid over-crowding. (A sketch of the min-cut computation is given after the figures.)

    emails_comp5.png

    The min-cut is separately shown in the following figure from source node 38 to target node 97.

    emails_comp4
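A minimal sketch of the min-cut computation; nx.minimum_node_cut expects a simple graph, so the multigraph is first collapsed into a DiGraph (node ids 38 and 97 are from the figures above):

```python
# A minimal sketch; node ids 38 (center) and 97 (target) are from the figures above.
import networkx as nx

D = nx.DiGraph(G)                        # collapse parallel edges of the multigraph
cut = nx.minimum_node_cut(D, s=38, t=97)
print(len(cut), sorted(cut))             # 5 nodes: 14, 32, 37, 45, 46
```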

  • Construct an undirected graph from the subgraph induced by the largest component of the email-net directed multigraph.

    The next figure shows the undirected graph constructed. As before, the node size is proportional to the degree of the node.

f19.png

  • What is the transitivity and average clustering coefficient of the undirected graph? The transitivity and average clustering coefficient of the undirected graph are 0.570111160700385 and 0.6975272437231418, respectively.

    The following figure shows the distribution of the local clustering coefficients.

    f20.png

    The following figure shows the undirected graph, this time the node size being proportional to the local clustering coefficient of the node.

    f21.png

    The next figure shows the degree distribution of the undirected graph.

    f22.png

    Since there are more nodes with lower degrees than with higher degrees, and transitivity weights the nodes with higher degree more, the undirected graph has a lower transitivity and a higher average clustering coefficient.
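The two summary statistics come straight from NetworkX; a minimal sketch, assuming U is the undirected graph built from the largest component (e.g., U = nx.Graph(G.subgraph(largest)), reusing the connectivity sketch above):

```python
# A minimal sketch, assuming U is the undirected copy of the largest-SCC subgraph.
import networkx as nx

print(nx.transitivity(U))        # ~0.5701: ratio of 3 * triangles to open triads
print(nx.average_clustering(U))  # ~0.6975: simple mean of local clustering coefficients
```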

Implementing kd-tree for fast range-search, nearest-neighbor search and k-nearest-neighbor search algorithms in 2D (with applications in simulating the flocking boids: modeling the motion of a flock of birds, and in learning a kNN classifier: a supervised ML model for binary classification) in Java and Python

The following problem appeared as an assignment in the Coursera course Algorithms-I by Prof. Robert Sedgewick of Princeton University a few years back (and also in the course COS 226 offered at Princeton). The problem definition and the description are taken from the course website and lectures. The original assignment was to be done in Java; in this article both the Java and a corresponding Python implementation will be described.

  • Use a 2d-tree to support
    • efficient range search (find all of the points contained in a query rectangle)
    • nearest-neighbor search (find a closest point to a query point).

    2d-trees have numerous applications, ranging from classifying astronomical objects to computer animation to speeding up neural networks to mining data to image retrieval. The figure below describes the problem:

    kdtree-ops.png

    2d-tree implementation: A 2d-tree is a generalization of a BST to two-dimensional keys. The idea is to build a BST with points in the nodes, using the x- and y-coordinates of the points as keys in strictly alternating sequence, starting with the x-coordinates, as shown in the next figure.

    kdtree3

    • Search and insert. The algorithms for search and insert are similar to those for BSTs, but at the root we use the x-coordinate (if the point to be inserted has a smaller x-coordinate than the point at the root, go left; otherwise go right); then at the next level, we use the y-coordinate (if the point to be inserted has a smaller y-coordinate than the point in the node, go left; otherwise go right); then at the next level the x-coordinate, and so forth.
    • The prime advantage of a 2d-tree over a BST is that it supports efficient implementation of range search and nearest-neighbor search. Each node corresponds to an axis-aligned rectangle, which encloses all of the points in its subtree. The root corresponds to the entire plane [(−∞, −∞), (+∞, +∞ )]; the left and right children of the root correspond to the two rectangles split by the x-coordinate of the point at the root; and so forth.
      • Range search: To find all points contained in a given query rectangle, start at the root and recursively search for points in both subtrees using the following pruning rule: if the query rectangle does not intersect the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a subtree only if it might contain a point contained in the query rectangle.
      • Nearest-neighbor search: To find a closest point to a given query point, start at the root and recursively search in both subtrees using the following pruning rule: if the closest point discovered so far is closer than the distance between the query point and the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a node only if it might contain a point that is closer than the best one found so far. The effectiveness of the pruning rule depends on quickly finding a nearby point. To do this, organize the recursive method so that when there are two possible subtrees to go down, you choose first the subtree that is on the same side of the splitting line as the query point; the closest point found while exploring the first subtree may enable pruning of the second subtree.
      • k-nearest neighbors search: This method returns the k points that are closest to the query point (in any order); return all n points in the data structure if n ≤ k. It must do this in an efficient manner, i.e. using the technique from kd-tree nearest neighbor search, not from brute force.
      • BoidSimulator: Once the k-nearest-neighbors search is implemented, we can simulate boids: how a flock of birds flies together and a hawk preys. Behold their flocking majesty. The following figures show the theory that is going to be used, taken from the lecture slides of the same course. (A minimal Python sketch of the 2d-tree appears after these figures.)

    f1 f2 f3 f4

    f5
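Before the results, here is a minimal Python sketch of the 2d-tree with insertion and pruned nearest-neighbor search following the rules described above; it is an illustration, not the course's reference implementation:

```python
# A minimal 2d-tree sketch: insertion plus pruned nearest-neighbor search.
class Node:
    def __init__(self, point, axis):
        self.point, self.axis = point, axis  # axis 0: split on x, axis 1: split on y
        self.left = self.right = None

def insert(root, point, axis=0):
    """Insert a point, alternating the splitting axis level by level."""
    if root is None:
        return Node(point, axis)
    if point[root.axis] < root.point[root.axis]:
        root.left = insert(root.left, point, 1 - root.axis)
    else:
        root.right = insert(root.right, point, 1 - root.axis)
    return root

def nearest(root, query, best=None):
    """Return (squared distance, point) of the closest point to `query`."""
    if root is None:
        return best
    d2 = (root.point[0] - query[0]) ** 2 + (root.point[1] - query[1]) ** 2
    if best is None or d2 < best[0]:
        best = (d2, root.point)
    # go down the subtree on the query's side of the splitting line first
    diff = query[root.axis] - root.point[root.axis]
    near, far = (root.left, root.right) if diff < 0 else (root.right, root.left)
    best = nearest(near, query, best)
    if diff ** 2 < best[0]:              # prune: cross the line only if it can help
        best = nearest(far, query, best)
    return best

# usage: grow the tree point by point, then query
root = None
for p in [(0.7, 0.2), (0.5, 0.4), (0.2, 0.3), (0.4, 0.7), (0.9, 0.6)]:
    root = insert(root, p)
print(nearest(root, (0.3, 0.9))[1])      # -> (0.4, 0.7)
```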

    Results

    The following figures and animations show how the 2d-tree is grown with recursive space-partitioning for a few sample datasets.

  • Circle 10 dataset

circle10

test1 test2 test3 test4 test6 test7 test8 test9

  • Circle 100 dataset

circle100.gif

test1 test6 test16 test32 test64 test72 test99
The following figure shows the result of the range search algorithm on the same dataset after the 2d-tree is grown. The yellow points are the points found by the algorithm inside the query rectangle shown.

test-range

The next animations show the nearest-neighbor search algorithm for a given query point (the fixed white point with black border: the point (0.3, 0.9)) and how the branches are traversed and the points (nodes) are visited in the 2d-tree until the nearest neighbor is found.

nn.gif

nearest1.gif

The next animation shows how the kd-tree is traversed for nearest-neighbor search for a different query point (0.04, 0.7).

nearest2

The next figures show the results of the k-nearest-neighbor search, obtained by extending the previous algorithm, with different values of k (15, 10 and 5 respectively).

test_knn2 test_knn3 test_knn4

 

Runtime of the algorithms with a few datasets in Python

As can be seen from the next figure, the time complexity of 2d-tree building (insertion), nearest-neighbor search and k-nearest-neighbor query depends not only on the size of the datasets but also on the geometry of the datasets.

runtime.png

 

Flocking Boids simulator

Finally, the flocking boids simulator is implemented with 2d-trees, and the following 2 animations (Java and Python respectively) show how the flock of birds flies together; the black / white ones are the boids and the red one is the predator hawk.

birds

birds1

 

Implementing a kNN Classifier with kd tree from scratch

Training phase

Build a 2d-tree from a labeled 2D training dataset (points marked with red or blue represent 2 different class labels).

Testing phase

  • For a query point (new test point with unknown class label) run k-nearest neighbor search on the 2d-tree with the query point (for a fixed value of k, e.g., 3).
  • Take a majority vote on the class labels of the k nearest neighbors (with known class labels) obtained by querying the 2d-tree. Label the query point with the class label that the majority of its neighbors have.
  • Repeat for different values of k. (A sketch of the voting step is given below.)
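A small sketch of the voting step; k_nearest (the k-nearest-neighbor query on the 2d-tree) and the labels mapping are assumed to exist:

```python
# A sketch of the kNN testing phase; k_nearest() is the (assumed) k-nearest-
# neighbor routine built on the 2d-tree, labels maps point -> 'red' / 'blue'.
from collections import Counter

def classify(root, query, labels, k=3):
    neighbors = k_nearest(root, query, k)          # k closest training points
    votes = Counter(labels[p] for p in neighbors)  # count neighbor class labels
    return votes.most_common(1)[0][0]              # majority label wins
```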

 

The following figures show how the kd-tree built can be used to classify (randomly generated) 2D datasets; the decision boundaries learnt with k=3, 5 and 10 respectively are shown.

test_classification3 test_classification5 test_classification10

Using Particle Filter for localization: tracking objects from noisy measurements in 2D (in R)

In this article, we shall see how the Particle Filter can be used to predict the positions of some moving objects in 2D using a few sampled particles. This article is inspired by a programming assignment from the Coursera course Robotics Learning by the University of Pennsylvania. The same algorithm is often used for self-localization of a robot from noisy sensor measurements.

The basic idea of the Particle Filter is described in the figure below, taken from this video by Bert Huang.

p1.png

 

The next few figures, taken from the lectures of the Coursera course Robotics Learning, again describe the basics of the particle filter.

p2.png
p3p4

  • The particle filter algorithm is described in the next figure, taken from a lecture video by Udacity.

p5

 

  • The next set of figures / animations shows how the position of a moving bug is tracked using a Particle Filter.
    • First the noisy measurements of the positions of the bug are obtained at different time instants.
    • Then M=100 particles are sampled, and later they are propagated based upon some prior assumptions on position uncertainty.
    • Next the noisy measurement for the particle is computed (simulated by adding random noise to the true position).
    • Measurement update step: For each particle, the probability of the particle is calculated to update the particle weights, assuming some measurement noise.
    • Next the best particle is chosen to update the pose.
    • Re-sampling step:  if the effective number of particles is smaller than a threshold (e.g., 0.8),  the particles are re-sampled.
    • Next the particles along with the probability-weights are used to compute the estimated location (e.g., computing the weighted average of the particle positions).
    • The next animation shows the steps; the probability cloud represents the density over the particles chosen at each step. (A minimal Python sketch of this loop is given at the end of the section.)

motion2

    • The position of the bug, as shown in the animation above, moves randomly in the x and y directions within a grid defined by the rectangle [-100,100]x[-100,100].
    • The above figure also shows how at different iterations the Particle Filter predicts the position of the bug.
    • The next animation shows the density of the particles for a different measurement uncertainty.

motion1.gif
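To make the steps above concrete, here is a minimal sketch of the measurement-update / resampling loop, written in Python for consistency with the rest of this post (the original assignment is in R); the noise parameters and the 0.8*M resampling threshold are illustrative assumptions:

```python
# A minimal particle-filter sketch; sigma, the motion spread and the 0.8 * M
# resampling threshold are illustrative assumptions.
import numpy as np

M, sigma = 100, 5.0                                     # particles, measurement noise
particles = np.random.uniform(-100, 100, size=(M, 2))   # sampled over the grid
weights = np.ones(M) / M

def filter_step(z):
    """One update for a noisy 2D position measurement z; returns the estimate."""
    global particles, weights
    particles = particles + np.random.normal(0, 2.0, particles.shape)  # propagate
    d2 = np.sum((particles - z) ** 2, axis=1)
    weights = weights * np.exp(-d2 / (2 * sigma ** 2))  # Gaussian measurement likelihood
    weights = weights / weights.sum()                   # measurement update
    if 1.0 / np.sum(weights ** 2) < 0.8 * M:            # effective sample size test
        idx = np.random.choice(M, size=M, p=weights)    # re-sampling step
        particles, weights = particles[idx], np.ones(M) / M
    return weights @ particles                          # weighted-mean position estimate
```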