Analyzing the Structure of the Citation Graphs using In-Degree Distribution in Python

The following problem appeared as a project in the Coursera course Algorithmic Thinking (by Rice University), part of the Fundamentals of Computing specialization. The following description of the problem is taken directly from the project description.

In this problem, mathematical analysis of a real-world question: “How do scientific papers get cited?” will be combined with simulation. The goal is to provide a more realistic simulation of how the concepts are actually used in practice.

Citation graphs

Our task for this application is to analyze the structure of graphs generated by citation patterns from scientific papers. Each scientific paper cites many other papers, say 20-40, and sometimes (e.g., review papers) hundreds of other papers. But, let’s face it: It is often the case that the authors of a paper are superficially familiar with some (many?) of the papers they cite. So, the question is: Are the cited papers chosen randomly (from within the domain of the paper) or is there some “hidden pattern”?

Given that “paper i cites paper j” relationships are going to be examined, it makes sense to represent the citation data as a directed graph (a citation graph) in which the nodes correspond to papers, and there is an edge from node i to node j if the paper corresponding to node i cites the paper corresponding to node j. Since we’re interested in understanding how papers get cited, we shall analyze the in-degree distribution of a specific graph, and contrast it to those of graphs generated by two different random processes.

For this question, the task is to load a provided citation graph for 27,770 high energy physics theory papers. This graph has 352,807 edges. Then we need to compute the in-degree distribution for this citation graph. Once this distribution has been computed, we need to normalize it and then compute a log/log plot of the points in the normalized distribution.
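The computation can be sketched in Python as follows (a minimal sketch, assuming the graph is represented as a dictionary mapping each node to the set of nodes it points to; the function names are mine, not the course's):

```python
def compute_in_degrees(digraph):
    """Count incoming edges for every node of a graph given as {node: set_of_heads}."""
    in_degrees = {node: 0 for node in digraph}
    for node in digraph:
        for head in digraph[node]:
            in_degrees[head] += 1
    return in_degrees

def in_degree_distribution(digraph):
    """Map each observed in-degree to the number of nodes having that in-degree."""
    dist = {}
    for degree in compute_in_degrees(digraph).values():
        dist[degree] = dist.get(degree, 0) + 1
    return dist

def normalized_distribution(digraph):
    """Normalize the in-degree counts so that they sum to 1."""
    dist = in_degree_distribution(digraph)
    total = float(sum(dist.values()))
    return {deg: cnt / total for deg, cnt in dist.items()}
```

The normalized distribution can then be drawn on log/log axes with, e.g., matplotlib's `loglog`.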

A very small part of the citation graph is shown below:


Loaded graph with 27770 nodes



avg_out_degree: 12.7046092906

The following figure shows Algorithm ER for generating random graphs:


Consider the simple modification of the algorithm to generate random directed graphs: For every pair of nodes i and j, the modified algorithm considers the pair twice, once to add an edge (i,j) with probability p, and another to add an edge (j,i) with probability p.
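A minimal sketch of this modified ER process in Python (the function name and the dictionary-of-sets representation are illustrative choices; visiting every ordered pair (i, j) once is equivalent to considering each unordered pair twice):

```python
import random

def er_directed_graph(num_nodes, prob, seed=None):
    """Directed ER graph: every ordered pair (i, j) with i != j
    receives the edge i -> j independently with probability prob."""
    rng = random.Random(seed)
    graph = {node: set() for node in range(num_nodes)}
    for i in range(num_nodes):
        for j in range(num_nodes):
            if i != j and rng.random() < prob:
                graph[i].add(j)
    return graph
```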

Now, let’s consider the shape of the in-degree distribution for an ER graph and compare its shape to that of the physics citation graph. To determine the shape of the in-degree distribution for an ER graph, we can compute several examples of in-degree distributions (with simulation) or determine the shape mathematically.

The expected in-degree is the same for every node in an ER graph. Since for any given node, each incoming edge from the n-1 remaining nodes is chosen by tossing a coin with success probability p, the in-degree follows a Binomial(n-1, p) distribution, so the expected in-degree is (n-1)p. Each node is equivalent in this sense and has the same expected in-degree.

As can be seen from the next figure (a log-log plot of the in-degree distribution of an ER graph with n = 1000, p = 0.1), the distribution is centred approximately around in-degree 100. All the nodes have in-degree approximately between 66 and 132, with a steady and sharp increase in probability from in-degree 66 to 100, a peak probability at around in-degree 100, and then a steady and sharp fall in probability between in-degrees 100 and 132.

Also, it can be seen that the shape of the in-degree distribution of the citation graph is different from that of the ER graph. For the citation graph we observe a steady decrease in probability when moving from lower to higher in-degrees, with peak probabilities (a high frequency of nodes) at small in-degree values. For the ER graph, by contrast, there is first a steady increase and then a steady decrease, with the peak probability (a high frequency of nodes) attained at higher in-degree values.

Generating ER random graph with 1000 nodes


Conclusion: Citation graphs are not generated by a purely random process. If they were, we would expect the in-degree distribution for the citation graph to be similar to the in-degree distribution for the ER graphs. However, the distributions for ER graphs are binomial (bump-shaped) while the distribution for the citation graph is almost linear.

We next consider a different process for generating synthetic directed graphs. In this process, a random directed graph is generated iteratively, where in each iteration a new node is created, added to the graph, and connected to a subset of the existing nodes. This subset is chosen based on the in-degrees of the existing nodes. More formally, to generate a random directed graph in this process, the user must specify two parameters: n, which is the final number of nodes, and m (where m ≤ n), which is the number of existing nodes to which a new node is connected during each iteration. Notice that m is fixed throughout the procedure.

The algorithm starts by creating a complete directed graph on m nodes. Then, the algorithm grows the graph by adding n-m nodes, where each new node is connected to nodes randomly chosen from the set of existing nodes. As an existing node may be chosen more than once in an iteration, we eliminate duplicates (to avoid parallel edges). Hence, the new node may be connected to fewer than m existing nodes upon its addition.
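The growth process above can be sketched in Python (a sketch using the standard trick of keeping a list in which each existing node appears in-degree + 1 times, so that uniform sampling from the list realizes the preferential choice; the names are mine):

```python
import random

def dpa_graph(n, m, seed=None):
    """DPA: start from a complete digraph on m nodes, then add n - m nodes,
    each sending edges to m existing nodes chosen with probability
    proportional to (in-degree + 1); duplicate choices collapse to one edge."""
    rng = random.Random(seed)
    graph = {i: set(range(m)) - {i} for i in range(m)}
    # each existing node appears (in-degree + 1) times; in the complete
    # digraph every node has in-degree m - 1, hence m copies of each
    node_numbers = [node for node in range(m) for _ in range(m)]
    for new_node in range(m, n):
        neighbors = set(rng.choice(node_numbers) for _ in range(m))
        graph[new_node] = neighbors
        node_numbers.append(new_node)      # the "+1" share of the new node
        node_numbers.extend(neighbors)     # each chosen node gained one in-edge
    return graph
```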


Notice that this algorithm is more complex than the ER algorithm. As a result, reasoning about the properties of the graphs that it generates analytically is not as simple. When such a scenario arises, we can implement the algorithm, run it, produce graphs, and visually inspect their in-degree distributions. In general, this is a powerful technique: When analytical solutions to systems are very hard to derive, we can simulate the systems and generate data that can be analyzed to understand the properties of the systems.

For the construction of the DPA graph, let’s use the following parameters:

n = number of nodes (vertices) in the citation graph = 27770

m = integer part of the average out-degree of the citation graph = ⌊12.705⌋ = 12.

Next the DPA algorithm is implemented, a DPA graph is computed using the above parameter values, and the in-degree distribution of this DPA graph is plotted.

Generating DPA random graph with 27770 nodes


As can be seen from above, both the in-degree distribution of the DPA graph and that of the citation graph follow similar patterns. Both show high probabilities at small in-degrees and a steady decrease in probability at higher in-degrees, with similar negative slopes (the DPA distribution is steeper and more linear, while the citation graph shows a slightly non-linear fall in slope on the log scale as the in-degree increases, but the trend is much the same). Both graphs show that few nodes have high in-degrees.

The DPA process mimics the “rich gets richer” phenomenon. As more nodes join the network, they select with higher probability, as neighbours, the few nodes that are already rich (i.e., that already have high in-degrees), thereby increasing those in-degrees further. The majority of the nodes, which have low in-degrees, tend not to get selected as neighbours, so their in-degrees do not increase. The process modeled by the DPA algorithm thus mimics the rich-gets-richer model; it is also used to explain the six degrees of separation phenomenon.

The same “rich gets richer” phenomenon can explain the structure of the citation graph as well. The few papers that are heavily cited (rich, with high in-degrees) become important and tend to get cited again with high probability: when a new paper is written, it is likely to cite those few heavily cited papers again, thereby increasing their in-degrees (making them richer), while it cites the majority of papers with few citations (low importance) only with low probability, so they are likely to remain largely uncited.

Modeling Face Images with Nonnegative Matrix Factorization (NMF), Kmeans with Vector Quantization (VQ) and Singular Value Decomposition (SVD) in R

The following face modeling problem was discussed in one of the lectures in the edX course ColumbiaX: CSMM.102x Machine Learning. For this experiment the CBCL Face Database from the MIT Center for Biological and Computational Learning is used. The dataset contains 6,977 training face images and 24,045 test images for SVM experiments (each image is of size 21×21 pixels). The images from these two files are combined, and the first 10,000 images are used to model the faces with NMF, k-means/VQ and SVD; the models are then compared.

The following figure shows the first 625 images from the dataset.


k-means (Modeling the images with VQ)

The following figures show the averages of full faces learnt by k-means with different numbers of clusters (k = 25, 64, 100) respectively. As can be seen, some of the cluster centroids represent good-quality average faces, whereas for some others the average image is very noisy and cannot be identified as a face at all.





SVD (Modeling images with first k orthonormal vectors V)

SVD finds the singular value decomposition of the image matrix. As can be seen from the figure below, with k = 100 images representing the top k singular vectors (from right to left, column-wise), the results are not interpretable because of the negative values and the orthogonality constraint.



Using NMF with Squared error objective

The following figures show the iterative coordinate-ascent algorithm that is going to be used.



NMF learns a “parts-based” representation. Each column captures something interpretable. This is a result of the nonnegativity constraint. The following figures show the W matrix learnt by squared-error objective NMF, for different values of k=25, 64, 100 and the k columns of W are shown as images.
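The post's experiments are in R; an equivalent sketch of the squared-error NMF in Python is given below. It uses the Lee-Seung multiplicative updates, a standard alternative realization of the alternating scheme shown in the figures, and the small epsilon is a guard I added against division by zero:

```python
import numpy as np

def nmf_squared_error(X, k, n_iter=200, seed=0):
    """Factor a nonnegative matrix X (n x d) as W @ H with W, H >= 0,
    reducing the squared error ||X - W H||_F^2 via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, d)) + 1e-3
    eps = 1e-10  # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For face data, each of the k columns of W can then be reshaped back into an image to visualize the learnt parts.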


Statistical Inference and Modeling for High-throughput Experiments in R

The following inference problems (along with the descriptions) are taken directly from the exercises of the edX Course HarvardX: PH525.3x Statistical Inference and Modeling for High-throughput Experiments.

Inference in Practice Exercises

These exercises will help clarify that p-values are random variables, and illustrate some of their properties. Note that just as the sample average is a random variable because it is based on a random sample, a p-value is based on random variables (the sample mean and sample standard deviation, for example) and is thus also a random variable. The following 2 properties of p-values are mathematically proved in the following figure:

  • The p-values are random variables.
  • Under the null hypothesis the p-values follow a uniform distribution.


To see this, let’s see how p-values change when we take different samples. The next table shows the dataset Bodyweight from which

  1. The control and treatment groups each of size 12 are randomly drawn.
  2. Then a 2-sample t-test is performed with these groups to compute the p-value.
  3. Steps 1-2 are replicated 10000 times.
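A self-contained sketch of steps 1-3 in Python (the exercise itself uses R and the Bodyweight data; here a simulated population stands in, and the p-value uses a normal approximation to the t distribution, which is close enough for illustrating uniformity):

```python
import math
import random

def normal_sf(x):
    """Upper-tail probability of the standard normal, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def t_pvalue(a, b):
    """Two-sample (Welch) t statistic with a normal approximation for the two-sided p-value."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2.0 * normal_sf(abs(t))

def null_pvalues(population, group_size=12, reps=2000, seed=7):
    """Repeatedly draw disjoint control/treatment groups from the same
    population (so the null is true) and collect the resulting p-values."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(reps):
        sample = rng.sample(population, 2 * group_size)
        pvals.append(t_pvalue(sample[:group_size], sample[group_size:]))
    return pvals
```

A histogram of the returned p-values is approximately flat, as the figure below shows for the real data.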



The next table shows first few rows of randomly chosen control and treatment groups for a single replication.

control treatment
21.51 19.96
28.14 18.08
24.04 20.81
23.45 22.12
23.68 30.45
19.79 24.96

The following animation shows first few replication steps.


The following figure shows the distribution of the p-values obtained, which is nearly uniform, as expected.


Now, let’s assume that we are testing the effectiveness of 20 diets on mice weight. For each of the 20 diets let’s run an experiment with 10 control mice and 10 treated mice. Assume the null hypothesis that the diet has no effect is true for all 20 diets and that mice weights follow a normal distribution with mean 30 grams and a standard deviation of 2 grams, run a Monte Carlo simulation for one of these studies, to learn about the distribution of the number of p-values that are less than 0.05. Let’s run these 20 experiments 1,000 times and each time save the number of p-values that are less than 0.05.

The following figures show, using Monte Carlo simulations, how some of the t-tests reject the null hypothesis (at the 5% level of significance) simply by chance, even though the null is true.



The following figure shows how the FWER (family-wise error rate, i.e., the probability of rejecting at least one true null hypothesis), computed with Monte Carlo simulation, increases with the number of hypothesis tests.


The following figures show theoretically how the FWER can be computed:



Now, let’s try to understand the concept of a error controlling procedure. We can think of it as defining a set of instructions, such as “reject all the null hypothesis for for which p-values < 0.0001 or “reject the null hypothesis for the 10 features with smallest p-values”. Then, knowing the p-values are random variables, we use statistical theory to compute how many mistakes, on average, will we make if we follow this procedure. More precisely we commonly bounds on these rates, meaning that we show that they are smaller than some pre-determined value.

We can compute the following different error rates:

  1. The FWER (Family-wise error rate) tells us the probability of having at least one false positive.
  2. The FDR (False discovery rate) is the expected proportion of false positives among the rejected null hypotheses.

Note 1

The FWER and FDR are not procedures but error rates. We will review procedures here and use Monte Carlo simulations to estimate their error rates.

Note 2

We sometimes use the colloquial term “pick genes that” meaning “reject the null hypothesis for genes that.”

Bonferroni Correction Exercises (Bonferroni versus Sidak)

Let’s consider the following figure:




Let’s plot of α/m and 1(1α)^(1/m) for various values of m>1. Which procedure is more conservative (picks less genes, i.e. rejects less null hypothesis): Bonferroni’s or Sidak’s? As can be seen from the next figures, Bonferroni is more conservative.




Some Machine Learning with Python (Contd…)

In this article, a few Python scikit-learn implementations of machine learning problems will be discussed, all of which appeared as lab exercises in the edX course Microsoft: DAT210x Programming with Python for Data Science. The problem descriptions are taken straight from the course itself.


In this exercise, a support vector machine classifier will be used to classify UCI’s wheat-seeds dataset.

  1. First, let’s benchmark how long SVM takes to train and predict with SVC relative to how long K-Neighbors takes to train and test.
  2. Then compare the decision boundary plot produced by the two using the wheat dataset.
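The benchmark can be sketched as below (assuming scikit-learn is available; the built-in iris data stands in for the wheat-seeds file, whose path is not reproduced here, and the replication counts are illustrative):

```python
import time
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def benchmark(model, X_train, X_test, y_train, y_test, reps=50):
    """Fit and score `reps` times; return (train_seconds, test_seconds, accuracy)."""
    t0 = time.perf_counter()
    for _ in range(reps):
        model.fit(X_train, y_train)
    train_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    for _ in range(reps):
        score = model.score(X_test, y_test)
    test_time = time.perf_counter() - t0
    return train_time, test_time, score

# the bundled iris data stands in for the wheat-seeds file used in the lab
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
knn_result = benchmark(KNeighborsClassifier(n_neighbors=9), X_tr, X_te, y_tr, y_te)
svc_result = benchmark(SVC(kernel="linear", C=1.0), X_tr, X_te, y_tr, y_te)
```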

The following table shows the first few rows of the entire dataset.

area perimeter compactness length width asymmetry groove wheat_type
0 15.26 14.84 0.8710 5.763 3.312 2.221 5.220 kama
1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 kama
2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 kama
3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 kama
4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 kama

As usual, the entire dataset is divided into 2 parts, 70% as training and 30% as test dataset. The first few rows of the training dataset are shown below.

area perimeter compactness length width asymmetry groove
61 11.23 12.63 0.8840 4.902 2.879 2.269 4.703
116 18.96 16.20 0.9077 6.051 3.897 4.334 5.750
154 11.36 13.05 0.8382 5.175 2.755 4.048 5.263
38 14.80 14.52 0.8823 5.656 3.288 3.112 5.309
194 12.11 13.27 0.8639 5.236 2.975 4.132 5.012

The next figure and the results show the performance of the k-nearest neighbor classifier in terms of the training / prediction time and the accuracy of prediction, with

  1. All the features are used for training and prediction.
  2. Only 2 features at a time are used for training and prediction.
KNeighbors Results
5000 Replications Training Time:  1.61899995804
5000 Replication Scoring Time:  2.78100013733
High-Dimensionality Score:  83.607
Max 2D Score:  90.164


Again, the next figure and the results show the performance of the SVM classifier (with linear kernel and slack C=1) in terms of the training / prediction time and the accuracy of prediction, with

  1. All the features are used for training and prediction.
  2. Only 2 features at a time are used for training and prediction.

As can be seen, both accuracies (with all the features, and the maximum accuracy obtained with any 2 features) are higher for SVM. Also, as expected, training is slower for SVM than for KNN but prediction is faster.

SVC Results
5000 Replications Training Time:  3.09300017357
5000 Replication Scoring Time:  1.27099990845
High-Dimensionality Score:  86.885
Max 2D Score:  93.443


The following heatmaps show the accuracies of the two classifiers using different 2D features.

Accuracy (%) of KNN with 2D features


Accuracy (%) of SVM (with linear kernel and slack C=1) with 2D features


Accuracy (%) of SVM (with different kernels and slack variable values) with all features

C 0.001 0.01 0.1 1.0 10.0 100.0 1000.0
linear 57.377 86.885 85.246 86.885 91.803 95.082 93.443
poly 88.525 93.443 90.164 90.164 93.443 93.443 93.443
rbf 29.508 29.508 86.885 86.885 85.246 88.525 86.885

Training time (with replications) for SVM in seconds (with different kernels and slack variable values) with all features

C 0.001 0.01 0.1 1.0 10.0 100.0 1000.0
linear 3.179 2.636 1.942 2.855 6.327 20.649 28.705
poly 3.828 9.994 31.531 65.456 131.638 132.218 130.544
rbf 6.322 5.804 4.543 2.992 3.257 3.212 3.724


Test time (with replications) for SVM in seconds (with different kernels and slack variable values) with all features

C 0.001 0.01 0.1 1.0 10.0 100.0 1000.0
linear 1.764 1.465 1.376 1.264 1.245 1.359 1.192
poly 1.285 1.246 1.255 1.254 1.233 1.226 1.236
rbf 2.336 2.641 2.084 1.683 1.769 1.457 1.454


Handwritten-Digits Classification with SVM

Even though the United States Postal Service, as an organization, was formed in 1971, it traces its roots back to the Post Office Department, an organization formed in 1792 (its origins go back to 1775, when Benjamin Franklin was appointed the first Postmaster General). It later evolved into a cabinet-level department in 1872, before finally being transformed into the USPS we know today in 1971, as an agency of the U.S. government.

Back in the day, all mail was hand-read and delivered. Even up to the turn of the 20th century, antiquated techniques such as the pigeonhole method from colonial times were used for mail handling. During the 1950s, the post office started intense research on the coding systems used in many other countries and started down the path of automation. In 1982, the first computer-driven OCR machine was installed in Los Angeles, and by the end of 1984, over 250 OCR machines were installed in 118 major mail processing centers across the country, processing an average of 6,200 pieces of mail per hour.


Nowadays, the Postal Service is one of the world leaders in optical character recognition technology, with machines reading nearly 98 percent of all hand-addressed letter mail and 99.5 percent of machine-printed mail, and with single-tray sorting machines capable of sorting more than 18 million trays of mail per day.

Let’s train a support vector classifier in a few seconds using machine learning, and compute the classification accuracy and compare with the advertised USPS stats. For this lab, we shall use of the Optical Recognition of Handwritten Digits dataset, provided courtesy of UCI’s Machine Learning Repository.

Train your SVC classifier with the parameters provided, and keep testing until you’re able to beat the classification abilities of the USPS.
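The tuning loop can be sketched as follows (scikit-learn's bundled 8×8 digits stand in for the UCI dataset files used in the lab, and the parameter grid is an illustrative choice in the spirit of the values reported below):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# scikit-learn's bundled digits stand in for the lab's UCI data files
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# an illustrative parameter grid; gamma is ignored by the linear kernel
param_grid = {"kernel": ["linear", "rbf"], "C": [0.01, 1, 10], "gamma": [1e-6, 1e-3]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_tr, y_tr)
test_score = search.best_estimator_.score(X_te, y_te)
```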

Remember how important having a lot of samples is for machine learning? Try tossing out 96% of your samples, and see how it affects the accuracy of your highest accuracy support vector classifier.

Here are the few digits extracted from the training dataset.


The best SVM hyper-parameters learnt with grid-search cross-validation:

Training SVC Classifier...
Best parameters: {'kernel': 'linear', 'C': 0.01, 'gamma': 1e-06}
Best (mean) score: 0.915032679739

The following heatmaps show the grid-search cross validation accuracy on the validation dataset with different kernels with SVM classifier:

Although on the test dataset the model does not achieve a very high score:

Scoring SVC Classifier... Score: 0.854120267261

The following figure shows a few digits picked from the test dataset and classified with the best-tuned SVM model; the ground-truth labels are drawn on top of each digit, with the labels of the correctly predicted digits colored green and the wrongly predicted ones colored red.


Finally the 1000th test image is predicted with the model; the figure below shows the image along with the prediction result.

1000th test label:  4
1000th test prediction:  [4]


Discovery of Temporal Neighborhoods through Discretization Methods and Markov Model in C++ / Qt

This article describes a few approaches and algorithms for temporal neighborhood discretization from a couple of papers accepted at the conference ICDM (2009) and in the journal IDA (2014), authored by me and my fellow professors Dr. Aryya Gangopadhyay and Dr. Vandana Janeja at UMBC. Almost all of the text / figures in this blog post are taken from the papers.

Neighborhood discovery is a precursor to knowledge discovery in complex and large datasets such as temporal data, i.e., sequences of data tuples measured at successive time instants. Hence, instead of mining the entire dataset, we are interested in dividing the huge dataset into several smaller intervals of interest, which we call temporal neighborhoods. In this article we propose 4 novel algorithms to generate temporal neighborhoods through unequal-depth discretization:

  1. Similarity-based simple Merging (SMerg)
  2. Markov Stationary distribution based Merging (StMerg)
  3. Greedy Merging (GMerg)
  4. Optimal Merging with Dynamic Programming (OptMerg).

The next figure shows the schematic diagram of the four algorithms:

All the algorithms start by dividing a given time series into a user-chosen initial number of bins (with equal-frequency binning) and then proceed by merging similar bins (resulting in a few, possibly unequal-width, bins) using different approaches; the algorithms terminate once the user-specified number of output (merged) bins is obtained.

Both SMerg and StMerg start by setting up a Markov Model with the initial bins as the states (each state summarized by the corresponding bin’s mean and variance) and a Transition Matrix computed from the similarity between the bins (measured with a few different distribution-similarity metrics), as shown in the following figure:


The following distribution-distance measures are used to compute the similarity between the bins (we assume the temporal data in each bin is normally distributed with the bin’s mean and variance).

We discuss the results for a one dimensional special case here, however, our approach is generalizable to multidimensional temporal data.

The Markov Transition Matrix is defined as shown in the following figure as a block-diagonal matrix with a given Temporal window size w.

The following figure shows the pre-processing step Init-bin for setting up the Markov Model


1. Similarity based Merging (SMerg)

In this approach we iteratively merge highly similar bins (based on a threshold). At every iteration the transition matrix is recomputed, since the bin population has changed.

2. Markov Stationary distribution based Merging (StMerg)

Given the initial equal-depth bins and the transition matrix T, we compute the steady-state vector of the stationary distribution using the power-iteration method. Once we have this vector, we need to detect the spikes in it, which indicate split points such that the data on either side of any split point is very dissimilar while the data between two split points is highly similar. To detect these spikes we use the Discrete Fourier Transform (DFT) and the Inverse Discrete Fourier Transform (IDFT).
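The power-iteration step can be sketched in plain Python (a sketch only; the section's implementation is in C++ / Qt, and the real transition matrices come from the bin-similarity computation above):

```python
def power_iteration(transition, num_iter=500):
    """Left power iteration v <- v @ T on a row-stochastic matrix T
    (given as a list of lists), starting from the uniform distribution."""
    n = len(transition)
    v = [1.0 / n] * n
    for _ in range(num_iter):
        v = [sum(v[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return v
```

For an ergodic chain this converges to the stationary vector π satisfying π = πT; for example, for T = [[0.9, 0.1], [0.5, 0.5]] it converges to (5/6, 1/6).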

The following figure shows how the steady-state vector approximates the pattern in the original data for a few synthetically generated and public transportation datasets, thereby achieving data compression (since the number of states in the steady-state vector = the number of output bins obtained after merging << the number of initial bins).



3. Greedy Approach (GMerg)

This algorithm starts with the initial bins and merges the most similar adjacent bins right away (it does not require a Markov Model). It is greedy in the sense that it chooses the currently most similar intervals for merging, and hence may find a local optimum instead of a globally optimal merging.


4. Optimal merging (OPTMerg)

Starting with the initial equal-depth intervals (bins), we introduce the notion of optimal merging by defining optimal partitioning, as described in the following figures:

Next we use dynamic programming to find an optimal partition, as shown in the next figure:


The following figures show the optimal partition trees obtained on a few datasets:



Experimental Results with Synthetic and CAT Lab data

The following figures show the output bins obtained for a few of the datasets using a few of the algorithms and distance measures.

From the regions marked (a, b) in the following figures we can see that the weekend days are clearly different from the weekdays marked (d, e). Further, if we consider a very high-level granularity, such as region (c) obtained using StMerg with the KL distance, we can entirely demarcate the weekends from the weekdays (the global outliers can be detected, marked by a circle).



In the next figure, temporal intervals (a) and (b), obtained after merging the intervals, show that there is an anomaly in the data from 05/12/2009 23:03:39 to 05/13/2009 00:03:39. We indeed find that the speed value recorded by the sensor is a constant 20 over this whole period, and we verified with a CATT Lab representative that this is due to a possible “free flow speed condition”, where the sensor may go into a power-save mode if the traffic intensity is very minimal or close to zero.


We have also extended the existing well-known and popular technique SAX (Symbolic Aggregate Approximation) to SAXMerg (where we merge adjacent bins represented by the same symbol), as shown below.


We then compare our results with those obtained using SAXMerg, as shown in the following figures.

As can be seen from the merged output bins in the next figure, OPTMerg with the Mahalanobis distance measure is more robust to local outliers than SAXMerg.



Some results with precision and recall:


A Semi-Supervised Classification Algorithm using Markov Chain and Random Walk in R

In this article an implementation of a semi-supervised classification algorithm using Markov chains and random walks will be described. We have the following 2D circles dataset (with 1000 points) in which only 2 points are labeled (colored red and blue respectively in the figure; for all the others the labels are unknown, indicated by the color black).

Now the task is to predict the labels of the other (unlabeled) points. From each of the unlabeled points (states), a random walk with the Markov transition matrix (computed from the row-stochastic kernelized distance matrix) will be started; it will end in one of the labelled states, which are absorbing states of the Markov chain.


This problem was discussed as an application of Markov chains in a lecture from the edX course ColumbiaX: CSMM.102x Machine Learning. The following theory is taken straight from the slides of the same course:


Since the Markov chain here does not satisfy the connectedness condition of the Perron-Frobenius theorem, it will not have a single stationary distribution. Instead, for each of the unlabeled points, the random walk will end in exactly one of the terminal (absorbing) states. We can directly compute (using the above formula, or via the power-iteration method, which is used in this implementation) the probability of terminating in each absorbing state when starting from any unlabeled point, and accordingly label the point with the class of the most probable absorbing state in the Markov chain.
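The post's implementation is in R; the absorbing-walk labeling rule can be sketched in plain Python as follows (the Gaussian-kernel transition matrix and all names are illustrative, and for clarity each unlabeled point's walk distribution is propagated separately, which is only practical for small inputs):

```python
import math

def label_by_random_walk(points, labels, bandwidth=0.01, num_iter=100):
    """Semi-supervised labeling sketch: labeled points become absorbing states,
    unlabeled rows of the transition matrix are Gaussian-kernel similarities
    normalized to be row-stochastic, and each unlabeled point's distribution is
    propagated by power iteration; `labels` holds None for unlabeled points."""
    n = len(points)
    T = []
    for i in range(n):
        if labels[i] is not None:
            T.append([1.0 if j == i else 0.0 for j in range(n)])  # absorbing state
        else:
            row = [math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / bandwidth)
                   if j != i else 0.0
                   for j in range(n)]
            s = sum(row)
            T.append([r / s for r in row])  # row-stochastic transition probabilities
    labeled = [j for j in range(n) if labels[j] is not None]
    out = list(labels)
    for i in range(n):
        if labels[i] is None:
            v = [1.0 if j == i else 0.0 for j in range(n)]  # walk starts at point i
            for _ in range(num_iter):
                v = [sum(v[k] * T[k][j] for k in range(n)) for j in range(n)]
            out[i] = labels[max(labeled, key=lambda j: v[j])]  # most probable absorber
    return out
```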

The following animation shows how the power-iteration algorithm (run for up to 100 iterations) converges to an absorbing state for a given unlabeled point, which then gets labeled correctly. The bandwidth b for the Gaussian kernel is taken to be 0.01.

As can be seen, with increasing iterations the probability that the walk ends in the particular red absorbing state with index 323 increases. The length of a bar in the second barplot represents the probability after an iteration, and the colors distinguish the two absorbing states from the unlabeled states; the vector w shown contains 1000 entries, since the number of data points is 1000.

Now, for each of the unlabeled data points, 500 iterations of the power-iteration algorithm are used to obtain convergence to one of the absorbing states, as shown. Each time a new unlabeled (black) point is selected, a random walk is started with the underlying Markov transition matrix, and the power iteration is continued until it terminates in one of the absorbing states with high probability. Since there are only two absorbing states, the point is finally labeled with the label (red or blue) of the absorbing state where the random walk is more likely to terminate. The bandwidth b for the Gaussian kernel is taken to be 0.01; of course, the result varies with the kernel bandwidth. The next animation shows how the unlabeled points are sequentially labeled by the algorithm.


Implementing a MultiClass Bayes Classifier (a Generative Model) with Gaussian Class-conditional Densities in Python

The following problems appeared as a project in the edX course ColumbiaX: CSMM.102x Machine Learning. The following problem description is taken from the course project itself.


The following figure shows the first few rows of a sample training dataset with d=2 dimensions and K=5 class labels.

x1 x2 y
0 5.731453 4.517417 4
1 -0.743516 2.157767 3
2 5.745734 1.205449 1
3 -0.322106 -0.183691 0
4 2.200057 -1.658819 3

The following figure shows how the Bayesian classifier can first be trained to estimate the prior and the class-conditional Gaussian densities (as a generative model), and then, using Bayes’ rule, the posterior probabilities of the class labels can be predicted on a new test dataset.


The following figures show the Gaussian class conditional densities learnt from the training dataset with MLE.
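The training and prediction steps can be sketched with NumPy as follows (a minimal sketch of the generative model: MLE priors and per-class full-covariance Gaussians, with Bayes' rule applied in log space; the tiny ridge on the covariance is a numerical safeguard I added, not part of the course recipe):

```python
import numpy as np

class GaussianBayes:
    """Generative multiclass classifier: MLE class priors plus a full-covariance
    Gaussian fitted to each class; prediction applies Bayes' rule in log space."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.covs_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)
            self.means_[c] = Xc.mean(axis=0)
            # bias=True gives the MLE covariance; the tiny ridge keeps it invertible
            self.covs_[c] = np.cov(Xc.T, bias=True) + 1e-6 * np.eye(X.shape[1])
        return self

    def _log_gauss(self, X, mean, cov):
        d = X.shape[1]
        diff = X - mean
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        quad = np.einsum("ij,jk,ik->i", diff, inv, diff)  # per-row Mahalanobis term
        return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

    def predict(self, X):
        scores = np.column_stack(
            [np.log(self.priors_[c]) + self._log_gauss(X, self.means_[c], self.covs_[c])
             for c in self.classes_])
        return self.classes_[np.argmax(scores, axis=1)]
```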


The following figure shows the predicted posterior probabilities on the test dataset.

The following figure shows the decision boundaries with the trained classifier on the same 2-D test dataset. As can be seen, since the covariance matrices of the class-conditional Gaussians were not shared (unlike in Linear Discriminant Analysis), the decision boundaries are non-linear and quite complex.

The following figures show the decision boundaries learnt with the Bayesian Classifier trained and tested on different sample datasets respectively.


Active Learning using uncertainties in the Posterior Predictive Distribution with Bayesian Linear Ridge Regression in Python

The following problems appeared as a project in the edX course ColumbiaX: CSMM.102x Machine Learning. The following problem description is taken from the course project itself.


This assignment has two parts that we need to implement.


In this part we need to implement the ℓ2-regularized least squares linear regression algorithm (ridge regression). The objective function takes the following form:


The task will be to implement a function that takes in data y and X and outputs w_RR for an arbitrary value of λ.

Let’s use the following equation to compute the weights in a single step, given that the closed-form solution for ridge regression exists, as shown in the following figure. We need to be sure to exclude the intercept from the ℓ2 penalty, either by scaling the data appropriately or by explicitly zeroing the corresponding term in the identity matrix.
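A minimal sketch of this closed-form solution, w_RR = (λI + XᵀX)⁻¹Xᵀy, assuming the all-ones intercept column is the last column of the design matrix (as in the training table shown below):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution w_RR = (lam*I + X^T X)^(-1) X^T y.

    Assumes the last column of X is the all-ones intercept column;
    its diagonal entry in the penalty matrix is forced to zero so the
    intercept is not shrunk."""
    d = X.shape[1]
    P = lam * np.eye(d)
    P[-1, -1] = 0.0                      # do not penalize the intercept term
    return np.linalg.solve(P + X.T @ X, X.T @ y)
```

With λ = 0 this reduces to ordinary least squares; as λ grows, the slope coefficients shrink toward zero while the intercept is left free.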


The following table shows a few rows of the training data along with the intercept term as the last column.

0 1 2 3 4
0 8.34 40.77 1010.84 90.01 1
1 23.64 58.49 1011.40 74.20 1
2 29.74 56.90 1007.15 41.91 1
3 19.07 49.69 1007.22 76.79 1
4 11.80 40.66 1017.13 97.20 1

The following figures show the Ridge coefficient paths with different λ values.



The next task is to implement the active learning procedure. For this problem, we are given an arbitrary setting of λ and σ^2 and asked to provide the first 10 locations to be measured from a set D={x}, given a set of measured pairs (y, X). We need to be careful about the sequential evolution of the sets D and (y, X).

The following theory from Bayesian linear regression will be used: the posterior distribution quantifies the uncertainty in prediction; for active learning we select, from the new dataset, the data points for which the model's prediction is the most uncertain, and then apply the Bayesian sequential posterior update, as shown in the following figures:
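A minimal sketch of this selection loop, under the standard Bayesian ridge assumptions (posterior covariance Σ = (λI + σ⁻²XᵀX)⁻¹ and predictive variance σ² + x₀ᵀΣx₀ at a point x₀); the function name and signature are illustrative, not the assignment's required interface:

```python
import numpy as np

def active_learning_order(X_measured, D, lam, sigma2, n_pick=10):
    """Sequentially pick the n_pick rows of the pool D with the largest
    predictive variance sigma2 + x^T Sigma x, where
    Sigma = (lam*I + sigma^{-2} X^T X)^{-1} is the posterior covariance
    of w in Bayesian ridge regression. After each pick the chosen x is
    added to the measured set (sequential posterior update)."""
    d = X_measured.shape[1]
    XtX = X_measured.T @ X_measured
    picked, remaining = [], list(range(len(D)))
    for _ in range(n_pick):
        Sigma = np.linalg.inv(lam * np.eye(d) + XtX / sigma2)
        var = np.einsum('ij,jk,ik->i', D[remaining], Sigma, D[remaining])
        j = remaining[int(np.argmax(var))]
        picked.append(j)
        remaining.remove(j)
        XtX += np.outer(D[j], D[j])     # posterior update with the new row
    return picked
```

Note that only xᵀΣx is compared, since σ² is a shared constant; the sequential update only needs XᵀX, not the responses y.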




The following animation shows the first 10 data points chosen using active learning with different λ and σ^2 parameters:

λ = 1, σ^2 = 2
λ = 2, σ^2 = 3
λ = 10, σ^2 = 5


Probabilistic Matrix Factorization to fill up the Missing User-Ratings for Recommendation with a Generative Model in Python

The following problem appeared as a project in the edX course ColumbiaX: CSMM.102x Machine Learning. The following problem description is taken from the course project itself.


The following figures show how the idea of matrix factorization can be extended to probabilistic matrix factorization by assuming a Gaussian generative process.



Here is the probabilistic matrix factorization algorithm:

The first few rows of the input ratings dataset are shown below; it is a comma-separated file. Each row contains three values that correspond, in order, to: user_index, object_index, rating. There are 1000 users, 100 objects and a total of 2000 ratings provided by the users.

   user_index  object_index  rating
0         600            10       4
1         573            32       5
2         308            10       5
3          34            47       5
4         614            32       1

The PMF algorithm will factorize the low-rank sparse rating matrix into two matrices, U (user) and V (object), with latent feature space of dimension d, as shown below:
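The alternating MAP coordinate-ascent updates behind this factorization can be sketched as follows (an illustrative implementation; the hyperparameter defaults are assumptions, not the course's grading values):

```python
import numpy as np

def pmf(M_obs, n_users, n_objects, d=5, lam=2.0, sigma2=0.1, n_iter=50, seed=0):
    """MAP coordinate ascent for probabilistic matrix factorization.

    M_obs is a list of (user, object, rating) triples. Each iteration
    alternately solves the ridge-like update
        u_i = (lam*sigma2*I + sum_j v_j v_j^T)^(-1) sum_j M_ij v_j
    over the ratings observed for user i, and the symmetric update for v_j."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 1 / np.sqrt(lam), (n_users, d))
    V = rng.normal(0, 1 / np.sqrt(lam), (n_objects, d))
    by_user = {i: [(j, r) for (u, j, r) in M_obs if u == i] for i in range(n_users)}
    by_obj = {j: [(i, r) for (i, o, r) in M_obs if o == j] for j in range(n_objects)}
    for _ in range(n_iter):
        for i, obs in by_user.items():
            if obs:
                Vo = V[[j for j, _ in obs]]
                r = np.array([r for _, r in obs])
                U[i] = np.linalg.solve(lam * sigma2 * np.eye(d) + Vo.T @ Vo, Vo.T @ r)
        for j, obs in by_obj.items():
            if obs:
                Uo = U[[i for i, _ in obs]]
                r = np.array([r for _, r in obs])
                V[j] = np.linalg.solve(lam * sigma2 * np.eye(d) + Uo.T @ Uo, Uo.T @ r)
    return U, V
```

Each update is a small ridge regression over only the observed entries, which is why the MAP objective increases sharply in the first few iterations and then flattens, as shown in the convergence figure below.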



The following figure shows how the log likelihood value (as shown in the above figure, the MAP objective function in the log scale) increases abruptly with the first few iterations of the PMF algorithm and then gradually converges, as expected:

The following figures and the animation show how the U and V matrices change with the number of iterations.




Finally, the following figure and the animation show how the proximity (similarity) of the users and the objects in the latent embedded space changes with the number of iterations, shown for the first few users and objects (the blue points are users and the green ones are objects):


The blue points represent users and the orange ones the objects in the next animation.


Some Natural Language Processing: Using Trigram Hidden Markov Models and Viterbi Decoding to Tag Genes in Biological Text in Python

This problem appeared as a programming assignment in the coursera course Natural Language Processing (NLP) by Columbia University. The following description of the problem is taken directly from the description of the assignment.

In this assignment, we need to build a trigram hidden Markov model to identify gene names in biological text. Under this model, we define the joint probability of a sentence x_1, x_2, …, x_n and a tag sequence y_1, y_2, …, y_n. A basic unigram HMM is shown in the figure below.


Our goal in this assignment is to use a trigram HMM, under which the joint probability is defined as follows:


where * is a padding symbol that indicates the beginning of a sentence and STOP is a special HMM state indicating the end of a sentence. The task is to implement this probabilistic model and a decoder for finding the most likely tag sequence for new sentences.
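Concretely, the joint probability factorizes into trigram transition terms q(y_i | y_{i-2}, y_{i-1}) (with two '*' paddings and a final STOP) and emission terms e(x_i | y_i). A minimal sketch, assuming q and e are dictionaries mapping tuples to probabilities:

```python
import math

def joint_log_prob(words, tags, q, e):
    """log p(x_1..x_n, y_1..y_n) under the trigram HMM:
    product of trigram transitions q(y_i | y_{i-2}, y_{i-1}) (padded with '*'),
    times emissions e(x_i | y_i), times the final q(STOP | y_{n-1}, y_n)."""
    padded = ['*', '*'] + list(tags) + ['STOP']
    lp = 0.0
    for i in range(2, len(padded)):
        lp += math.log(q[(padded[i - 2], padded[i - 1], padded[i])])
    for w, t in zip(words, tags):
        lp += math.log(e[(w, t)])
    return lp
```

Working in log space avoids underflow on long sentences, which matters once the decoder multiplies many small probabilities.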

A labeled training dataset gene.train, labeled and unlabeled versions of the development set (gene.key is the labeled version), and an unlabeled test set gene.test are provided. The labeled files take the format of one word per line, with word and tag separated by a space and a single blank line separating sentences, e.g.,

Comparison O
with O
alkaline I-GENE
phosphatases I-GENE
and O
nucleotidase I-GENE
Pharmacologic O
aspects O
of O
neonatal O
hyperbilirubinemia O
. O

The following shows a graphical model representation of yet another sentence to be tagged as shown:


An unlabeled test dataset is also provided; the unlabeled file contains only the words of each sentence, and it will be used to evaluate the performance of the model developed.

There are 13,796 sentences in the training dataset, whereas the dev dataset contains 509 sentences.

The task consists of identifying gene names within biological text. In this dataset there is one type of entity: gene (GENE). The dataset is adapted from the BioCreAtIvE II shared task.

Here are the steps that are to be followed:

  1. Estimate the parameters of the Trigram HMM from the training data.
  2. Use the parameters learnt to compute the most likely tag sequence with the Viterbi decoding algorithm (dynamic programming).

Compute MLE parameters for HMM


Parameters estimated for HMM

The q parameters computed for the HMM tagger with MLE on the training dataset are shown below.

 ('*', '*', 'I-GENE'): 0.05429109886923746,
 ('*', '*', 'O'): 0.9457089011307626,
 ('*', 'I-GENE', 'I-GENE'): 0.6034712950600801,
 ('*', 'I-GENE', 'O'): 0.3951935914552737,
 ('*', 'I-GENE', 'STOP'): 0.0013351134846461949,
 ('*', 'O', 'I-GENE'): 0.04545106154671572,
 ('*', 'O', 'O'): 0.9543190005365219,
 ('*', 'O', 'STOP'): 0.00022993791676247414,
 ('I-GENE', 'I-GENE', 'I-GENE'): 0.6057704112952732,
 ('I-GENE', 'I-GENE', 'O'): 0.3937794147738899,
 ('I-GENE', 'I-GENE', 'STOP'): 0.0004501739308369143,
 ('I-GENE', 'O', 'I-GENE'): 0.20999759384023098,
 ('I-GENE', 'O', 'O'): 0.6809432146294514,
 ('I-GENE', 'O', 'STOP'): 0.10905919153031761,
 ('O', 'I-GENE', 'I-GENE'): 0.5778575025176234,
 ('O', 'I-GENE', 'O'): 0.422079556898288,
 ('O', 'I-GENE', 'STOP'): 6.294058408862034e-05,
 ('O', 'O', 'I-GENE'): 0.03741872901853501,
 ('O', 'O', 'O'): 0.9246458312860389,
 ('O', 'O', 'STOP'): 0.037935439695426
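These transition probabilities are MLE count ratios, q(s | u, v) = count(u, v, s) / count(u, v). A minimal sketch of the estimation, assuming the tag sequence of each training sentence is available as a list:

```python
from collections import Counter

def estimate_q(tag_sequences):
    """MLE trigram parameters q(s | u, v) = count(u, v, s) / count(u, v),
    computed over tag sequences padded with two '*' symbols and 'STOP'."""
    tri, bi = Counter(), Counter()
    for tags in tag_sequences:
        padded = ['*', '*'] + list(tags) + ['STOP']
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1   # trigram (u, v, s)
            bi[tuple(padded[i - 2:i])] += 1        # its conditioning bigram (u, v)
    return {t: tri[t] / bi[t[:2]] for t in tri}
```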

The following result shows the tag sequences generated with the trigram HMM using the baseline tagger model, compared with the gold-standard tag sequences on the dev dataset for a sentence. The accuracy in terms of F1-score is quite low, around 25.6%, as expected.



The Viterbi Decoding Algorithm to be implemented is shown in the next figure:


The following result shows the tag sequences generated with the trigram HMM using the Viterbi tagger model, compared with the gold-standard tag sequences on the dev dataset for the same sentence. The accuracy in terms of F1-score improves quite a bit, to around 39.9% (fewer words are tagged inaccurately), as expected.
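The Viterbi decoder referenced above can be sketched as follows; this illustrative version keeps a table pi[k][(u, v)] of max log-probabilities with back-pointers, and assumes q and e dictionaries of transition and emission probabilities (missing entries are treated as probability zero):

```python
import math

def viterbi(words, tagset, q, e, NEG_INF=float('-inf')):
    """Viterbi decoding for the trigram HMM: pi[k][(u, v)] is the max log
    probability of any tag sequence ending in tags (u, v) at position k."""
    def logq(w, u, v): return math.log(q[(w, u, v)]) if q.get((w, u, v), 0) > 0 else NEG_INF
    def loge(x, v):    return math.log(e[(x, v)]) if e.get((x, v), 0) > 0 else NEG_INF

    n = len(words)
    S = lambda k: ['*'] if k <= 0 else tagset          # tags allowed at position k
    pi, bp = {0: {('*', '*'): 0.0}}, {}
    for k in range(1, n + 1):
        pi[k], bp[k] = {}, {}
        for u in S(k - 1):
            for v in S(k):
                best, arg = NEG_INF, None
                for w in S(k - 2):
                    score = pi[k - 1].get((w, u), NEG_INF) + logq(w, u, v) + loge(words[k - 1], v)
                    if score > best:
                        best, arg = score, w
                if arg is not None:
                    pi[k][(u, v)], bp[k][(u, v)] = best, arg
    # finish with the STOP transition, then follow back-pointers
    best, (u, v) = NEG_INF, (None, None)
    for (uu, vv), p in pi[n].items():
        if p + logq(uu, vv, 'STOP') > best:
            best, (u, v) = p + logq(uu, vv, 'STOP'), (uu, vv)
    tags = [u, v]
    for k in range(n, 2, -1):
        tags.insert(0, bp[k][(tags[0], tags[1])])
    return tags[-n:] if n >= 1 else []
```

The run time is O(n · |tagset|³), versus the exponential cost of scoring every tag sequence with the joint probability directly.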


  • Now, HMM taggers can be improved by grouping words into informative word classes rather than just into a single class of rare words. For this part we need to implement four rare word classes:
    1. Numeric The word is rare and contains at least one numeric character.
    2. All Capitals The word is rare and consists entirely of capitalized letters.
    3. Last Capital The word is rare, not all capitals, and ends with a capital letter.
    4. Rare The word is rare and does not fit in the other classes.

These can be implemented by replacing words in the original training data and re-learning the HMM parameters with MLE. Be sure to also replace words while testing. The expected total development F1-Score is 0.42.
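The four groupings can be sketched as a simple mapping function; the class token names below are illustrative, not the assignment's required ones:

```python
def rare_class(word):
    """Map a rare word to one of the four rare-word classes described above."""
    if any(c.isdigit() for c in word):
        return '_NUMERIC_'                 # contains at least one digit
    if word.isalpha() and word.isupper():
        return '_ALL_CAPITALS_'            # entirely capital letters
    if word[-1:].isupper():
        return '_LAST_CAPITAL_'            # not all capitals, ends with a capital
    return '_RARE_'                        # everything else
```

Words below a rarity threshold (e.g. fewer than 5 occurrences, the usual convention in this assignment family) are replaced by their class token in both the training data and the test sentences before re-estimating the HMM parameters.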

The following result shows the tag sequences generated with the trigram HMM using the Viterbi tagger model with rare-word groupings, compared with the gold-standard tag sequences and with the Viterbi tagger model without rare-word groupings on the dev dataset, for a couple of sentences. The accuracy in terms of F1-score improves further, to around 42%, as expected (still fewer words are tagged inaccurately).



The following figures show some outputs produced with Viterbi using rare-word groups.

Summary of Results

Tagger                         Precision  Recall    F1-Score
Baseline                       0.158861   0.660436  0.256116
Viterbi                        0.541555   0.314642  0.398030
Viterbi with rare-word groups  0.534940   0.345794  0.420057
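For reference, each F1-score in the table is the harmonic mean of the corresponding precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# e.g. the baseline row: f1_score(0.158861, 0.660436) is approximately 0.256116
```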