# Probabilistic Matrix Factorization to fill up the Missing User-Ratings for Recommendation with a Generative Model in Python

The following problem appeared as a project in the edX course ColumbiaX: CSMM.102x Machine Learning. The following problem description is taken from the course project itself.

The following figures show how the idea of Matrix Factorization can be extended to Probabilistic Matrix Factorization by assumes Gaussian generative process.

Here is the probabilistic matrix factorization algorithm:

The first few rows of the input dataset ratings are shown below, which is a comma separated file containing the data. Each row contains a three values that correspond in order to: user_index, object_index, rating. There are 1000 users, 100 objects and total 2000 ratings provided by the users.

USER OBJECT RATING
0 600 10 4
1 573 32 5
2 308 10 5
3 34 47 5
4 614 32 1

The PMF algorithm will factorize the low rank sparse rating matrix into 2 matrices U (user) and V (object) with dimension of latent feature space as d, as shown below:

The following figure shows how the log likelihood value (as shown in the above figure, the MAP objective function in the log scale) increases abruptly with the first few iterations of the PMF algorithm and then gradually converges, as expected:

The following figures and the animation show how the U and V matrices change with #iterations

Finally the following figure and the animation show how the proximity (similarity) of the users and the objects in the latent embedded space change with #iterations, shown for a first few users and objects (the blue points are users and green ones are objects):

The blue points represent users and the orange ones the objects in the next animation.

Advertisements

## 9 thoughts on “Probabilistic Matrix Factorization to fill up the Missing User-Ratings for Recommendation with a Generative Model in Python”

1. Hi. Nice post!
I have a question: How do you plot the embedded space? Using PCA? Thanks.

Like

2. Thanks, no it’s not with PCA. It’s just by scatter-plotting the users and the objects against the first two latent vectors.

Like

3. In the plot neg log likelihood vs iterations, isn’t the negative log likelihood suppose to decrease as you train?

Like

• You are right, the labels in the figure are wrong. The values computed represent the logarithm of the objective function value at different iteration. Thanks for pointing out, updated the typos.

Like

4. Daniel

Hi, in the very first “instruction” there is an equation (“… by maximizing the objective function:”). What norm do you use in the equation to regularize u and v? Frobenius? If so, why do you calculate it for each vector of the matrix separately and than sum them together? Why don’t you simply calculate the norm for the whole matrices u and v? The result is the same, isn’t it?

Like

• u_i and v_j are vectors, so regular L2-norms were used. I computed the log-likelihood iteratively for each (i,j) i.e., for each ratings and then summing all of them.

Since the data was in long format and not in wide format, i.e., in the data each row represented (u_i, v_j, r_ij) and it was not explicitly in form of a matrix M.

I doubt if the computation can be vectorized in terms of the entire matrix M, since the log-likelihood function is composed of summation of squared losses w.r.t. each individual rating and corresponding approximation with the inner product.

Like

• Daniel

Thanks for your quick reply Sandipan Dey!
In the objective function L (I believe it is taken from Ruslan’s original paper) only the squared distance between the original (M_ij) and predicted (u_i^T v_j) is computed for each rating in the training set. The euclidean (L2) norm of the vectors u_i is computed for every vector of the u matrix and summed together. This would be equivalent to frobenius norm of the complete matrix. I was just wondering why it would be written “so complicated” if it could be written jusb by as ||u|| but now I noticed that the vectors are squared beforehand. Mystery solved. thanks!

Like

5. Daniel

in the MAP inference coordinate ascent algorithm there is a capital I. What does it stand for? A mask matrix with 1 for values we know (user has rated this item) and 0 for those we don’t know (user has not rated this item)? That would not fit our needs as the equation obviously requires a scalar value, right?

Like

• I is the corresponding identity matrix (the update looks very much like the Ridge regression). There is no explicit mask matrix , since the ratings that the users have given only the corresponding entries (i,j) of the matrix were considered.

Like