Using Low Rank Matrix Factorization for Collaborative Filtering Recommender System

In this article, low rank matrix factorization will be used to predict the unrated movie ratings for the users from MovieLense (100k) dataset (given that each user has rated some of the movies). The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data set contains about 100,000 ratings (1-5) from 943 users on 1664 movies.

  1. The movie rating matrix looks like the following (the original matrix is tranposed, so that the rows represent the movies and the columns represents the users (userids)).
    inp
  2. The missed ratings in the above rating matrix R are filled with NA values

    ## [1] "mean movie rating (averaged over users)"
    ## [1] 3.587565
  1. The intuition behind using matrix factorization is that there must be some latent features (much smaller than the number of users and the number of movies) that determine how a user rates a movie. We have a set \(n_m\) of movies and a set
    \(n_u\) of users and a rating matrix\(R\) of size \(n_m \times n_u\) be the matrix that contains all the ratings that the movies are assigned by the users. We are interested to discover \(n\) latent features. Then our task, then, is to find two matrics \(\theta\) (a \(n_m \times n\) matrix) and \(X\) (a \(n_u \times n\)matrix) such that their product approximates \(R\).
  2. We need to solve the following optimization problem then,

opt.png

5. The iterative optimization algorithm will be of the following form.

cfalgo.png
6. Gradient Descent is rather slow, so we use R implementations of advanced optimizations algorithms such as conjugate gradients, BGFS(quasi-Newton method), L-BFGS-B (with box constraints).

7. First the ratings are mean-normalized and then the optimization problem is solved using each of the optimization algorithms and the results are compared. The following shows how the cost function decreases with iteration for different optimization algorithms.

8. The following figures show how different optimization algorithms decrease the cost function.

optconv

## [1] "cost 2539671.77446919"
## [1] "cost 110703210116234992"
## [1] "cost 171953204347799"
## [1] "cost 236843652651.34"
## [1] "cost 175338070.663156"
## [1] "cost 465176.925808996"
## [1] "cost 761285.33287109"
## [1] "cost 283993.085173937"
## [1] "cost 292564.381990335"
## [1] "cost 200561.871486645"
## [1] "cost 194371.902575212"
## [1] "cost 183205.522454863"
## [1] "cost 159401.515020291"
## [1] "cost 160589.814933421"
## [1] "cost 134880.32779571"
## [1] "cost 144016.011686179"
## [1] "cost 116237.733309246"
## [1] "cost 123577.92281226"
## [1] "cost 93651.3657402116"
## [1] "cost 103451.908757633"
## [1] "cost 66603.1978365748"
## [1] "cost 77540.9776204579"
## [1] "cost 48887.7818033615"
## [1] "cost 55110.1053808507"
## [1] "cost 50103.7225230529"
## [1] "cost 46392.951264039"
## [1] "cost 47519.6982095049"
## [1] "cost 43698.0712625981"
## [1] "cost 44627.8733566642"
## [1] "cost 43849.4524326804"
## [1] "cost 43003.770024054"
## [1] "cost 43313.5809130187"
## [1] "cost 42028.8492587968"
## [1] "cost 42369.5514587616"
## [1] "cost 41516.9550324289"
## [1] "cost 41570.2085791009"
## [1] "cost 45684.0663600229"
## [1] "cost 41437.4615230649"
## [1] "cost 41433.1119193269"
## [1] "cost 41292.4694384034"
## [1] "cost 41182.9156572852"
## [1] "cost 41719.7111447933"
## [1] "cost 41020.2663759604"
## [1] "cost 41086.0173955788"
## [1] "cost 40772.789141717"
## [1] "cost 40853.166723211"
## [1] "cost 40666.8485243903"
## [1] "cost 40612.0462748436"
## [1] "cost 41006.892450841"
## [1] "cost 40488.9187188521"
## [1] "cost 40538.1989801163"
## [1] "cost 40307.8638409605"
## [1] "cost 40363.8767943251"
## [1] "cost 40247.6826834427"
## [1] "cost 40197.9208387596"
## [1] "cost 40571.2461347357"
## [1] "cost 40107.6135588487"
## [1] "cost 40143.0333888219"
## [1] "cost 39973.2780508646"
## [1] "cost 40010.1835560509"
## [1] "cost 39990.2772328695"
## [1] "cost 39928.7024461502"
## [1] "cost 39948.0679819594"
## [1] "cost 39826.0270698398"
## [1] "cost 39868.004575984"
## [1] "cost 39700.8950024452"
## [1] "cost 39739.1904318283"
## [1] "cost 39689.4531967685"
## [1] "cost 39639.2029290929"
## [1] "cost 39944.2483922302"
## [1] "cost 39595.3635304394"
## [1] "cost 39610.7519622577"
## [1] "cost 39537.1375443435"
## [1] "cost 39547.3834050538"
## [1] "cost 39579.5904965083"
## [1] "cost 39517.0707535681"
## [1] "cost 39525.3230714506"
## [1] "cost 39464.907471372"
## [1] "cost 39485.371633035"
## [1] "cost 39400.2676278037"
## [1] "cost 39417.8521098372"
## [1] "cost 39411.1152466184"
## [1] "cost 39379.3566132223"
## [1] "cost 39388.3829353795"
## [1] "cost 39331.0576840536"
## [1] "cost 39350.6645927431"
## [1] "cost 39273.3531851628"
## [1] "cost 39290.3747953842"
## [1] "cost 39271.8922915982"
## [1] "cost 39246.1551556989"
## [1] "cost 39403.2999374394"
## [1] "cost 39225.6139368805"
## [1] "cost 39232.6779815193"
## [1] "cost 39196.8493795368"
## [1] "cost 39201.4203124899"
## [1] "cost 39226.7175558125"
## [1] "cost 39186.7884986778"
## [1] "cost 39190.8217655857"
## [1] "cost 39161.142397195"
## [1] "cost 39170.8442224068"
## [1] "cost 39131.6269286513"
## [1] "cost 39138.4698314816"
## [1] "cost 39147.3193796639"
## [1] "cost 39121.8378003072"
## [1] "cost 39125.9279936105"
## [1] "cost 39097.3899447765"
## [1] "cost 39107.0797159382"
## [1] "cost 39067.4549825511"
## [1] "cost 39075.4468471847"
## [1] "cost 39073.67454709"
## [1] "cost 39058.7350982025"
## [1] "cost 39062.4749908748"
## [1] "cost 39038.8889948359"
## [1] "cost 39046.9032113447"
## [1] "cost 39014.3522302578"
## [1] "cost 39021.7324502234"
## [1] "cost 39013.161719758"
## [1] "cost 39001.8029936697"
## [1] "cost 39074.4065323951"
## [1] "cost 38992.7269973103"
## [1] "cost 38995.8019998184"
## [1] "cost 38981.1658305879"
## [1] "cost 38982.5619891051"
## [1] "cost 38997.0134342158"
## [1] "cost 38977.3407026205"
## [1] "cost 38978.8175719401"
## [1] "cost 38966.9338903725"
## [1] "cost 38970.7923148204"
## [1] "cost 38954.4591468554"
## [1] "cost 38957.1260701767"
## [1] "cost 38963.2907704649"
## [1] "cost 38950.242505971"
## [1] "cost 38951.9781044256"
## [1] "cost 38939.667486979"
## [1] "cost 38943.7925569246"
## [1] "cost 38926.8537692114"
## [1] "cost 38930.1432696019"
## [1] "cost 38930.2755195295"
## [1] "cost 38922.8868540014"
## [1] "cost 38924.5809546808"
## [1] "cost 38913.3433301448"
## [1] "cost 38917.2274507787"
## [1] "cost 38900.5247402818"
## [1] "cost 38904.512797356"
## [1] "cost 38899.2397119627"
## [1] "cost 38893.627865368"
## [1] "cost 38930.4790375024"
## [1] "cost 38888.7205813007"
## [1] "cost 38890.4129059709"
## [1] "cost 38882.1355377966"
## [1] "cost 38883.1311215427"
## [1] "cost 38890.6235869212"
## [1] "cost 38879.7785868956"
## [1] "cost 38880.7044730814"
## [1] "cost 38873.4881513174"
## [1] "cost 38875.8536754574"
## [1] "cost 38865.8605837532"
## [1] "cost 38867.5657347872"
## [1] "cost 38870.3278913033"
## [1] "cost 38863.3885550348"
## [1] "cost 38864.4147041202"
## [1] "cost 38857.2819626349"
## [1] "cost 38859.690617494"
## [1] "cost 38849.5714004326"
## [1] "cost 38851.7033007129"
## [1] "cost 38850.6208994205"
## [1] "cost 38847.089895075"
## [1] "cost 38848.1629389336"
## [1] "cost 38841.2962171178"
## [1] "cost 38843.676894238"
## [1] "cost 38833.650218175"
## [1] "cost 38836.0726418426"
## [1] "cost 38832.5438060366"
## [1] "cost 38829.6335239844"
## [1] "cost 38848.7391922616"
## [1] "cost 38826.765623174"
## [1] "cost 38827.7832337842"
## [1] "cost 38822.6780126209"
## [1] "cost 38823.460448534"
## [1] "cost 38826.4641911654"
## [1] "cost 38821.1787886528"
## [1] "cost 38821.7876969097"
## [1] "cost 38817.3894864097"
## [1] "cost 38818.8507207154"
## [1] "cost 38812.7671737101"
## [1] "cost 38813.9344637268"
## [1] "cost 38814.1688777286"
## [1] "cost 38811.2698656789"
## [1] "cost 38811.9077242966"
## [1] "cost 38807.5929183414"
## [1] "cost 38809.094201809"
## [1] "cost 38802.4864798615"
## [1] "cost 38804.1036036707"
## [1] "cost 38801.6481329647"
## [1] "cost 38799.6782664577"
## [1] "cost 38812.2884163112"
## [1] "cost 38797.6717000165"
## [1] "cost 38798.3937605968"
## [1] "cost 38794.7586923369"
## [1] "cost 38795.3734675996"
## [1] "cost 38796.9080689531"
## [1] "cost 38793.6733592582"
## [1] "cost 38794.1215843897"
## [1] "cost 38790.9248668284"
## [1] "cost 38792.0088547455"
## [1] "cost 38787.3463506415"
## [1] "cost 38788.3425075487"
## [1] "cost 38787.8276282412"
## [1] "cost 38786.1704256908"
## [1] "cost 38786.6789704968"
## [1] "cost 38783.419020844"
## [1] "cost 38784.5532431818"
## [1] "cost 38779.6971403683"
## [1] "cost 38780.9030181571"
## [1] "cost 38778.800630011"
## [1] "cost 38777.6658809859"
## [1] "cost 38784.8455625519"
## [1] "cost 38776.2065571621"
## [1] "cost 38776.7559533253"
## [1] "cost 38773.9625317498"
## [1] "cost 38774.5532464433"
## [1] "cost 38774.3849697177"
## [1] "cost 38773.1107938758"
## [1] "cost 38773.4782952411"
## [1] "cost 38771.0330861581"
## [1] "cost 38771.89265509"
## [1] "cost 38768.1070892014"
## [1] "cost 38769.0725617475"
## [1] "cost 38767.3022043973"
## [1] "cost 38766.4301835403"
## [1] "cost 38772.4591689463"
## [1] "cost 38765.2231623766"
## [1] "cost 38765.6764286619"
## [1] "cost 38763.4111890566"
## [1] "cost 38763.8714433022"
## [1] "cost 38763.9002633564"
## [1] "cost 38762.7397785912"
## [1] "cost 38763.0274844577"

9. The following figure shows different recommender systems predictions for the missing user ratings for the movie Titanic.

10. The filled matrix with the low rank matrix factorization looks like the following.

out.png11. Evaluation of the R recommenderlab models on the same data with cross-validation scheme is shown below.

## RANDOM run 
##   1  [0sec/1.07sec] 
##   2  [0sec/1.08sec] 
##   3  [0sec/0.92sec] 
##   4  [0sec/1.04sec] 
## POPULAR run 
##   1  [0.03sec/0.43sec] 
##   2  [0.03sec/0.25sec] 
##   3  [0.03sec/0.25sec] 
##   4  [0.04sec/0.25sec] 
## UBCF run 
##   1  [0.01sec/42sec] 
##   2  [0.01sec/41.54sec] 
##   3  [0.02sec/44.57sec] 
##   4  [0.01sec/41.9sec] 
## IBCF run 
##   1  [181.12sec/1.11sec] 
##   2  [162.72sec/1.06sec] 
##   3  [167.15sec/1.1sec] 
##   4  [164.53sec/1.05sec]

11. Evaluation of the R recommenderlab models on the same data with cross-validation scheme is shown below.

f5f6

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s