Locality Sensitive Hashing Implementation for Approximate Fast Nearest Neighbor Search in R

In this article, an R implementation of locality sensitive hashing is going to be described and then used for fast approximate nearest neighbor-search in documents. The idea of document retrieval using LSH appears as one assignment in the Coursera Course Machine Learning Clustering and Retrieval.

As kd-tree based implementation of ANN search does not scale well with high dimensional data (such as text data), LSH provides an alternative implementation. As shown in the below figure, LSH guarantees that the probability that the true nearest neighbors will not be found with this approximate search technique as the search radius is increased becomes quite small.

math.png

Document Search with LSH

  • The following text dataset containing 50k+ wikipedia pages (documents) will be first used to retrieve the approximate nearest neighbors with LSH query.
Wikipedia documents
id URI name text
0 http://dbpedia.org/resource/Digby_Morrell Digby Morrell digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullants carltons vflaffiliate in 2006 and acted as playing assistant coach in 2007 in 2008 he shifted to the box hill hawks before retiring from playing at the end of the season from 2009 until 2013 morrell was the senior coach of the strathmore football club in the essendon district football league leading the club to the 2011 premier division premiership since 2014 he has coached the west coburg football club also in the edflhe currently teaches physical education at parade college in melbourne
1 http://dbpedia.org/resource/Alfred_J._Lewy Alfred J. Lewy alfred j lewy aka sandy lewy graduated from university of chicago in 1973 after studying psychiatry pharmacology and ophthalmology he is a full professor and vicechair of the department of psychiatry at ohsu oregon health science university and holds an md and phd prior to moving to oregon in 1981 lewy was at the national institute of mental health nimh in bethesda maryland working with senior colleague thomas wehr in oregon he has worked closely with robert l sack as of december 2005 he had 94 publications available on pubmed he describes his research as follows my laboratory studies chronobiologic sleep and mood disorders these disorders include winter depression jet lag maladaptation to shift work and certain types of sleep disturbances relying on a very precise assay for plasma melatonin a hormone that has a clearly defined 24hour pattern of secretion biological rhythm disorders can be assessed and their treatment can be monitored current research is focused on developing bright light exposure and melatonin administration as treatment modalities for these disorders treatment must be precisely scheduled morning light exposure and evening melatonin administration cause circadian phaseadvance shifts evening light exposure and morning melatonin administration cause circadian phasedelay shifts totally blind individuals have 25hour circadian rhythms drifting an hour later each day unless they take a melatonin capsule at a certain time every day
2 http://dbpedia.org/resource/Harpdog_Brown Harpdog Brown harpdog brown is a singer and harmonica player who has been active in canadas blues scene since 1982 hailing from vancouver he crossed tens of thousands of miles playing club dates and festivals in canada the northwestern united states and germanyover the years he has issued seven cds in 1995 his home is where the harp is won the muddy award for the best nw blues release from the cascade blues association in portland oregon as well that year it was nominated for a canadian juno for the best bluesgospel recording teamed up with graham guest on piano his cd naturally was voted 1 canadian blues album of 2010 by the blind lemon surveybrown tours extensively with his guitarist j arthur edmonds performing their electric mid1950s chicago blues either as a duo or with the full band while he is home he juggles a few combos working many venues big and small he also leads the harpdog brown band which is a gutsy traditional chicago blues band in 2014 they released what it is comprising mainly original songs and a few classic covers influential blues promoter and broadcaster holger petersen called what it is browns best albumhe was just awarded the maple blues award in toronto for best harmonica player in canada 2014 and was honored with a life time membership to the hamilton blues society
3 http://dbpedia.org/resource/Franz_Rottensteiner Franz Rottensteiner franz rottensteiner born in waidmannsfeld lower austria austria on 18 january 1942 is an austrian publisher and critic in the fields of science fiction and the fantasticrottensteiner studied journalism english and history at the university of vienna receiving his doctorate in 1969 he served about fifteen years as librarian and editor at the sterreichisches institut fr bauforschung in vienna in addition he produced a number of translations into german of leading sf authors including herbert w franke stanislaw lem philip k dick kobo abe cordwainer smith brian w aldiss and the strugatski brothersin 1973 his new york anthology view from another shore of european science fiction introduced a number of continental authors to the englishreading public some of the authors in the work are stanislaw lem josef nesvadba gerard klein and jeanpierre andrevonthe year 1975 saw the start of his series die phantastischen romane for seven years it republished works of both lesser and betterknown writers as well as new ones ending with a total of 28 volumes in the years 19791985 he brought out trnaslations of h g wellss works in an eighteen volumes seriesrottensteiner provoked some controversy with his negative assessment of american science fiction what matters is the highest achievements and there the us has yet to produce a figure comparable to hg wells olaf stapledon karel apek or stanisaw lem rottensteiner describedroger zelazny barry n malzberg and robert silverberg as producing travesties of fiction and stated asimov is a typical nonwriter and heinlein and andersonare just banal however rottensteiner praised philip k dick listing him as one of the greatest sf writers from 1980 through 1998 he was advisor for suhrkamp verlags phantastische bibliothek which brought out some three hundred books in all he has edited about fifty anthologies produced two illustrated books the science fiction book 1975 und the fantasy book 1978 as well as working on numerous reference works on science fiction his close association with and promotion of lem until 1995 was a factor in the recognition of the latter in the united statesrottensteiner has been the editor of quarber merkur the leading german language critical journal of science fiction since 1963 in 2004 on the occasion of the hundredth number of this journal he was awarded a special kurdlawitzpreis
  • A term-document matrix (with several thousands of dimensions) will be created from this corpus texts with the Tf-idf features for each of the documents.
  • Now LSH query will be used to retrieve the top 10 wiki pages (documents) that are most similar with Barack Obama’s page shown below (the text is not shown for brevity).
Wikipedia Query Document
id URI name
35818 http://dbpedia.org/resource/Barack_Obama Barack Obama
  • The following two tables show the results obtained with the LSH query with maximum search radius 1 and 3 respectively.
The top 20 true nearest neighbors retrieved as a result of the brute-force Query for Barack Obama
id URI name dist
24479 http://dbpedia.org/resource/Joe_Biden Joe Biden 0.47
47086 http://dbpedia.org/resource/Ray_Thornton Ray Thornton 0.53
31424 http://dbpedia.org/resource/Walter_Mondale Walter Mondale 0.54
30845 http://dbpedia.org/resource/Winston_Bryant Winston Bryant 0.54
4409 http://dbpedia.org/resource/Joe_Lieberman Joe Lieberman 0.55
50453 http://dbpedia.org/resource/John_F._Tierney John F. Tierney 0.55
43580 http://dbpedia.org/resource/Romano_L._Mazzoli Romano L. Mazzoli 0.55
36453 http://dbpedia.org/resource/Bill_Clinton Bill Clinton 0.55
36365 http://dbpedia.org/resource/Don_Bonker Don Bonker 0.56
4600 http://dbpedia.org/resource/Ernesto_Scorsone Ernesto Scorsone 0.56
57636 http://dbpedia.org/resource/Joe_Sestak Joe Sestak 0.57
44126 http://dbpedia.org/resource/Mike_Crapo Mike Crapo 0.57
44486 http://dbpedia.org/resource/Rick_Santorum Rick Santorum 0.57
37914 http://dbpedia.org/resource/Herbert_Titus Herbert Titus 0.57
57109 http://dbpedia.org/resource/Hillary_Rodham_Clinton Hillary Rodham Clinton 0.58
18828 http://dbpedia.org/resource/Henry_Waxman Henry Waxman 0.58
9267 http://dbpedia.org/resource/Stan_Aronoff Stan Aronoff 0.58
50845 http://dbpedia.org/resource/Thomas_Barlow_(Kentucky) Thomas Barlow (Kentucky) 0.58
23206 http://dbpedia.org/resource/Richard_Cordray Richard Cordray 0.59
The 0-approximate nearest neighbors retrieved as a result of the LSH Query (maximum search radius 0) for Barack Obama
id bin dist URI name
28670 30538 0.75 http://dbpedia.org/resource/Enrique_Mart%C3%ADnez_y_Mart%C3%ADnez Enrique Mart%C3%ADnez y Mart%C3%ADnez
31383 30538 0.76 http://dbpedia.org/resource/Kaye_Darveniza Kaye Darveniza
643 30538 0.86 http://dbpedia.org/resource/Igor_Vamos Igor Vamos
48456 30538 0.92 http://dbpedia.org/resource/Derek_Beaulieu Derek Beaulieu
The top 10 approximate nearest neighbors retrieved as a result of the LSH Query (maximum search radius 1) for Barack Obama
id bin dist URI name
19060 30536 0.66 http://dbpedia.org/resource/Janet_Nguyen Janet Nguyen
27434 32586 0.68 http://dbpedia.org/resource/Thomas_Duane Thomas Duane
21687 32586 0.68 http://dbpedia.org/resource/Brian_Campion_(politician) Brian Campion (politician)
57498 30026 0.69 http://dbpedia.org/resource/William_Barclay_(politician) William Barclay (politician)
34687 32586 0.70 http://dbpedia.org/resource/Daniel_Frisa Daniel Frisa
5683 32586 0.70 http://dbpedia.org/resource/Jeffrey_S._Boyd Jeffrey S. Boyd
12588 30282 0.70 http://dbpedia.org/resource/David_Mayernik David Mayernik
25769 30666 0.71 http://dbpedia.org/resource/Karla_M._Gray Karla M. Gray
28714 32586 0.72 http://dbpedia.org/resource/Suzi_Wizowaty Suzi Wizowaty
15326 30539 0.72 http://dbpedia.org/resource/Chen_Shui-bian Chen Shui-bian
The top 10 approximate nearest neighbors retrieved as a result of the LSH Query (maximum search radius 2) for Barack Obama
id bin dist URI name
24479 21322 0.47 http://dbpedia.org/resource/Joe_Biden Joe Biden
50453 32330 0.55 http://dbpedia.org/resource/John_F._Tierney John F. Tierney
1416 63322 0.63 http://dbpedia.org/resource/Rand_Paul Rand Paul
13991 30314 0.63 http://dbpedia.org/resource/Rufus_Rodriguez Rufus Rodriguez
9762 30286 0.64 http://dbpedia.org/resource/George_S._Latimer George S. Latimer
15229 26186 0.65 http://dbpedia.org/resource/Ronald_Saunders Ronald Saunders
19060 30536 0.66 http://dbpedia.org/resource/Janet_Nguyen Janet Nguyen
27091 32584 0.67 http://dbpedia.org/resource/Herb_Russell Herb Russell
15744 22090 0.67 http://dbpedia.org/resource/Barry_Keene Barry Keene
6552 30314 0.67 http://dbpedia.org/resource/James_E._Lockyer James E. Lockyer
The top 10 approximate nearest neighbors retrieved as a result of the LSH Query (maximum search radius 3) for Barack Obama
id bin dist URI name
24479 21322 0.47 http://dbpedia.org/resource/Joe_Biden Joe Biden
4409 28506 0.55 http://dbpedia.org/resource/Joe_Lieberman Joe Lieberman
50453 32330 0.55 http://dbpedia.org/resource/John_F._Tierney John F. Tierney
50845 28234 0.58 http://dbpedia.org/resource/Thomas_Barlow_(Kentucky) Thomas Barlow (Kentucky)
28448 30680 0.60 http://dbpedia.org/resource/George_W._Bush George W. Bush
3401 28234 0.61 http://dbpedia.org/resource/James_Bilbray James Bilbray
46360 26202 0.61 http://dbpedia.org/resource/Jackson_B._Davis Jackson B. Davis
44573 26490 0.62 http://dbpedia.org/resource/Robert_Gerald_Lorge Robert Gerald Lorge
50304 31818 0.62 http://dbpedia.org/resource/Jerry_Meek Jerry Meek
1416 63322 0.63 http://dbpedia.org/resource/Rand_Paul Rand Paul
  • As can be seen from the above table and the following figure, that the top 10 query results become more relevant as the search radius grows.
  • As we increase the search radius, we find more neighbors that are a smaller distance away.
  • With increased search radius comes a greater number documents that have to be searched. Query time is higher as a consequence.
  • With sufficiently high search radius, the results of LSH begin to resemble the results of brute-force search.

res1

  • First the impact of search radius on the quality metrics for the neighbors retrieved are examined.
  • The above analysis have been limited by the fact that it was run with a single query, namely Barack Obama. So the analysis should be repeated for the entirety of data. Iterating over all documents would take a long time, so 10 documents are randomly chosen for analysis.
  • For each document, first the true 25 nearest neighbors are computed (using brute-force), and then LSH was run multiple times. Then the following two metrics were inspected:
    • Precision@10: How many of the 10 neighbors given by LSH are among the true 25 nearest neighbors?
    • Average cosine distance of the neighbors from the query
    • Then run LSH multiple times with different search radii.
    • few of the results are shown.

results

Next the effect of number of random vectors on the quality metrics for the neighbors retrieved are examined.

  • Again, 10 documents are randomly choosen for analysis.
  • For each document, first the true 25 nearest neighbors are computed (using brute-force), and then LSH was run with different number of random vectors, ranging from 5 to 20, by fixing the search radius to 3.

results2

  • A similar trade-off between quality and performance can be seen from the above results: as the number of random vectors increases, the query time goes down as each bin contains fewer documents on average, but on average the neighbors are likewise placed farther from the query.
  • On the other hand, when using a small enough number of random vectors, LSH becomes very similar brute-force search: Many documents appear in a single bin, so searching the query bin alone covers a lot of the corpus; then, including neighboring bins might result in searching all documents, just as in the brute-force approach.
  • The next figure shows how the entire space was partitioned with the random vectors, for LSH trained with different numbers of random vectors.

rvs.png

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s