In this article, an R implementation of locality sensitive hashing is going to be described and then used for fast approximate nearest neighbor-search in documents. The idea of document retrieval using LSH appears as one assignment in the Coursera Course Machine Learning Clustering and Retrieval.
As kd-tree based implementation of ANN search does not scale well with high dimensional data (such as text data), LSH provides an alternative implementation. As shown in the below figure, LSH guarantees that the probability that the true nearest neighbors will not be found with this approximate search technique as the search radius is increased becomes quite small.
Document Search with LSH
- The following text dataset containing 50k+ wikipedia pages (documents) will be first used to retrieve the approximate nearest neighbors with LSH query.
|0||http://dbpedia.org/resource/Digby_Morrell||Digby Morrell||digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullants carltons vflaffiliate in 2006 and acted as playing assistant coach in 2007 in 2008 he shifted to the box hill hawks before retiring from playing at the end of the season from 2009 until 2013 morrell was the senior coach of the strathmore football club in the essendon district football league leading the club to the 2011 premier division premiership since 2014 he has coached the west coburg football club also in the edflhe currently teaches physical education at parade college in melbourne|
|1||http://dbpedia.org/resource/Alfred_J._Lewy||Alfred J. Lewy||alfred j lewy aka sandy lewy graduated from university of chicago in 1973 after studying psychiatry pharmacology and ophthalmology he is a full professor and vicechair of the department of psychiatry at ohsu oregon health science university and holds an md and phd prior to moving to oregon in 1981 lewy was at the national institute of mental health nimh in bethesda maryland working with senior colleague thomas wehr in oregon he has worked closely with robert l sack as of december 2005 he had 94 publications available on pubmed he describes his research as follows my laboratory studies chronobiologic sleep and mood disorders these disorders include winter depression jet lag maladaptation to shift work and certain types of sleep disturbances relying on a very precise assay for plasma melatonin a hormone that has a clearly defined 24hour pattern of secretion biological rhythm disorders can be assessed and their treatment can be monitored current research is focused on developing bright light exposure and melatonin administration as treatment modalities for these disorders treatment must be precisely scheduled morning light exposure and evening melatonin administration cause circadian phaseadvance shifts evening light exposure and morning melatonin administration cause circadian phasedelay shifts totally blind individuals have 25hour circadian rhythms drifting an hour later each day unless they take a melatonin capsule at a certain time every day|
|2||http://dbpedia.org/resource/Harpdog_Brown||Harpdog Brown||harpdog brown is a singer and harmonica player who has been active in canadas blues scene since 1982 hailing from vancouver he crossed tens of thousands of miles playing club dates and festivals in canada the northwestern united states and germanyover the years he has issued seven cds in 1995 his home is where the harp is won the muddy award for the best nw blues release from the cascade blues association in portland oregon as well that year it was nominated for a canadian juno for the best bluesgospel recording teamed up with graham guest on piano his cd naturally was voted 1 canadian blues album of 2010 by the blind lemon surveybrown tours extensively with his guitarist j arthur edmonds performing their electric mid1950s chicago blues either as a duo or with the full band while he is home he juggles a few combos working many venues big and small he also leads the harpdog brown band which is a gutsy traditional chicago blues band in 2014 they released what it is comprising mainly original songs and a few classic covers influential blues promoter and broadcaster holger petersen called what it is browns best albumhe was just awarded the maple blues award in toronto for best harmonica player in canada 2014 and was honored with a life time membership to the hamilton blues society|
|3||http://dbpedia.org/resource/Franz_Rottensteiner||Franz Rottensteiner||franz rottensteiner born in waidmannsfeld lower austria austria on 18 january 1942 is an austrian publisher and critic in the fields of science fiction and the fantasticrottensteiner studied journalism english and history at the university of vienna receiving his doctorate in 1969 he served about fifteen years as librarian and editor at the sterreichisches institut fr bauforschung in vienna in addition he produced a number of translations into german of leading sf authors including herbert w franke stanislaw lem philip k dick kobo abe cordwainer smith brian w aldiss and the strugatski brothersin 1973 his new york anthology view from another shore of european science fiction introduced a number of continental authors to the englishreading public some of the authors in the work are stanislaw lem josef nesvadba gerard klein and jeanpierre andrevonthe year 1975 saw the start of his series die phantastischen romane for seven years it republished works of both lesser and betterknown writers as well as new ones ending with a total of 28 volumes in the years 19791985 he brought out trnaslations of h g wellss works in an eighteen volumes seriesrottensteiner provoked some controversy with his negative assessment of american science fiction what matters is the highest achievements and there the us has yet to produce a figure comparable to hg wells olaf stapledon karel apek or stanisaw lem rottensteiner describedroger zelazny barry n malzberg and robert silverberg as producing travesties of fiction and stated asimov is a typical nonwriter and heinlein and andersonare just banal however rottensteiner praised philip k dick listing him as one of the greatest sf writers from 1980 through 1998 he was advisor for suhrkamp verlags phantastische bibliothek which brought out some three hundred books in all he has edited about fifty anthologies produced two illustrated books the science fiction book 1975 und the fantasy book 1978 as well as working on numerous reference works on science fiction his close association with and promotion of lem until 1995 was a factor in the recognition of the latter in the united statesrottensteiner has been the editor of quarber merkur the leading german language critical journal of science fiction since 1963 in 2004 on the occasion of the hundredth number of this journal he was awarded a special kurdlawitzpreis|
- A term-document matrix (with several thousands of dimensions) will be created from this corpus texts with the Tf-idf features for each of the documents.
- Now LSH query will be used to retrieve the top 10 wiki pages (documents) that are most similar with Barack Obama’s page shown below (the text is not shown for brevity).
- The following two tables show the results obtained with the LSH query with maximum search radius 1 and 3 respectively.
|28670||30538||0.75||http://dbpedia.org/resource/Enrique_Mart%C3%ADnez_y_Mart%C3%ADnez||Enrique Mart%C3%ADnez y Mart%C3%ADnez|
|21687||32586||0.68||http://dbpedia.org/resource/Brian_Campion_(politician)||Brian Campion (politician)|
|57498||30026||0.69||http://dbpedia.org/resource/William_Barclay_(politician)||William Barclay (politician)|
|5683||32586||0.70||http://dbpedia.org/resource/Jeffrey_S._Boyd||Jeffrey S. Boyd|
|25769||30666||0.71||http://dbpedia.org/resource/Karla_M._Gray||Karla M. Gray|
|50453||32330||0.55||http://dbpedia.org/resource/John_F._Tierney||John F. Tierney|
|9762||30286||0.64||http://dbpedia.org/resource/George_S._Latimer||George S. Latimer|
|6552||30314||0.67||http://dbpedia.org/resource/James_E._Lockyer||James E. Lockyer|
|50453||32330||0.55||http://dbpedia.org/resource/John_F._Tierney||John F. Tierney|
|50845||28234||0.58||http://dbpedia.org/resource/Thomas_Barlow_(Kentucky)||Thomas Barlow (Kentucky)|
|28448||30680||0.60||http://dbpedia.org/resource/George_W._Bush||George W. Bush|
|46360||26202||0.61||http://dbpedia.org/resource/Jackson_B._Davis||Jackson B. Davis|
|44573||26490||0.62||http://dbpedia.org/resource/Robert_Gerald_Lorge||Robert Gerald Lorge|
- As can be seen from the above table and the following figure, that the top 10 query results become more relevant as the search radius grows.
- As we increase the search radius, we find more neighbors that are a smaller distance away.
- With increased search radius comes a greater number documents that have to be searched. Query time is higher as a consequence.
- With sufficiently high search radius, the results of LSH begin to resemble the results of brute-force search.
- First the impact of search radius on the quality metrics for the neighbors retrieved are examined.
- The above analysis have been limited by the fact that it was run with a single query, namely Barack Obama. So the analysis should be repeated for the entirety of data. Iterating over all documents would take a long time, so 10 documents are randomly chosen for analysis.
- For each document, first the true 25 nearest neighbors are computed (using brute-force), and then LSH was run multiple times. Then the following two metrics were inspected:
- Precision@10: How many of the 10 neighbors given by LSH are among the true 25 nearest neighbors?
- Average cosine distance of the neighbors from the query
- Then run LSH multiple times with different search radii.
- few of the results are shown.
Next the effect of number of random vectors on the quality metrics for the neighbors retrieved are examined.
- Again, 10 documents are randomly choosen for analysis.
- For each document, first the true 25 nearest neighbors are computed (using brute-force), and then LSH was run with different number of random vectors, ranging from 5 to 20, by fixing the search radius to 3.
- A similar trade-off between quality and performance can be seen from the above results: as the number of random vectors increases, the query time goes down as each bin contains fewer documents on average, but on average the neighbors are likewise placed farther from the query.
- On the other hand, when using a small enough number of random vectors, LSH becomes very similar brute-force search: Many documents appear in a single bin, so searching the query bin alone covers a lot of the corpus; then, including neighboring bins might result in searching all documents, just as in the brute-force approach.
- The next figure shows how the entire space was partitioned with the random vectors, for LSH trained with different numbers of random vectors.