Skip to end of metadata
Go to start of metadata

Intro

The autocomplete query is slow for short queries (2 and 3 letters), because we use prefix search and there are many matches.
Note: this is largely mitigated by better Autocomplete Ranking, since we can safely limit to 50-100 results.

There's a fine tradeoff between functionality and speed:

  • not to exclude 2 letters, eg "Ur"
  • not to make the user wait too long after
  • not to make a short query too fast, thus delaying performance

Related Experience

  • The autocomplete at http://www.linkedlifedata.com is very fast. It doesn't use prefix queries (eg "asp" doesn't find "aspirin") but finds misspellings (eg "aspiri" finds "aspirin").
    • It uses editing (Levensthein) distance, which is a Lucene query option. It allows up to 3 misspelt chars:
      aspiri~3
    • it ranks matches using TF-IDF ranking
    • it uses the Forest autocomplete module
  • The autocomplete of another system of the LifeSci group uses prefix queries and is still fast. But it uses external Solr (parallelized Lucene), not the Lucene built into OWLIM

Suggestions

  • when the search is for 2 chars, do exact keyword search ("re" instead of "re*"), to make it faster. Works for the "Ur" case above
  • when the result list is bigger than 100 items, the SearchAPI should truncate it to 100 results (to make things faster in the js-part); the rationale is that the user doesn't need that many results in an autocomplete box; the results are sorted (currently semi-alphabetical, with the words starting with the typed word on top)
  • delay the 2&3-char query for further 0.5 seconds
    I.e. increase the UI typing timeout, so if the user starts typing "London", NO query for "Lon" will be made
  • Approximate vs Prefix Search
  • Cache short queries (Vlado is against such complications)

Approximate vs Prefix Search

The autocomplete at http://www.linkedlifedata.com is fast and usable.
It doesn't use prefix queries (eg "asp" doesn't find "aspirin") but finds misspellings (eg "aspiri" finds "aspirin"). Vlado asked Kosyo:

  • it uses editing (Levensthein) distance, which is a Lucene query option. It allows up to 3 misspelt chars with a Lucene "approximate" query like:
    aspiri~3
  • it ranks matches using TF-IDF ranking
  • it uses the Forest autocomplete module

Vlado asked Dominic whether it's necessary to use prefix queries: should "Rem" find "Rembrandt", or is it sufficient for "Rembrand" and "Rembrant" and similar mis-spellings to find it

While prefix search on the first word is not so necessary, it's necessary on subsequent words.

  • Eg on http://www.linkedlifedata.com:
    1. "aspiri" shows various matches for aspirin, including "litecoat aspirin"
    2. "aspirin li" shows nothing
    3. user has to type all the way to "aspirin litecoa" to get again "litecoat aspirin"
  • If we don't do prefix search, then for "british museum"
    1. "british" will show "british museum"
    2. "british m" will show nothing
    3. user has to type all the way to "british museu" to get again "british museum"

Maybe the best approach is to combine the two: use approximate search for the first word, and prefix for the rest, eg:

britis~3 m*

will find "british museum" and also "british mint", etc.

We finally decided that we want prefix

Timing

(Old observation)

  • The backend executes a prefix query "re*" in 14s (5482 results) and a non-prefix query "re" in 1.3s (16 results).
  • The frontend is much slower (minutes) because of js-part of the autocomplete component which is slow to deal with big result sets.

Data as of 3 Sep 2012. Timing is in ms:

Query #Results Owlim4 Local Owlim4 Remote Owlim5 Local Owlim5 Remote
oil p 27 906 1752 1113 425
rem 130 2143 10753 2721 1508
fieren 1 7 138 25 30
Vries 18 22 188 60 41
Poorter 3 10 91 17 29
Metropolitan 29 97 319 174 122
Lon 1151 2666 20075 4740 5146
London 448 1937 20606 2832 4426

Lucene Alone vs with Thesaurus Restrictions

Mitac timed the Lucene query alone, without the additional thesaurus restrictions

  • OWLIM/Lucene restriction alone: 200-500ms
  • additional SPARQL clauses (thesaurus restrictions): 1-5s (5-10 times slower)

Lucene Direct vs Lucene in Owlim

Mitac implemented Lucene direct indexing (using LuceneAPI). The index is created from the existing thesauri terms and has the following fields:

Field Description STORE/INDEX options
CONTENT all rdf:label-s Store.NO, Index.ANALYZED
RANK rdfRank from OWLIM Store.YES, Index.NO; Used in the final lucene ordering
THES List of the thesauri to which the current item belongs, separated by space Store.YES, Index.NO
LABEL prefLabel, previously computed by the rule (en, none, nl) - cached in this field Store.YES, Index.NO
SCOPENOTE The scope note, previously computed - cached in this field Store.YES, Index.NO
URI The URI of the item Store.YES, Index.NO
  • the indexing takes ~30mins, including the execution of the Owlim queries
  • the performance of the search varies from 22% (for easier searches, like "Poorter" above to 8800% for "Lon*"). See the spreadsheet for details
  • tested that the top results of the current search and the new search match (up to certain point at least; e.g. for 100 or more results we cannot expect the order to be the same - just for the items with a non-zero rank)
  • there are two differences, for "rem*" and for "ams*" where the LuceneDirect search results 8 and 1 less results, respectively. This is probably related to some of the searcher options in Lucene and seems OK
    lucene-direct-vs-owlim.xlsx

The autocomplete of an AZ project by our LifeSci group uses prefix, and is fast.
Vlado asked Dancho: it uses an external SOLR index (parallel Lucene), not the Lucene index embeded in OWLIM.

Labels:
None
Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.