AutoSuggest Retrieval & Ranking (Part 2)

Aakash Prakash · Published in OLX Engineering
Dec 22, 2020 · 10 min read


OLX AutoSuggest

In the previous post, we answered questions like “What is autosuggest?” and “Why do we need it?”, followed by a discussion of the techniques we used to build the corpus for autosuggest at OLX.

In this post, we’ll focus on the algorithms and techniques we used to convert that corpus into a full-fledged autocomplete suggester. So, let’s get to it!

Old Algorithm

Our older autosuggest algorithm created suggestions from ad fields like title, description and category (Cars, Mobile Phones, etc.), mainly taking into account what users post, i.e. the inventory that we have, rather than what users search for on our platform.

Problems

  • Does not take into account how users search.
  • Does not take into account which suggestions lead to better clicks/responses.
  • Does not take into account users’ interests.

The new algorithm that we are going to discuss tries to overcome the above-mentioned shortcomings.

Goals

Before we do a deep dive into building the suggester, let’s take a look at the problem statement first.

Autocomplete Suggester

The diagram above shows the flow of what we are trying to build.

We already built the corpus in the last post, giving us a list of past user searched queries along with their respective occurrence counts (or frequencies).

Now, whenever the user types a query in the search box, we want to show the user suggestions from the corpus. If we take a closer look, the process can simply be broken down into two parts —

  1. Retrieval
  2. Ranking

Let’s go through both of these steps in detail one by one.

What is Retrieval ?

Retrieval

Input : User searched query and the autosuggest corpus.

Output : The list of suggestions from the corpus matching the user searched query, along with their frequency counts.

We want to build a system that stores our corpus and can answer the following question —

What are the top ‘N’ suggestions with the maximum score that match the user searched query?

Here, matching refers to comparing two strings such that some part is common to both.

For example :

If a user types “scoote”, then we want to display suggestions like “scooter”, “scooty” or “electric scooter”.

There are various approaches to matching that we could use here. We will be going through the following three strategies of string matching —

  1. Infix match : Exact prefix match with the whole suggestion.
  2. Blended Infix match : Exact prefix match with any word in the suggestion.
  3. Fuzzy match : Prefix match within ‘k’ edit distance.

String matching
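
To make these strategies concrete, here is a tiny, self-contained Python sketch of what each kind of match would return for a toy corpus. The actual matching in our system is done by Solr's lookup implementations (covered later); the helper names and data below are purely illustrative.

# Toy corpus: suggestion -> frequency
corpus = {"scooter": 900, "scooty": 750, "electric scooter": 400, "car": 1200}

def infix_match(prefix, corpus):
    # The whole suggestion must start with the typed text.
    return {s: f for s, f in corpus.items() if s.startswith(prefix)}

def blended_infix_match(prefix, corpus):
    # Any word inside the suggestion may start with the typed text.
    return {s: f for s, f in corpus.items()
            if any(word.startswith(prefix) for word in s.split())}

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def fuzzy_match(prefix, corpus, k=1):
    # Prefix match that tolerates up to 'k' edits in the typed text.
    return {s: f for s, f in corpus.items()
            if edit_distance(prefix, s[:len(prefix)]) <= k}

print(infix_match("scoote", corpus))          # {'scooter': 900}
print(blended_infix_match("scoote", corpus))  # adds 'electric scooter'
print(fuzzy_match("scoote", corpus, k=1))     # adds 'scooty' (1 edit away)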

Now that we have an understanding of the strategies for getting the list of suggestions matching the user searched query, let’s take a look at the ordering or ranking of suggestions.

What is Ranking ?

Ranking

Input : List of suggestions with frequency counts.

Output : Ordered list of suggestions according to frequency counts and/or any other external scoring factor.

There are multiple ways of ranking or ordering the list of suggestions, among which the following are the most common —

  1. Sort only by the frequency of each suggestion (score = f).
  2. Sort by frequency combined with other important business metrics, like the no. of listings opened by the user, the no. of replies after clicking on that suggestion, etc. (score = w1 * f1 + w2 * f2 + … + wn * fn, where f1, f2, …, fn are the features important for ranking the suggestion and w1, w2, …, wn are the weights denoting the importance of the corresponding features).

By applying either of the above scoring techniques, you will get a score corresponding to each suggestion in the corpus. We can then sort the relevant suggestions according to these scores.

Instead of storing the frequency, we could actually precompute this score for each suggestion offline and store it, so that we don’t have to re-calculate the score for a suggestion again and again for each user query.
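
As a rough illustration of the second approach, this score can be precomputed offline along the following lines. The feature names, weights and numbers here are purely hypothetical.

# Hypothetical weights for each ranking feature (w1, w2, ..., wn).
WEIGHTS = {"frequency": 1.0, "listings_opened": 0.5, "replies": 2.0}

def score(features):
    # score = w1 * f1 + w2 * f2 + ... + wn * fn
    return sum(WEIGHTS[name] * value for name, value in features.items())

# Hypothetical per-suggestion feature values aggregated from past behaviour.
suggestion_features = {
    "maruti suzuki": {"frequency": 950, "listings_opened": 400, "replies": 35},
    "iphone 11": {"frequency": 1200, "listings_opened": 310, "replies": 20},
}

precomputed = {s: score(f) for s, f in suggestion_features.items()}
print(precomputed)  # {'maruti suzuki': 1220.0, 'iphone 11': 1395.0}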

Let’s elaborate on the first scoring approach, which uses only the frequency, to understand how we can fit it into our design, which has different kinds of suggestions coming from infix match, blended infix match and fuzzy match.

Blending

The concept of merging these different kinds of suggestions is what we call blending.

Following is the algorithm that we use for blending the suggestions from different strategies —

  1. Fetch top ’N’ suggestions from each strategy (infix, blended infix and fuzzy) matching the user searched query.
  2. Decay (or divide) the frequency of each suggestion that we got from a strategy by the strategy’s decay factor (d).
  3. Add all infix suggestions to the final list.
  4. Add all the blended infix suggestions which are not already a part of the final list.
  5. Repeat the previous step for fuzzy suggestions.
  6. Sort the final list of suggestions by decayed frequencies.
  7. Return the top ’N’ suggestions from the final list.

The decay factor (d) of each strategy should obviously be in the following order —

infix < blended infix < fuzzy

Let’s take the decay factor of infix as 1, blended infix as 2 and fuzzy as 3. Then, the algorithm would work as shown below —

Blending

The sorted list of suggestions would then be suggested as the user types in the query.
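
Below is a minimal Python sketch of this blending algorithm, assuming each strategy has already returned its top suggestions with raw frequencies. The decay factors and sample data are illustrative.

# Decay factor per strategy: infix < blended infix < fuzzy.
DECAY = {"infix": 1, "blended_infix": 2, "fuzzy": 3}

def blend(results_by_strategy, n=10):
    """results_by_strategy maps strategy name -> list of (suggestion, frequency)."""
    blended = {}
    # Walk the strategies in order of increasing decay, so a suggestion keeps
    # the (higher) decayed frequency from the strongest strategy it appears in.
    for strategy in ("infix", "blended_infix", "fuzzy"):
        for suggestion, frequency in results_by_strategy.get(strategy, []):
            if suggestion not in blended:            # skip if already added
                blended[suggestion] = frequency / DECAY[strategy]
    # Sort by decayed frequency and return the top 'N'.
    ranked = sorted(blended.items(), key=lambda item: item[1], reverse=True)
    return [suggestion for suggestion, _ in ranked[:n]]

results = {
    "infix": [("scooter", 900)],
    "blended_infix": [("scooter", 900), ("electric scooter", 400)],
    "fuzzy": [("scooty", 750)],
}
print(blend(results))  # ['scooter', 'scooty', 'electric scooter']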

NOTE : You should also remove the suggestions which are not going to fetch any results or listings, as that would lead to a bad user experience.

Implementation

High Level Diagram (HLD)

As you can see from the HLD above, there are mainly 4 components involved. We’ll go through each of these components in detail, one by one.

Solr

Apache Solr is an open-source search platform based on the Apache Lucene project. Its major features include full-text search, real-time indexing, database integration, etc.

We will mainly be using its suggester component, which will serve as the retrieval component that we discussed above.

If you have worked with Solr before, you might be aware that we need to create a collection, which consists of two main files —

  1. solrconfig.xml : It is mainly used to configure the high-level behaviour of Solr, like the request handlers, components, cache configuration, etc.
  2. managed-schema (or schema.xml) : It is used to define the properties or the structure of the documents that we are going to store in Solr.

Following are the configurations that we are going to use —

Add the following suggest component to your solrconfig.xml. It defines the 3 string-matching strategies that we discussed above, namely AnalyzingLookupFactory (Infix), BlendedInfixLookupFactory (Blended Infix) and FuzzyLookupFactory (Fuzzy).

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">queryPrefixStrategy</str>
    <str name="lookupImpl">AnalyzingLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">query</str>
    <str name="weightField">cost</str>
    <str name="suggestAnalyzerFieldType">queryKT</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">true</str>
  </lst>
  <lst name="suggester">
    <str name="name">queryBlendedStrategy</str>
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">query</str>
    <str name="weightField">cost</str>
    <str name="suggestAnalyzerFieldType">queryST</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">true</str>
    <str name="highlight">false</str>
  </lst>
  <lst name="suggester">
    <str name="name">queryFuzzyStrategy</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">query</str>
    <str name="weightField">cost</str>
    <str name="suggestAnalyzerFieldType">queryKT</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

Now, let’s add the request handler for this component in our solrconfig.xml.

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">queryPrefixStrategy</str>
    <str name="suggest.dictionary">queryBlendedStrategy</str>
    <str name="suggest.dictionary">queryFuzzyStrategy</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Now that we have set up the request handler for the suggestion strategies that we will use to query our corpus in Solr, let’s define the document structure (or schema) that we will use to store the processed user searched queries along with their scores.

<schema name="autocomplete-v1" version="1.0">
  <fields>
    <field name="id" type="string" required="true" multiValued="false" />
    <field name="query" type="string" required="false" indexed="true" stored="true" multiValued="false" />
    <field name="queryKT" type="queryKT" required="false" indexed="true" stored="true" multiValued="false" />
    <field name="queryST" type="queryST" required="false" indexed="true" stored="true" multiValued="false" />
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />
    <field name="cost" type="long" indexed="false" stored="true" multiValued="false" />
  </fields>

  <uniqueKey>id</uniqueKey>

  <copyField source="query" dest="queryKT" />
  <copyField source="query" dest="queryST" />

  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" />

    <fieldType name="queryKT" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory" />
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement=""/>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="queryST" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" words="stopwords_en.txt" ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement=""/>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>

Let’s go through each of the above mentioned fields —

  1. query : It is used to store the actual search query, like maruti suzuki or iphone 11.
  2. cost : It is used to store the score corresponding to the search query.
  3. queryKT : This field is actually a slight modification of the query field; it is the one used by the queryPrefixStrategy (infix matching) and the queryFuzzyStrategy (fuzzy matching).
  4. queryST : This field is also a modification of the query field, and it is used in the blended infix strategy.

Let’s now go through the field types —

  1. string : This is nothing but a normal text field, used by the query field.
  2. long : This is used by the cost field for storing the score of the query.
  3. queryKT : This field type stores (or indexes) the field as is and doesn’t do any modifications, as we need to compare the whole string in case of infix and fuzzy matching. At query time, it removes special characters and lowercases the user searched query before comparison.
  4. queryST : This field type stores (or indexes) the field by breaking multi-word queries into separate searchable tokens, as it is used in the blended infix strategy where we need to treat all the words separately so that matching can happen anywhere. It filters out stop words, removes special characters and lowercases the query.

For example :

An entry in the processed queries like “Maruti Suzuki, 200” will be stored as query = Maruti Suzuki, cost = 200, queryKT = maruti suzuki, queryST = [maruti, suzuki]

Now that we have finally built our retrieval component, we can run queries like —

{SOLR_HOST}/solr/{SOLR_COLLECTION_NAME}/suggest?suggest=true&suggest.build=false&suggest.dictionary=queryPrefixStrategy&suggest.dictionary=queryBlendedStrategy&suggest.dictionary=queryFuzzyStrategy&suggest.q=ca

It will give us results from all three strategies for query = “ca” like below —

Suggest Query Response
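
The same handler can of course be called programmatically. The Python sketch below assumes placeholder host and collection names, and the usual SuggestComponent response layout (suggest → dictionary → query → suggestions); adapt it to your setup.

import requests

SOLR_SUGGEST_URL = "http://localhost:8983/solr/autocomplete/suggest"  # placeholder
DICTIONARIES = ["queryPrefixStrategy", "queryBlendedStrategy", "queryFuzzyStrategy"]

def fetch_suggestions(q):
    # Ask all three suggesters for matches on the typed text 'q'.
    params = [("suggest", "true"), ("suggest.build", "false"), ("suggest.q", q)]
    params += [("suggest.dictionary", d) for d in DICTIONARIES]
    body = requests.get(SOLR_SUGGEST_URL, params=params, timeout=2).json()
    # Collect (term, weight) pairs per strategy from the response.
    return {
        d: [(s["term"], s["weight"]) for s in body["suggest"][d][q]["suggestions"]]
        for d in DICTIONARIES
    }

print(fetch_suggestions("ca"))  # e.g. {'queryPrefixStrategy': [('car', 1200), ...], ...}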

Redis

We are going to use Redis for caching the autosuggest response for a particular query.

So, whenever a query comes in for a particular keyword, the response corresponding to that keyword is cached and can be used to serve subsequent requests for the same query, improving response time and preventing hits to Solr.

One optimisation you could do to prevent overuse of Redis is to store responses only for queries whose length is less than, say, ‘X’ characters, instead of storing all responses. The number of possible distinct keys grows exponentially with each character, so for ‘X’ characters there can be up to 26 ^ X distinct keys.
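
A small Python sketch of this caching idea is shown below; the key prefix, TTL and the length threshold ‘X’ are illustrative choices.

import json
import redis

cache = redis.Redis(host="localhost", port=6379)   # placeholder connection
MAX_CACHED_QUERY_LENGTH = 5                        # the 'X' from the text
CACHE_TTL_SECONDS = 6 * 3600

def get_cached(query):
    value = cache.get(f"autosuggest:{query}")
    return json.loads(value) if value else None

def set_cached(query, suggestions):
    # Cache only short queries to keep the key space (up to 26 ^ X) bounded.
    if len(query) < MAX_CACHED_QUERY_LENGTH:
        cache.setex(f"autosuggest:{query}", CACHE_TTL_SECONDS, json.dumps(suggestions))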

AutoSuggest Service

The autosuggest service will be responsible for the following things —

  1. Serve the ‘/suggest’ endpoint.
  2. For each request, get the response from cache if present.
  3. Else fetch the suggestions from all three strategies.
  4. Blend them and rank them according to the score as we discussed in the ranking section earlier.
  5. Finally, store the response containing the top ‘N’ suggestions for the user searched query in the cache if its length is less than ‘X’, and return it.
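
Putting these steps together, the service flow could look roughly like the sketch below. Flask is used purely for illustration, and get_cached, set_cached, fetch_suggestions and blend are the illustrative helpers sketched earlier, not an actual OLX API.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/suggest")
def suggest():
    query = request.args.get("q", "").strip().lower()
    cached = get_cached(query)
    if cached is not None:                          # 1. serve from cache if present
        return jsonify(cached)
    by_strategy = fetch_suggestions(query)          # 2. fetch from all three strategies
    suggestions = blend({                           # 3. blend and rank by decayed score
        "infix": by_strategy["queryPrefixStrategy"],
        "blended_infix": by_strategy["queryBlendedStrategy"],
        "fuzzy": by_strategy["queryFuzzyStrategy"],
    })
    set_cached(query, suggestions)                  # 4. cache short queries only
    return jsonify(suggestions)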

Preprocessor

This component will mainly have 3 responsibilities —

  1. Fetch : Fetch the past user searched queries.
  2. Process : Process the queries using the algorithms discussed in the previous post.
  3. Index : Index (or store) the corpus (i.e. the processed user searched queries along with their respective scores) in Solr.

We can schedule a cron job to run the preprocessor regularly.

For example :

If you are taking the last month’s user searched queries for processing, then you could run the preprocessor every week or every day, depending on how frequently your users’ search queries change.
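
For the index step, a minimal sketch could push the processed queries and their scores into Solr through the standard JSON update handler, as shown below; the URL, document ids and input data are placeholders.

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/autocomplete/update"  # placeholder

def index_corpus(processed_queries):
    """processed_queries maps query text -> precomputed score (the 'cost' field)."""
    docs = [
        {"id": str(i), "query": q, "cost": int(score)}
        for i, (q, score) in enumerate(processed_queries.items())
    ]
    # commit=true triggers a suggester rebuild, since buildOnCommit is enabled above.
    response = requests.post(SOLR_UPDATE_URL, json=docs, params={"commit": "true"}, timeout=30)
    response.raise_for_status()

index_corpus({"maruti suzuki": 200, "iphone 11": 1395})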

A/B Test Results

  1. India : Null searches decreased by 2% and autosuggest adoption increased by 4–5%.
  2. Indonesia : Null searches decreased by 5.1%, autosuggest adoption increased by 31% and search conversion increased by 1.2%.
  3. Pakistan : Null searches decreased by 20%, autosuggest adoption increased by 4% and search conversion increased by 4.1%.
  4. South Africa : Null searches decreased by 5%, autosuggest adoption increased by 8.5% and search conversion increased by 2.4%.
  5. LATAM Countries (Argentina, Colombia, Ecuador, Costa Rica, Guatemala, Panama, Peru, El Salvador): Autosuggest adoption increased by 15.7% and search conversion increased by 1.4%.

Summary

We discussed various techniques for retrieval & ranking, and what it takes to build a full-fledged, scalable autocomplete suggester, including the various components involved.

Next Part:

In the next part, we will discuss various techniques to build catalog suggestions, along with further improvements or optimisations that can be made to the suggester to increase adoption.
