Learning-To-Rank: Sorting Search Results

How Machine Learning brings buyers and sellers together on OLX

Ilia Ivanov
OLX Engineering

--

Imagine you’ve decided to buy a sports car. If you go to a car dealership, you might start by asking an employee where all the sports cars they sell are. Then you’ll likely want to know which of those best fit your budget. Although online marketplaces such as OLX can’t offer this kind of human interaction, they can provide even better service, powered by a huge catalog and algorithms that analyze content and user behavior.

In a typical use case, a customer types a query in natural language, and the platform is expected to show all the matching items (e.g. all the sports cars) and, ideally, bring the best offers to the top of the result page. The problem of ordering/sorting items (or, more generally, any entities) in the best possible way is known as ranking. The Machine Learning approach to ranking is known as Learning To Rank, or LTR.

In this article, I’ll describe

  1. The evolution of ranking at OLX
  2. How we integrated LTR into our ecosystem
  3. And whether it was worth it (spoiler: yes)

Before LTR

On OLX, the online marketplace, people regularly post classified ads in order to sell something. The catalog currently contains millions of ads, so users naturally have the opportunity to pick and choose the best offer for themselves. However, less than a year ago it wasn’t possible to sort search results by any criteria other than ad age or price. In some cases, users had to spend a significant amount of time comparing a bunch of ads or expanding their query (by adding filters/keywords) in order to find a good deal.

Demonstration of sorting by ad age (not optimal)

It was only last year that we introduced a new sorting option called “Recommended”. We wanted to show the best ads to our customers, not just the newest or cheapest ones. After a considerable amount of research and a huge engineering effort, we ended up with a small set of factors, or features, that seemed promising for the recommended sorting. These features can be split into 3 main categories:

  1. Text similarity (e.g. whether the query was found in the ad title)
  2. Ad statistics (e.g. CTR)
  3. Temporal (e.g. the ad age)
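
To make this concrete, here’s a minimal sketch of what features from each category might look like. The names and formulas are hypothetical, simplified stand-ins for the real ones:

    import time

    def extract_features(query: str, ad: dict) -> dict:
        """Compute a toy feature vector for a <query, ad> pair."""
        query_terms = set(query.lower().split())
        title_terms = set(ad["title"].lower().split())
        return {
            # 1. Text similarity: share of query terms found in the ad title
            "title_match": len(query_terms & title_terms) / max(len(query_terms), 1),
            # 2. Ad statistics: click-through rate with additive smoothing
            "ctr": (ad["clicks"] + 1) / (ad["impressions"] + 100),
            # 3. Temporal: ad age in days
            "ad_age_days": (time.time() - ad["created_at"]) / 86400,
        }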

And when I say a huge engineering effort, I mean it. Making dynamic information like CTR available for ranking was no joke, so I really admire the work our engineers put in to achieve these results.

After a few A/B-tests, all these features were hand-crafted into a simple formula that brought positive outcomes. Users could much more often find what they were looking for in the first positions of search results; consequently, the overall number of purchases increased significantly. This success became evidence that there’s value in the project and that we should invest even more in this core functionality of the platform.
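
For illustration, such a hand-crafted formula could look like the sketch below. The weights here are invented for the example; the real ones were tuned through experimentation:

    def recommended_score(features: dict) -> float:
        # Hypothetical weights -- the real ones were tuned via A/B-tests.
        return (
            2.0 * features["title_match"]    # reward query/title overlap
            + 5.0 * features["ctr"]          # reward historically popular ads
            - 0.1 * features["ad_age_days"]  # prefer fresher ads
        )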

LTR

If you’re familiar with Machine Learning, you can probably see why this problem is a typical ML use case. Multiple features, a clear target (conversion) dependent on the features, and most importantly big volumes of data asking to be collected and learned from! This is where the story of Learning To Rank at OLX begins.

As stated earlier, LTR can be considered simply ML for ranking problems. You’ll see that there’s not much of a difference between LTR and classical supervised learning.

Every problem starts with a statement, so let’s try to formulate it:

  1. For a given user query there is a set of ads (search results)
  2. Every ad has a different probability of conversion
  3. Our objective is to sort the ads by this probability

Metrics

Why this objective? We compare our ranking solutions via A/B-testing using a comprehensive set of success, counter, and ecosystem metrics. “The number of conversions on the first page of search results” is the first measure we go for: it’s easy to understand and explain, and it’s directly related to our key business metrics. This measure is what our objective (to sort ads by conversion probability) is designed to address.

When it comes to offline metrics that reflect the objective and allow us to compare different models, we pick NDCG, the industry standard for ranking tasks. Since the conversion probabilities we’re interested in aren’t known, binary labels are used instead: whether an item was purchased or not. This means the higher a given algorithm puts the items that were purchased, the higher the metric value.
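
As a quick sketch, here’s how NDCG with binary labels can be computed for a single search session, using the common logarithmic position discount:

    import math

    def dcg(labels):
        """Discounted cumulative gain of labels in ranked order."""
        return sum(l / math.log2(i + 2) for i, l in enumerate(labels))

    def ndcg(labels):
        """NDCG for binary purchase labels, ordered as a model ranked them."""
        ideal = dcg(sorted(labels, reverse=True))
        return dcg(labels) / ideal if ideal > 0 else 0.0

    # The purchased item was ranked second out of three:
    print(ndcg([0, 1, 0]))  # ~0.63; a perfect ranking scores 1.0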

Data

With the target metric defined, we can attempt to tackle the problem. ML solutions in general begin with data, so let’s design our historical dataset. Imagine somebody came to the platform a few days ago to buy something, the search engine found only 3 matching ads, and the person purchased the second item. This session would be represented in the dataset in the following way:
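
    query          ad     features           label
    "sports car"   ad_1   <feature vector>   0
    "sports car"   ad_2   <feature vector>   1
    "sports car"   ad_3   <feature vector>   0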

Each row corresponds to a <query, ad> pair from a user search session. The label tells us whether the user made a purchase; to be more accurate, we consider not only purchases but also messages/calls to sellers, because transactions aren’t always confirmed. And finally, the features column combines all the item features we described before: some of them are query-agnostic (e.g. ad age, CTR), and some of them are not (e.g. title match and all sorts of text similarity measures).

Model

Now that we know a formal objective and we’ve got the data, it’s time to build a model. The industry standard for tabular data is Multiple Additive Regression Trees (MART). One of these trees can be visualized like this:

In order to train a model, we apply Gradient Boosting and learn the parameters from the training data while optimizing our target metric. Although NDCG is a non-differentiable function and we can’t calculate its gradient directly, there are good approximations for optimizing it, one of them being LambdaRank. LambdaRank applied to MART is also known as LambdaMART. We use the implementation of this algorithm provided by XGBoost. Let’s not forget about hyperparameter optimization, cross-validation, and testing on a held-out dataset. And here we’ve got a trained model ready to be tested online!
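
Here’s a minimal training sketch, assuming a recent XGBoost version; the toy data and hyperparameters are purely illustrative:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.random((9, 3))                        # feature vectors for 9 <query, ad> pairs
    y = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0])    # binary purchase labels
    qid = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # query ids: 3 search sessions

    model = xgb.XGBRanker(
        objective="rank:ndcg",  # LambdaMART: LambdaRank on top of boosted trees
        eval_metric="ndcg@10",
        n_estimators=100,
        max_depth=4,
        learning_rate=0.1,
    )
    model.fit(X, y, qid=qid)

    # Rank the ads of one session by predicted relevance:
    scores = model.predict(X[:3])
    print(np.argsort(-scores))  # indices in recommended order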

A few words on how the models are served. We exploit the capabilities of our search engine, Solr, namely the Solr LTR plugin. All the features are already defined in the ads collection; some of them are calculated in real time (e.g. text similarity), and some of them are refreshed periodically (e.g. CTR). So we only have to upload the model and adjust our query to Solr. A high-level diagram of the system looks like this:
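
As an illustration, a rerank query against the Solr LTR plugin could look like the sketch below; the model, collection, and parameter names are hypothetical:

    import requests

    params = {
        "q": "sports car",
        # Rerank the top 100 matches with the uploaded LTR model, passing
        # the raw query as an external feature for the text-similarity features:
        "rq": "{!ltr model=recommended_mart reRankDocs=100 efi.user_query='sports car'}",
        "fl": "id,title,score",
    }
    response = requests.get("http://localhost:8983/solr/ads/select", params=params)
    print(response.json()["response"]["docs"])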

By the way, if you use Solr and you want to convert an XGBoost model into a Solr MART model, here is the code we’re using.
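
For a rough idea of what such a conversion involves, here’s a simplified sketch (not the linked code; it glosses over details like XGBoost’s “<” vs Solr’s “<=” split semantics and the handling of missing values):

    import json

    def convert_node(node: dict) -> dict:
        """Map one node of an XGBoost JSON dump to Solr's tree format."""
        if "leaf" in node:  # terminal node
            return {"value": str(node["leaf"])}
        children = {c["nodeid"]: c for c in node["children"]}
        return {
            "feature": node["split"],
            "threshold": str(node["split_condition"]),
            "left": convert_node(children[node["yes"]]),  # value below threshold
            "right": convert_node(children[node["no"]]),
        }

    def xgboost_to_solr(booster, model_name, feature_names):
        """Build a Solr MultipleAdditiveTreesModel definition from a booster."""
        trees = [json.loads(t) for t in booster.get_dump(dump_format="json")]
        return {
            "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
            "name": model_name,
            "features": [{"name": f} for f in feature_names],
            "params": {
                "trees": [{"weight": "1.0", "root": convert_node(t)} for t in trees]
            },
        }

    # solr_model = xgboost_to_solr(model.get_booster(), "recommended_mart",
    #                              ["title_match", "ctr", "ad_age_days"])
    # ...then PUT it to Solr's /schema/model-store endpoint.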

Results

Via A/B-testing, we estimated the impact of the new sorting algorithm at the following levels:

  1. The number of conversions on the first page increased by 12%
  2. The overall number of purchases on the platform increased by 5%
  3. The number of unique buyers increased by 4%

Given that the project was aimed at the search journey, there was a concern that the solution might harm the seller experience, for instance by featuring only a small portion of sellers in the first positions. In practice, however, the number of unique contacted sellers went up!

In order to validate our offline evaluation methodology, we also tried to correlate offline metrics with online ones. Below you can see how the NDCG metric calculated on historical data correlates with the results of an A/B-test, namely the number of conversions in the first 10 positions of search results (normalized):

Relation between online (A/B) and offline (emulated) quality measures of different ranking algorithms

This picture makes us confident that the offline evaluation framework is quite reliable, i.e. that we can select the best candidates based on offline scores. This is essential for making well-informed decisions before launching new models.

Conclusion

We’ve shared with you

  • How ads are sorted on OLX and why it matters
  • How LTR became a powerful tool that brings buyers and sellers together on OLX
  • How properly designed offline evaluation can help you make sound decisions and iterate fast

There’s still a lot of room for development on this front: we’re planning to introduce more factors in order to enhance the quality of the models, e.g. the distance between a buyer and a seller, item price, etc. Apart from the modeling component, this project also involves a bunch of engineering challenges that we’ll have another chance to cover. So stay tuned for more stories, and feel free to reach out if you have any questions! :)
