A Two-Step Framework for Duplicate Detection

Alexey Grigorev
OLX Engineering
Jan 16, 2020


Image by analogicus from Pixabay

Duplicate content is a common issue on many online platforms, especially classifieds, where it requires special attention.

Duplicates cause many problems, including spam and fraud. In the first case, users create multiple copies of a listing in order to get more attention and increase their chances of selling an item. This spams the platform and makes it more difficult for other users to browse through the catalog of items. The second problem is more serious: fraudsters can create fake listings by copying existing content and trick honest users into giving away their money.

In this post, we propose a way to address this problem. In particular, this post covers:

  • User-generated content, the problems it causes, and how content moderation systems can solve these problems;
  • The two-step framework for approaching the duplicate detection problem;
  • Using simple heuristics for finding duplicate candidates at the first step of the framework;
  • Using machine learning to find actual duplicates as the second step.

Let’s start.

Content Moderation Systems

User-generated content is the content created by the users of a website. For example, posts on a social network (e.g. Facebook, Twitter), questions and answers on online forums or listings in online classifieds — all fall into the category of user-generated content. When uncontrolled, user-generated content may cause many problems.

For example, consider the case of online classifieds: a user wants to sell an item, so they create a listing. However, the listing doesn’t get any attention, so they decide to create the same listing multiple times. Or people may try to sell something that is illegal, inappropriate or violates the terms of service: drugs, guns and other prohibited content. In all these cases, such listings should not be allowed to go live.

Problems with user-generated content: duplicates, nudity, and weapons. Images by Benkai, Annalise Batista and WikimediaImages from Pixabay

When users create duplicates — multiple listings of the same item — it spams the platform and makes it difficult for other people to look for items on the platform. But apart from being annoying, duplicates can also be harmful.

Consider the following scenario: a seller wants to sell a cellphone, so they go to an online classifieds website and create a listing. In the meantime, a fraudster creates a copy of the listing, posing as a genuine seller. However, when contacted by buyers, the fraudster asks for a deposit and then disappears with the money.

To solve this problem, we should be more selective about what we allow people to post on the website. For that, we need to introduce a content moderation step: when a user submits content, it should first go through a moderation system and only then go live.

Overall, a content moderation system may look like the following:

When a user posts a listing, it first goes to the automatic moderation system. This system makes an automatic decision whether to accept the listing and let it go live, or reject it. In cases when the system is not sure, it can ask people: the moderators. In such cases, the listing goes to the moderation panel where moderators review the content of the listing and make the final decision.
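
To make the routing logic concrete, here is a minimal sketch, assuming the upstream models produce a single violation score; the thresholds and function name are purely illustrative, not values from any real system:

```python
def route(violation_score: float,
          accept_below: float = 0.2,
          reject_above: float = 0.9) -> str:
    """Decide what to do with a listing based on a model score.

    The score is assumed to come from an upstream model; the thresholds
    are illustrative, not values any real platform necessarily uses.
    """
    if violation_score >= reject_above:
        return "reject"          # confident enough to reject automatically
    if violation_score <= accept_below:
        return "accept"          # confident enough to let the listing go live
    return "manual_review"       # not sure: send to the moderation panel


print(route(0.05), route(0.5), route(0.95))  # accept manual_review reject
```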

The automatic moderation system may consist of many components, and typically some of these components use machine learning to make their decisions. A duplicate detection system may be one such component, and it is the component we will cover in this post.

The automatic moderation system consists of many modules; a duplicate detection system is one of them.

Let’s start with the theoretical foundation for building such a system: the duplicate detection framework.

Duplicate Detection Framework

One of the possible ways of approaching the duplicate detection problem is a simple framework with two steps:

  • Candidate selection step. This is the step where we try to find the most likely duplicate candidates.
  • Candidate scoring step. At this step, we take the candidates and use machine learning to select the listings that are real duplicates.

Let’s look at each step in more detail.

Candidate Selection Step

The first step in the duplicate detection framework is the candidate selection step. When we want to detect whether a listing has duplicates, we need to compare it with all the other listings in our database. Typically there are a lot of records there, so it’s simply not feasible to compare the listing with everything we have.

For example, imagine that our database contains only 1000 records. If we want to detect duplicates in this database, we’ll need to compare every record with all the others, resulting in almost half a million comparisons. Doing something like that for olx.in, olx.pl or olx.ua is simply impossible — there are many millions of active listings on these platforms!

To make it more efficient, we need to narrow down the set of listings for comparison to only the most likely candidate duplicates. There are many ways to narrow it down.

For that we can use:

  • Simple heuristics;
  • Information retrieval;
  • Approximate nearest neighbors (approximate KNN).

While information retrieval and approximate KNN each deserve a blog post of their own, the heuristics are quite simple: we use our domain knowledge and common sense to come up with the best set of candidates.

For example, consider a listing from olx.in:

A listing from olx.in

We can use the information about the listing to find candidate duplicates: for example, we can select other ads in the same category and location (city/district). Also, we can use the information about the seller and check the other listings from the same seller; or the listings posted from the same device or from the same IP address.
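
To illustrate, here is a minimal sketch of such heuristics using pandas; the column names (category, city, seller_id, device_id, ip) are assumptions about how a listings table might look, not the actual OLX schema:

```python
import pandas as pd

def candidate_duplicates(listing: pd.Series, listings: pd.DataFrame) -> pd.DataFrame:
    """Select likely duplicate candidates for one listing using simple heuristics."""
    others = listings[listings["listing_id"] != listing["listing_id"]]

    same_category_and_city = (
        (others["category"] == listing["category"])
        & (others["city"] == listing["city"])
    )
    same_seller = others["seller_id"] == listing["seller_id"]
    same_device_or_ip = (
        (others["device_id"] == listing["device_id"])
        | (others["ip"] == listing["ip"])
    )

    return others[same_category_and_city | same_seller | same_device_or_ip]


# Tiny example: listing 2 shares the category and city of listing 1, so it becomes a candidate
listings = pd.DataFrame([
    {"listing_id": 1, "category": "mobiles", "city": "Delhi", "seller_id": 10, "device_id": "a", "ip": "1.2.3.4"},
    {"listing_id": 2, "category": "mobiles", "city": "Delhi", "seller_id": 11, "device_id": "b", "ip": "5.6.7.8"},
    {"listing_id": 3, "category": "cars", "city": "Pune", "seller_id": 12, "device_id": "c", "ip": "9.9.9.9"},
])
print(candidate_duplicates(listings.iloc[0], listings))
```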

Once we identify the possible duplicate candidates, we can move to the next step: candidate scoring.

Candidate Scoring Step

At the candidate scoring step, we take all the candidate duplicates we identified previously, and score them to select the listings that are actual duplicates.

For that we can use machine learning: we can build a model that takes in a pair of listings and outputs the probability that the listings are duplicates. If the probability is above a certain threshold, we treat the pair as a real duplicate.

The model takes in a pair of listings and outputs the probability that the listings are duplicates

To be able to use machine learning we need data. We already have the candidates from the previous step, but we’re missing one important part: the labels. That is, for each pair we need to know if it’s a duplicate or not.

Typically, to get this information, we need to involve people in the process: we present each pair to our annotators and ask them to label it as “Duplicate” or “Not duplicate”.

To get labels, we ask people if a pair is a duplicate or not

If there’s already a content moderation tool, then we can use the moderation panel and ask our moderators to help us. Alternatively, it’s possible to use crowdsourcing platforms — like Amazon Mechanical Turk or Yandex Toloka — for getting this data.

Feature Engineering

Once we have collected the dataset, we can start using machine learning. First, we encode the target, so “Duplicate” becomes “1” and “Not duplicate” becomes “0”. The next step is feature engineering: we create features that capture the degree of similarity between the listings in a pair.

Feature engineering: for each pair, we create features that later we feed into a machine learning model

Once we have the features, we can train a model! What features can we use?

We can start with the simplest ones (see the sketch after this list), like

  • The absolute and relative difference in price between the listings;
  • The geographical distance between the two listings;
  • Whether the listings were published from the same IP;
  • Whether the listings are in the same category;
  • And so on.
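
A minimal sketch of how these pairwise features could be computed; the field names and the haversine helper are assumptions made for illustration:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def pair_features(a: dict, b: dict) -> dict:
    """Simple similarity features for a pair of listings (field names are assumed)."""
    price_diff = abs(a["price"] - b["price"])
    return {
        "abs_price_diff": price_diff,
        "rel_price_diff": price_diff / max(a["price"], b["price"], 1),
        "distance_km": haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
        "same_ip": int(a["ip"] == b["ip"]),
        "same_category": int(a["category"] == b["category"]),
    }

a = {"price": 15000, "lat": 28.61, "lon": 77.21, "ip": "1.2.3.4", "category": "mobiles"}
b = {"price": 14500, "lat": 28.63, "lon": 77.22, "ip": "1.2.3.4", "category": "mobiles"}
print(pair_features(a, b))
```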

Simple features like these are already enough to give a reasonably good baseline. When Avito (the largest Russian classifieds website) ran their Duplicate Ad Detection competition on Kaggle, the first baselines contained only simple features like the ones above.

The textual content is quite important when determining whether two listings are duplicates. That’s why we need to add features that capture the similarity between their titles or descriptions. For example,

  • The cosine similarity between titles (Bag of words);
  • The cosine similarity between descriptions (Bag of words);
  • The Word2Vec similarity between titles and between descriptions.

Bag of Words

“Bag of words” is the simplest approach that lets us encode any text as a vector. To do it, we first take all possible words that appear in our texts and create a large table. The columns of this table are the words, and the rows are the texts. If a text contains a word, the corresponding cell gets the number of times the word occurs in that text.

Bag of words: if a word is present in the title, we put “1” in the respective column and “0” otherwise

This gives us a simple way of encoding texts as vectors. Once we have vectors, we can measure how similar they are using the angle between them. This is exactly what cosine similarity gives us: it’s the cosine of the angle between two document vectors. It’s 0 if two documents are not similar at all (they have no words in common) and 1 if the two vectors point in exactly the same direction.

The angle between two vectors can tell us how similar these two vectors are
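
With scikit-learn, computing this title similarity could look roughly like the following sketch (the two example titles are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "iphone 7 32gb black, very good condition",
    "apple iphone 7 black 32 gb, good condition",
]

# Bag of words: each title becomes a vector of word counts
vectors = CountVectorizer().fit_transform(titles)

# Cosine of the angle between the two title vectors
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(round(similarity, 2))
```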

With the basic features and text similarities, we can already get quite a good model.
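
Putting the pieces together, a minimal training sketch might look like this. The pairs dataset here is synthetic random data standing in for the labelled pairs, and logistic regression is just one possible choice of classifier; gradient boosting or any other binary classifier would play the same role:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled dataset of candidate pairs;
# in practice these rows come from candidate selection plus the annotators' labels
rng = np.random.default_rng(1)
pairs_df = pd.DataFrame({
    "abs_price_diff": rng.uniform(0, 500, 200),
    "distance_km": rng.uniform(0, 50, 200),
    "same_ip": rng.integers(0, 2, 200),
    "title_cosine": rng.uniform(0, 1, 200),
    "is_duplicate": rng.integers(0, 2, 200),
})

feature_cols = ["abs_price_diff", "distance_km", "same_ip", "title_cosine"]
X_train, X_val, y_train, y_val = train_test_split(
    pairs_df[feature_cols], pairs_df["is_duplicate"], test_size=0.2, random_state=1)

# Logistic regression is just a simple baseline; any binary classifier works here
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability that each validation pair is a duplicate; flag pairs above a threshold
duplicate_proba = model.predict_proba(X_val)[:, 1]
flagged = duplicate_proba >= 0.5
print(flagged.sum(), "pairs flagged as duplicates")
```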

Image features

When it comes to online classifieds, we must also include images: this is one of the most important aspects of a listing. The easiest way of doing it is by using perceptual image hashes.
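
As a quick preview, here is a minimal sketch using the Pillow and imagehash libraries; the file names and the distance threshold of 10 are illustrative assumptions:

```python
from PIL import Image
import imagehash

# Perceptual hashes are robust to small changes such as resizing or re-compression
hash_a = imagehash.phash(Image.open("listing_a.jpg"))
hash_b = imagehash.phash(Image.open("listing_b.jpg"))

# Subtracting two hashes gives the Hamming distance between them;
# a small distance suggests the images are near-duplicates
if hash_a - hash_b <= 10:
    print("The images look like duplicates")
```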

We cover image hashes in detail in the next post of the series, where you’ll learn how perceptual hashes work and how to implement a system for detecting duplicates based on them.

Summary

  • Having a content moderation system is important for making sure the quality of the content on the platform meets high standards. A duplicate detection system is one of the many components of such a system.
  • To approach the problem of finding duplicates, use the two-step framework: first select duplicate candidates, then narrow them down to the actual duplicates.
  • In the first step, we find candidates using domain knowledge and simple heuristics.
  • In the second step, we use machine learning to find the actual duplicates among the candidates.
  • Simple features, such as the difference in price, the geographical distance, and whether the listings were published from the same IP or device, are already good enough for a start.
  • Adding text similarity features based on a bag of words will further improve model performance.

We hope you find this article useful and can now apply the knowledge to get started on building your own duplicate detection system.

This post is based on the materials from the talk “Duplicates… Duplicates everywhere”, presented at AI tonight meetup in Kiev and Berlin Machine Learning meetup (slides).

If you liked the story, please help to spread the word about it.
