Qualifying Image Quality — Part 1, Cropped Images

Akash Gupta
Published in OLX Engineering · 10 min read · Sep 3, 2018


OLX is a platform where buyers meet sellers. Sellers post advertisements for the items they want to sell, and buyers contact them if they find the ads interesting.

An advertisement from a seller carries content about the item on sale: its images, price, description and title.

The title and description tell more about the condition and variant of the product, and the price tells how good a deal it is, but it is the images that convey how attractive the item is and how much attention it deserves.

Thousands of ads are posted on OLX each day. Most of them have a cover image, which is usually the first thing a customer notices about the advertisement.

Images play an important role in classifieds as they influence whether a customer will go to the ad-detail page and whether he or she will want to interact with the seller.

Let’s say our average Joe posts an advertisement on OLX to sell his old car. He fixes a price and uploads some images of the car.

However, his images of the car look like this:

He waits for buyers to come and connect with him. However, they don’t, and the few buyers he does get offer much less than he expected. Our innocent Joe has no idea why there are no takers for his offering. Apparently, the buyers see the image, ignore the ad and don’t even try to connect with him.

A small internal study of the effect of various ad-related parameters on clicks found that:

Users are 20% less likely to go to the ad detail page if the cover image of the advertisement does not contain the item in full or has poor brightness/contrast.

Here are some examples of poorly cropped images:

Poorly cropped image examples

And here are some examples of non-cropped images:

Non-cropped image examples

In our quest to keep solving our customers’ problems, we started working on this problem and on providing feedback on image quality to our sellers. After all, more ad clicks mean better chances of selling and getting the best price for your item. A happy seller means a happy OLX.

Let’s Dive into the Solution

To tackle the first problem of detecting whether the full object is in the image or not, i.e. whether the image has been wrongly cropped, we need a smart solution able to identify the main object being pictured, as well as its precise location in the frame. Quite a lot of work has been done on object detection and localization, where multiple objects in an image can be identified. RCNN, Fast-RCNN and Faster-RCNN are bounding-box proposal networks which propose many bounding boxes in an image and try to classify the objects in the boxes. They learn to minimise the error so well that the boxes fit the objects precisely and the objects are correctly classified. Mask-RCNN takes this a step further by also trying to detect the boundaries around the objects.

We have used object detection models in combination with image processing algorithms and traditional machine learning models to get to the final output. We built the initial version of our algorithm around cars, as that is the most important category for us at OLX. Below, we go through a pictorial, step-by-step run of our algorithm on one of the more complicated examples.

We start with a colour image of a car.

Original image

We first pass the image through a Mask-RCNN model to get the approximate masks of all the objects present in the image.

Mask-RCNN is a state-of-the-art object localization model which localizes the objects in an image and also tries to form masks around those objects.

Underneath, it uses Convolutional Neural Networks to classify the objects and form the boundaries. We used a Mask-RCNN model pre-trained on the COCO dataset. Let’s see Mask-RCNN in action on our image.

Multiple Object Masks formed by Mask-RCNN algorithm
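The post walks through the pipeline without code, but as a rough illustration, here is how a COCO-pretrained Mask-RCNN can be run with torchvision to collect the car masks. The library choice, file name, class id and score threshold are assumptions for this sketch, not our production setup.

```python
# Illustrative sketch: car instance masks from a COCO-pretrained Mask R-CNN.
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

COCO_CAR_CLASS = 3  # 'car' id in the COCO label map used by torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("car.jpg").convert("RGB")
tensor = F.to_tensor(image)                      # (3, H, W), values in [0, 1]

with torch.no_grad():
    output = model([tensor])[0]                  # dict: boxes, labels, scores, masks

# Keep confident car detections and binarise their soft masks.
keep = (output["labels"] == COCO_CAR_CLASS) & (output["scores"] > 0.7)
car_masks = (output["masks"][keep, 0] > 0.5).cpu().numpy()   # (N, H, W) booleans
```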

One of the issues with Mask-RCNN is that the masks are not pixel-perfect, especially near the edges. Our use case requires finding the exact position/pixels of the car, so that is a big drawback. In this example, taking the masks from Mask-RCNN alone would have led us to believe that the car is well within the image boundaries, which is not quite true.

Next, we try to find the exact boundaries of the objects. For this, we use Deeplab, a state-of-the-art image segmentation model which produces a pixel-level classification. It classifies each pixel into one of the object classes using information from the surrounding pixels as well as the overall image. The result is a pixel-perfect segmentation map of the image. Again, we used a network pre-trained on the COCO dataset. Let’s see Deeplab in action on our image.

Image segmentation Mask produced by using Deeplab
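Again purely as an illustration (not the exact setup we used), a pre-trained DeepLabv3 from torchvision yields a per-pixel class map from which the car pixels can be pulled out. The class index below assumes the Pascal-VOC-style label set used by that model.

```python
# Illustrative sketch: pixel-level car mask from a pre-trained DeepLabv3.
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

VOC_CAR_CLASS = 7  # 'car' index in the Pascal VOC label set

model = torchvision.models.segmentation.deeplabv3_resnet101(pretrained=True)
model.eval()

image = Image.open("car.jpg").convert("RGB")
tensor = F.to_tensor(image)
tensor = F.normalize(tensor, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

with torch.no_grad():
    logits = model(tensor.unsqueeze(0))["out"][0]        # (num_classes, H, W)

segmentation = logits.argmax(dim=0)                       # per-pixel class ids
deeplab_car_mask = (segmentation == VOC_CAR_CLASS).cpu().numpy()  # (H, W) booleans
```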

The only drawback of using this model is that it does not differentiate between multiple objects of the same class, i.e. two cars standing next to each other would be covered by the same mask. So when multiple cars are present in a frame, we do not directly get the mask of the main car being photographed.

Marrying Mask-RCNN & Deeplab

To harness the best parts of both models and to overcome the drawbacks, we combined these models.

The Mask-RCNN model is used to calculate the areas of the multiple cars in the image; the car with the largest area is labelled as the main car, while the others are labelled as background cars.

Main Car (from Mask-RCNN)
Background Car(s) (from Mask-RCNN)

The masks of the background cars, as obtained from the Mask-RCNN model, are added together to form the complete background-car mask.

The background-car mask is then subtracted from the image segmentation mask obtained from the Deeplab model.

Mask after subtraction of Background Cars Mask from Deeplab Mask
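A minimal sketch of this combination step, reusing the `car_masks` and `deeplab_car_mask` arrays from the sketches above (the variable names are mine):

```python
# Pick the largest Mask R-CNN car as the main car, union the rest into a
# background mask, and subtract it from the DeepLab mask.
import numpy as np

areas = car_masks.reshape(len(car_masks), -1).sum(axis=1)
main_idx = int(np.argmax(areas))                  # car with the largest area

background_mask = np.zeros_like(deeplab_car_mask)
for i, mask in enumerate(car_masks):
    if i != main_idx:
        background_mask |= mask                   # union of all background cars

main_car_mask = deeplab_car_mask & ~background_mask
```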

Now we are left with an almost perfect mask of the main car, with some thin connected tails and disconnected islands left over from the subtraction.

To remove these, we apply a Gaussian blur to the mask, which averages each pixel with its neighbours and helps disconnect the fine tails.

Mask after Gaussian Blur

Then we find the connected components in the mask and retain the largest one, leaving out the smaller ones.

The 2 different connected components in the mask

The final output is the mask of the car, free of deformations and capturing the boundaries correctly.

Final Mask
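A sketch of this clean-up step with OpenCV; the blur kernel size and threshold below are assumptions:

```python
# Blur the mask to break thin tails, then keep only the largest connected
# component (assumes at least one foreground component is present).
import cv2
import numpy as np

mask = main_car_mask.astype(np.uint8) * 255
blurred = cv2.GaussianBlur(mask, (15, 15), 0)
binary = (blurred > 127).astype(np.uint8)

num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))   # label 0 is background
final_mask = (labels == largest).astype(np.uint8)
```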

Now that we are done calculating the mask of the car, we need to use it to identify whether the car is cropped or not. To do this, we use machine learning to avoid having rules such as:

If distance from left edge > 10 pixels => Good Car, else Bad Car

Such rules have maintenance issues, and imposing a hard limit can result in erroneous predictions when there are small inaccuracies in the mask produced by the previous step.

So we prepared a dataset of hand-labelled cropped and un-cropped images and extracted features to tell the cropped images apart from the un-cropped ones. We don’t need a big dataset here, as we extract only a few features whose parameters need to be learned.

The features we extract are:

  • Distance of the rightmost mask pixel from the right edge, and the analogous distances from the left, top and bottom edges.
  • Sum of the mask pixels lying on each image edge.

Features: distance from edges (top, right and bottom) and sum of pixels (left)
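A hypothetical sketch of extracting these features from the final mask (the feature names are illustrative, not necessarily the ones used in production):

```python
# Distances of the mask from each image edge plus the number of mask pixels
# touching each edge.
import numpy as np

def crop_features(mask):
    """mask: 2-D binary array with the main car's pixels set to 1."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    return {
        "dist_left":   int(xs.min()),
        "dist_right":  int(w - 1 - xs.max()),
        "dist_top":    int(ys.min()),
        "dist_bottom": int(h - 1 - ys.max()),
        "sum_left":    int(mask[:, 0].sum()),
        "sum_right":   int(mask[:, -1].sum()),
        "sum_top":     int(mask[0, :].sum()),
        "sum_bottom":  int(mask[-1, :].sum()),
    }

features = crop_features(final_mask)
```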

The output variable is whether the car is cropped or not.

We trained a Gradient Boosted Decision Tree model to learn the differences between cropped and non-cropped images in this feature space. Gradient boosting is an ensemble method which builds multiple decision trees iteratively, each improving on the errors of the previous iterations.

We prepared a dataset of 2400 images for training, 600 for validation and 400 for testing. We trained the model on the training dataset and used the validation dataset for tuning the hyperparameters. Finally, the test dataset was used to evaluate the performance of the algorithm. The performance was judged on the basis of overall accuracy, i.e. the ratio of the number of correct predictions to the total number of predictions, as well as the AUC, which represents how well the predictions separate the positive and negative classes. On the test dataset the model achieved 95% accuracy with an AUC of 98.3%, implying good class separation.
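A sketch of how such a classifier can be trained and evaluated with scikit-learn. The hyperparameters are assumptions, and `X_train`, `y_train`, `X_test`, `y_test` stand for the feature matrices and labels built from the hand-labelled dataset.

```python
# Gradient boosted trees on the extracted crop features; y = 1 means non-cropped.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]          # P(non-cropped)
preds = (probs >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_test, preds))
print("AUC:", roc_auc_score(y_test, probs))
```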

The classifier outputs the probability of the image being non-cropped; a probability below 0.5 signifies a cropped image.

When we run our example image through the model, we get a score of 0.1, which signifies a cropped image. We can also look at which features contributed to this decision using the eli5 library.

Feature Importances

The feature contributions tell us that the distance from the left edge, as well as the crop on the left edge, is what leads this image to be classified as cropped. This way we can be more specific in our communication to customers, telling them where and by how much they got it wrong.
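A hypothetical sketch of producing such a per-image explanation with eli5, using the classifier and feature dictionary from the earlier sketches:

```python
# Ask eli5 which features pushed this example towards the 'cropped' decision.
import numpy as np
import eli5

feature_names = ["dist_left", "dist_right", "dist_top", "dist_bottom",
                 "sum_left", "sum_right", "sum_top", "sum_bottom"]

x = np.array([features[name] for name in feature_names])
explanation = eli5.explain_prediction(clf, x, feature_names=feature_names)
print(eli5.format_as_text(explanation))
```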

The application requests the probability and the feature importances and then communicates the results to the customer.

We didn’t stop at building this algorithm, which fits our use case perfectly.

We wondered: what if a single network could be trained to tell whether an image is cropped or not?

A well-known way to achieve good results on custom classification tasks is to take a pre-trained network architecture and fine-tune its last few layers. This way the model does not need to be trained from scratch and can retain the edge detection and basic feature extraction already perfected in its initial layers. Only the last few layers, which specialise in the task at hand, are retrained. We ran a small experiment in which we trained a network taking the original image as input and outputting the predicted class: cropped or non-cropped. We used an Inception architecture pre-trained on the ImageNet dataset as the base and fine-tuned the last layers for our target. However, the performance was not satisfactory on this size of dataset. The maximum accuracy achieved with this architecture was 89%, probably because the network overfits and learns other things rather than the crops at the edges.
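A rough sketch of this transfer-learning experiment (an assumed setup, not the exact one we used): an ImageNet-pretrained Inception v3 with its backbone frozen and only the classification head replaced and retrained for the cropped/non-cropped task.

```python
# Fine-tune only the last layer(s) of a pre-trained Inception v3.
import torch.nn as nn
import torchvision

model = torchvision.models.inception_v3(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                   # freeze the pre-trained layers

# Replace the classification heads with 2-way outputs (cropped / non-cropped);
# only these new layers are trained.
model.fc = nn.Linear(model.fc.in_features, 2)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 2)
```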

Another thing we tried was using the RGB image + B&W mask as a 4-channel input to an AlexNet architecture trained from scratch. However, the maximum accuracy achieved was 91%. Perhaps further work using more complex networks and more data could help us achieve results comparable to the original algorithm. Another challenge with such an arrangement is that when we want to extend to more categories, we will need to label more data and train more networks for those categories, whereas in the original algorithm we can use pre-trained models as described.
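And a sketch of the 4-channel variant (again an assumed setup): an AlexNet whose first convolution is widened so the B&W mask can be stacked with the RGB channels, trained from scratch.

```python
# AlexNet with a 4-channel (RGB + mask) input, trained from scratch.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.alexnet(num_classes=2)            # no pre-trained weights
model.features[0] = nn.Conv2d(4, 64, kernel_size=11, stride=4, padding=2)

# Each input is the three colour channels stacked with the mask: (N, 4, 224, 224).
dummy = torch.randn(1, 4, 224, 224)
logits = model(dummy)                                         # (1, 2) class scores
```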

Summary

We wanted to enhance the seller experience by getting sellers more buyers, and for that we built a crop detection algorithm which tells cropped images apart from non-cropped ones. The algorithm uses a combination of deep-learning-based object detection models, image processing algorithms and machine learning models on the extracted feature set. The object detection models find the masks for the images, followed by refinement with the image processing algorithms. Finally, features are extracted from the masks and used by a machine-learned classifier to decide whether the image is cropped or not. The overall classification accuracy achieved is 95% with an AUC of 98.3%, which indicates good class separation by the predicted probabilities. The algorithm outputs the predicted probability of the input image being cropped, along with the affected edges and the amount of cropping, which the client uses to inform the user about a possible reason for a lower number of contacts.

I would like to thank the team - Mohit Sharma, Vladan Radosavljevic and Udit Srivastava.
