Qualifying Image Quality — Part 2

Mohit Sharma
Published in OLX Engineering
Jan 2, 2019 · 7 min read


If you were buying this car, which one of these ads would you click on?

In the first part of our series of blogs, we talked about the importance of images on our platform, the impact of image cropping on quality and how we quantified it.

In this second part, we measure quality in terms of other aspects of the image, such as brightness, contrast and sharpness.

We iterated through a number of models to compute an image quality score. Our final and most scalable solution was a deep convolutional network that scores images based on quality. We also made sure the network was light enough to keep computation cost low, and finally deployed the model on-device using TensorFlow Lite.

To recap how our platform works: people post advertisements for things they want to sell, and other people contact them if they find the ads interesting.

One of the first and most prominent things a buyer notices about an ad is the image of the item, so the quality of that image determines how many buyers click on the ad or contact the seller.

Now when a new buyer lands on the platform and is welcomed by images like these

Harsh sun glare and poor lighting

the buyer might not want to explore these ads any further. However, if the buyer sees images like these

Good consistent light across the picture

the buyer is much more likely to click on the ad to find out more.

This is a problem for both buyers and sellers since sellers are missing out on potential buyers and buyers are missing out on good deals only because the pictures were poorly taken.

This makes it extremely important for us to quantify image quality, so we can identify poor-quality pictures and then either reach out to sellers asking them to retake the picture, or possibly fix it ourselves.

The challenge in assessing image quality is that judgement of aesthetics tends to be very subjective in nature. We tried to tackle it in the following ways:

Attempt 1: NIMA

We looked at pre-existing models that try to address this problem, such as Google's Neural Image Assessment (NIMA) model. The model consists of a deep CNN that is trained to predict which images a typical user would rate as looking good (technically) or attractive (aesthetically).

On applying the model directly to our data, we found that for most of our images the scores lay in a very narrow range, so it wasn't of much use to us directly.

We believe this is due to the difference in the nature of the training data for NIMA and our images.

Attempt 2: Feature Engineering and Modelling

To counter this, we tried to identify certain attributes of images that intuitively relate to image quality and created features to capture those attributes.

We started off with some obvious features such as brightness and contrast of the image. These metrics give us a macro idea of the picture but aren’t very useful just by themselves.

We needed to determine if the brightness and contrast were consistent across the image or if there were any local pockets of extreme brightness/darkness. This is how we solved it:

We take an image and break it down into foreground (the object) and background.

Original Picture
Decomposed into background and foreground

We use the Mask R-CNN model mentioned in the first blog post to achieve this. Another important aspect of good pictures is spatial consistency of light, i.e. no corners or parts of the image that are extremely dark or overly bright.

To do this, we further broke both the foreground and background images down into four quadrants and compared the brightness and contrast metrics of object and background in each quadrant.

Large difference between background and object characteristics

When there is a large difference between object brightness/contrast and background brightness/contrast, it tells us the object stands out from the background and is therefore distinctly visible. We added another feature comparing the brightness of the darkest quadrant with that of the brightest quadrant.

We included other metrics such as the number of completely white pixels and completely black pixels, since they are good indicators of overexposure and underexposure. The NIMA score for each image was also used as a feature.
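To make this concrete, here is a simplified sketch of the kind of features described above, computed on a single greyscale image array. In practice we computed them separately for the Mask R-CNN foreground and background and added the NIMA score; the thresholds and names below are illustrative, not the production values.

```python
import numpy as np

def quality_features(gray):
    """Hand-crafted quality features for a greyscale image (pixel values 0-255)."""
    feats = {"brightness": float(gray.mean()), "contrast": float(gray.std())}

    # Split the image into four quadrants and measure brightness in each
    h, w = gray.shape
    quads = [gray[:h // 2, :w // 2], gray[:h // 2, w // 2:],
             gray[h // 2:, :w // 2], gray[h // 2:, w // 2:]]
    q_brightness = [q.mean() for q in quads]

    # Spatial consistency of light: darkest quadrant vs brightest quadrant
    feats["quadrant_brightness_range"] = float(max(q_brightness) - min(q_brightness))

    # Saturated pixels hint at harsh glare or heavy underexposure
    feats["white_pixel_ratio"] = float(np.mean(gray >= 250))
    feats["dark_pixel_ratio"] = float(np.mean(gray <= 5))
    return feats
```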

We hand-labelled more than 1,000 images as good or bad to create our training data and trained a gradient boosted decision tree model to learn the differences between good-quality and bad-quality images in this feature space.
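A minimal sketch of this training step, assuming scikit-learn's gradient boosting implementation and placeholder arrays X (one row of features per image) and y (the hand labels):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# X: one row of hand-crafted features per image; y: 1 = good, 0 = bad (hand labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3)
clf.fit(X_train, y_train)

print("Test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```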

The trained model had an AUC score of 0.9 on test data.

But since this solution might not transfer well to other object categories with different characteristics (for example, smaller objects shot under consistent indoor lighting), we then set out to build a more scalable solution that does not depend on engineered features.

Attempt 3: Deep Learning Model

Transfer learning is the area of machine learning in which a model developed for one task is reused as the starting point for a model on another task.

In this instance, we used the MobileNet architecture pre-trained on the popular ImageNet dataset for object classification tasks.

MobileNet is lighter than other conventional CNNs, which makes it ideal for deployment on platforms with limited compute capacity such as smartphones. We chose it since on-device deployment was our ultimate objective.

We needed to build a Ground Truth Dataset for training and leveraged the power of Amazon Mechanical Turk to crowdsource the creation of our training data.

The labelling process involved getting each image labelled as good or bad by 5 different observers. Since each label was either 0 or 1, an image could have a quality score ranging from 0 to 5 (the sum of its five labels).

We had over 5,000 images labelled through Mechanical Turk, and these formed our training data.
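Aggregating the crowd-sourced labels into a per-image score is then straightforward; here is a sketch assuming the raw labels arrive as one row per (image, observer) pair in a CSV (the file name and column names are made up):

```python
import pandas as pd

# Assumed layout: one row per (image_id, observer) with a binary 'good' label
labels = pd.read_csv("mturk_labels.csv")

# Five binary judgements per image sum to a quality score between 0 and 5
scores = labels.groupby("image_id")["good"].sum().rename("quality_score")
```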

So now we have the images and the labels and we need a deep network trained on the images to predict the labels. Convolutional neural networks have shown great results in image classification tasks.

Model Details

We used the MobileNet V2 architecture pre-trained on the ImageNet dataset as the starting point of our model. It is a deep neural network with many convolutional layers followed by a fully connected layer at the end.

It uses depthwise separable convolutions instead of standard convolutions. A standard convolution filters the input with its kernels and combines the filtered channels into a new representation in a single step. Splitting the filtering and combination into two steps, a depthwise convolution followed by a pointwise (1×1) convolution, makes the operation much cheaper.
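To see how much cheaper, here is a small comparison using Keras layers with arbitrary channel sizes (the numbers are for illustration and are not MobileNet's actual layer dimensions):

```python
import tensorflow as tf

# Standard 3x3 convolution: filters and combines channels in one step
standard = tf.keras.layers.Conv2D(128, 3, padding="same")

# Depthwise separable convolution: per-channel 3x3 filtering,
# then a 1x1 pointwise convolution to combine channels
separable = tf.keras.layers.SeparableConv2D(128, 3, padding="same")

x = tf.zeros((1, 32, 32, 64))  # dummy feature map with 64 input channels
standard(x), separable(x)      # build both layers

print(standard.count_params())   # 3*3*64*128 + 128      = 73,856
print(separable.count_params())  # 3*3*64 + 64*128 + 128 =  8,896
```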

Since our dataset is similar to the ImageNet dataset in terms of the classes it contains (cars, trucks) and only our labels are different, we do not need to retrain the whole network. We only need to remove the last fully connected layers and add our own, which are trained while keeping all the convolutional layers fixed.

We train the network to predict the score of an image (0–5).
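A minimal Keras sketch of this setup (the input size, head width and optimiser here are assumptions, not our exact configuration):

```python
import tensorflow as tf

# MobileNetV2 pre-trained on ImageNet, without its classification head
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False,
    weights="imagenet", pooling="avg")
base.trainable = False  # keep the convolutional layers fixed

# Our own fully connected head, trained to regress the 0-5 quality score
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
```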

The model had a root mean squared error (RMSE) of 1.4 on the training set and 1.9 on the validation set. Some of the scores given by the model:

Left Picture: 1.9, Right Picture: 4.5

Deployment

We have 2 options to deploy the model:

  • Server Deployment
  • In-App Packaging

We chose in-app packaging over server deployment so that we can make predictions in real time while a user is choosing images to post. It also avoids the problems of scaling and maintaining servers as we expand to more users.

We used the TensorFlow Lite (TFLite) platform to deploy the model in-app. The model's weights were quantised (reduced from float to int8 format) to shrink the model file. After quantising the weights and optimising the size of the network, we have a model file of less than 450 KB, which is light enough to ship with the app. The TFLite libraries themselves require less than 1 MB, which also helps keep the app lightweight.
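For illustration, this is roughly what the conversion looks like with the current TF 2.x TFLite converter, where `model` is the trained Keras model from the previous section (the original deployment may have used an older converter API):

```python
import tensorflow as tf

# Convert the trained Keras model to a TensorFlow Lite flatbuffer,
# quantising the weights from float to 8-bit to shrink the file
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("image_quality.tflite", "wb") as f:
    f.write(tflite_model)
```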

While a seller is selecting pictures of their product to upload, stars are shown over the images indicating their quality according to the model.

Quality score being returned on selecting an image

To summarise: we wanted to measure image quality, so we first tried NIMA, then handcrafted features, and finally transfer learning on a pre-trained MobileNet. We then deployed the model in our Android app using TensorFlow Lite.

I would like to thank the team — Akash Gupta, Vladan Radosavljevic, Udit Srivastava and Nicolas Quartieri
