Indix is a Data-as-a-Service company building the world’s largest Cloud Catalog, with information on more than 1.5 billion product offers. While marching toward this bold mission, we realized that our process for measuring data quality was ad hoc, neither efficient nor consistent, so we wanted a more concrete way to measure quality. Our goal was to have the best-quality data, and the only way to improve it was to build a feedback loop and measure quality constantly.
“If you can’t measure, you can’t improve” was the motto behind the project.
We homed in on two different aspects for our project:
Initially, we had ad hoc and qualitative parameters like coverage, correctness, etc. to measure data value.
Instead of inventing a new metric, we decided to use precision and recall, since these are well-known measures that apply across varied applications. Precision is a measure of correctness, and recall is a measure of our ability to extract product content correctly from the sources we visit.
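As an illustration (a minimal sketch, not our actual pipeline), precision and recall can be computed directly from spot-check counts: how many extracted values a human verifier confirmed as correct, how many were wrong, and how many true values our extraction missed. The counts below are made-up numbers for demonstration.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from human spot-check counts.

    precision = correct extractions / all extractions
    recall    = correct extractions / all values that should have been extracted
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical example: 90 titles extracted correctly,
# 10 extracted incorrectly, 20 present on the page but missed.
p, r = precision_recall(90, 10, 20)
print(round(p, 3), round(r, 3))  # 0.9 0.818
```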
In a corpus of over 1.5 billion products, verifying correctness of each and every record is not possible. To make sure we take the best sample set to measure the quality and get a uniform distribution across the 2,000 plus stores we were measuring, we selected the sample dataset based on the following parameters:
Using the above three parameters, we obtained a sample dataset that could measure the data quality of the entire corpus in the most efficient way.
We built a new spot-checking tool similar to Amazon’s Mechanical Turk: it allows users to submit jobs requiring human intelligence. The key difference from Mechanical Turk was that jobs were performed by individuals selected and trained by Indix at multiple geographic locations, rather than by crowdsourced workers.
Below is the screenshot of the tool we used:
The results from the spot-checking tool are analyzed, and based on business requirements, the fields with low precision/recall values are taken up and fixed. Below is a screenshot of the spot-checking results, broken down by field:
We constantly measured the data we wanted to improve. For example, if the precision of the product title field fell below 75%, we focused on fixing titles and then measured again to check for improvement. We created various dashboards to keep track of the data quality of the 2,000+ sites in the Indix Cloud Catalog.
Below is the dashboard giving the overall precision/recall numbers, categorized by the field.
Here is another dashboard where you can drill down on the performance of each individual site.
This Data Quality (DQ) measurement was done on a regular basis, each time using the same original sample dataset. This ensured that precision/recall numbers were always benchmarked against the same baseline.
In addition to the above data quality metrics at a store level, we created field-level rules to validate and pick records of high quality to use in the static score computation.
We focused on the following five fields:
Field-specific rules were applied. For example, ImageURL should not be a malformed URL: its scheme should be “http” or “https”, and it should be of one of the following types: TIF/TIFF, JPG/JPEG, BMP, GIF, PNG.
Along with the field-specific rules, generic rules were also applied to the five fields above: for example, a string should not be “NA” or “Null”, consist only of special characters, or contain repeating tokens.
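The two kinds of rules can be sketched as simple validators. This is an illustrative implementation based only on the rules described above; function names and the exact rejection logic are our assumptions, not Indix’s production code.

```python
import re
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_EXTENSIONS = {"tif", "tiff", "jpg", "jpeg", "bmp", "gif", "png"}

def valid_image_url(url):
    """Field-specific rule for ImageURL: a well-formed http(s) URL
    pointing at one of the allowed image types."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
        return False
    if "." not in parsed.path:
        return False
    extension = parsed.path.rsplit(".", 1)[-1].lower()
    return extension in ALLOWED_EXTENSIONS

def valid_string_field(value):
    """Generic rules: reject 'NA'/'Null' placeholders, strings made up
    of only special characters, and strings of one repeated token."""
    text = value.strip()
    if text.upper() in {"", "NA", "NULL"}:
        return False
    if not re.search(r"[A-Za-z0-9]", text):
        return False  # only special characters
    tokens = text.split()
    if len(tokens) > 1 and len(set(tokens)) == 1:
        return False  # repeating tokens
    return True
```

Records passing all applicable rules would then qualify for the static score computation.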
This data quality effort helped us develop benchmarks, and for the first time, the engineering teams had a sense of the quality of the data, which they use to constantly improve their algorithms.
Impact: precision/recall of fields improved by 5-10%.
Earlier, the field team reported qualitative metrics based on anecdotal evidence. After we built quantitative data quality metrics, everyone on the field team could back their statements with real numbers and approach prospects with more confidence.
Also published on Medium.