The Data Quality Initiative at Indix

What Is the Data Quality Initiative?

Indix is a Data-as-a-Service company building the world’s largest Cloud Catalog, with information on more than 1.5 billion product offers. While marching toward that bold mission, we realized that our process for measuring data quality was ad hoc: it was neither efficient nor consistent. We wanted a more concrete way to measure quality. Our goal was to have the best-quality data, and the only way to improve it was to have a feedback loop and measure quality constantly.

“If you can’t measure, you can’t improve” was the motto behind the project.

Why Was it Needed?

  1. Data Quality Indicators (DQIs) quantify the quality (or the correctness) of data and by extension the algorithms that produce the data.
  2. DQIs allow us to get an estimate for coverage and quality for every aspect/facet of our data corpus.
  3. The Indix engineering teams can use it as a benchmark to constantly improve their algorithms.

What Aspects Did We Focus On?

We homed in on two different aspects for our project:

  1. Data Quality Metrics (at the source level): These allow us to measure the quality of our data at the source level. For example, www.nike.com has titles at 85% precision.
  2. Data Anomaly Checks (at the field level): These help us detect anomalous data at the record and aggregate levels. For example, a UPC should be 12 digits long, so UPCs with fewer than 12 digits are rejected. Negative prices are also rejected, as a product cannot have a negative price; a minimal sketch of such checks follows this list.
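
Here is a minimal sketch, in Python, of the kind of record-level checks described above. It is illustrative only, not Indix’s actual pipeline code, and the field names are hypothetical.

```python
def is_valid_upc(upc: str) -> bool:
    """A UPC must be exactly 12 digits."""
    return upc is not None and upc.isdigit() and len(upc) == 12

def is_valid_price(price: float) -> bool:
    """A product cannot have a negative price."""
    return price is not None and price >= 0

# Example: reject records that fail either check.
record = {"upc": "01234567890", "price": -4.99}  # 11-digit UPC, negative price
if not (is_valid_upc(record["upc"]) and is_valid_price(record["price"])):
    print("rejected:", record)
```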

What We Did – Coming Up With the Right Metrics

Initially, we used ad hoc, qualitative parameters such as coverage and correctness to measure the value of our data.

Instead of inventing a new metric, we decided to use Precision and Recall, since these are well-known measures that apply across a wide range of applications. Precision is a measure of correctness, and Recall is a measure of our ability to extract product content correctly from the sources we visit.

[Figure: illustration of precision and recall. Image credit: Walber – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=36926283]
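
For reference, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are the counts of true positives, false positives, and false negatives observed during spot checks. A small worked sketch (the tallies below are made up for illustration):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of the values we extracted that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of the values present on the source that we extracted correctly."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical spot-check tallies for the title field of one store:
# 85 titles extracted correctly, 15 extracted incorrectly, 5 missed entirely.
print(precision(tp=85, fp=15))  # 0.85 -> "titles at 85% precision"
print(recall(tp=85, fn=5))      # ~0.94
```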

Creating the Best Sample Set for Spot-Checking

In a corpus of over 1.5 billion products, verifying the correctness of each and every record is not possible. To make sure we took the best sample set to measure quality, with a uniform distribution across the 2,000-plus stores we were measuring, we selected the sample dataset based on the following parameters:

  • Category-based sampling
  • Products with the latest prices
  • Sample size from each site proportionate to the size of the site.

Using the above three parameters, we were able to build a sample dataset that measured the data quality of the entire corpus as efficiently as possible.
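
The sketch below illustrates the sampling idea under those three constraints. The record fields (product_id, price_timestamp, category) and the helper names are assumptions for illustration, not the actual Indix sampler.

```python
import random

def sample_store(records, n):
    """Pick n records from one store, spread across categories and
    restricted to products with the latest prices."""
    # Keep only the most recent price observation per product.
    latest = {}
    for r in records:
        pid = r["product_id"]
        if pid not in latest or r["price_timestamp"] > latest[pid]["price_timestamp"]:
            latest[pid] = r
    # Group by category, then draw roughly equal numbers from each category.
    by_category = {}
    for r in latest.values():
        by_category.setdefault(r["category"], []).append(r)
    per_category = max(1, n // max(1, len(by_category)))
    sample = []
    for recs in by_category.values():
        sample.extend(random.sample(recs, min(per_category, len(recs))))
    return sample[:n]

def sample_corpus(stores, total_budget):
    """Allocate the spot-check budget to each store proportionate to its size."""
    corpus_size = sum(len(recs) for recs in stores.values())
    return {
        name: sample_store(recs, max(1, total_budget * len(recs) // corpus_size))
        for name, recs in stores.items()
    }
```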

Creating a New Tool for Spot-Checking

We built a new tool for spot-checking, similar to Amazon’s Mechanical Turk, that allows users to submit jobs requiring human intelligence.

The key difference from Mechanical Turk was that jobs were performed by individuals selected and trained by Indix across multiple geographic locations, rather than by crowdsourced workers.
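
Conceptually, each job pairs an extracted value with its source page so a trained reviewer can mark it correct, incorrect, or missing. The real tool’s data model isn’t described here, so the schema below (field names, verdict labels, example values) is purely a hypothetical sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpotCheckJob:
    """One human-intelligence task: verify a single extracted field value."""
    store: str                       # e.g. "www.nike.com"
    product_url: str                 # page the reviewer opens to verify the value
    field: str                       # e.g. "title", "upc", "brand"
    extracted_value: Optional[str]   # what the pipeline extracted (None if it missed)
    verdict: Optional[str] = None    # reviewer sets "correct", "incorrect", or "missing"

# A reviewer completing one (made-up) job:
job = SpotCheckJob(
    store="www.nike.com",
    product_url="https://www.nike.com/t/example-product",
    field="title",
    extracted_value="Nike Air Zoom Pegasus",
)
job.verdict = "correct"
```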

Below is a screenshot of the tool we used:

Analyzing the Spot-Checking Test Results

The results from the spot-checking tool are analyzed, and based on the business requirements, fields with low precision/recall values are prioritized and fixed. Below is a screenshot of the spot-checking results, broken down by field:
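
As a sketch of the analysis step, completed verdicts can be rolled up into per-field precision/recall; the verdict labels match the hypothetical job schema above.

```python
from collections import defaultdict

def field_metrics(jobs):
    """Aggregate spot-check verdicts into precision/recall per field.
    'correct'   -> true positive  (extracted and right)
    'incorrect' -> false positive (extracted but wrong)
    'missing'   -> false negative (present on the page but not extracted)"""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for job in jobs:
        c = counts[job.field]
        if job.verdict == "correct":
            c["tp"] += 1
        elif job.verdict == "incorrect":
            c["fp"] += 1
        elif job.verdict == "missing":
            c["fn"] += 1
    return {
        field: {
            "precision": c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0,
            "recall": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0,
        }
        for field, c in counts.items()
    }
```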

Constant Measurement and Feedback Loop

We constantly measured the data we wanted to improve. For example, if the precision of the product title field was below 75%, we focused on fixing the titles and then measured again to check for improvement. Various dashboards were created to keep track of the data quality of the 2,000-plus sites in the Indix Cloud Catalog.
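
Flagging what to work on next is then a simple threshold check over the per-field numbers; the 75% figure below comes from the title example above, and actual targets varied with business requirements.

```python
def fields_to_fix(metrics, threshold=0.75):
    """Return the fields whose precision fell below the target."""
    return [field for field, m in metrics.items() if m["precision"] < threshold]
```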

Below is the dashboard giving the overall precision/recall numbers, categorized by field.

Here is another dashboard where you can drill down on the performance of each individual site.

This Data Quality (DQ) measurement was done on a regular basis and each time the same original sample dataset was used. This helped to ensure that the precision/recall numbers were being benchmarked against the original dataset.

Data Anomaly Checks (Field-Level Rules)

In addition to the above data quality metrics at a store level, we created field-level rules to validate and pick records of high quality to use in the static score computation.

We focused on the following five fields:

  • UPC
  • Brand name (a derived attribute, produced by our Brand Extractor)
  • ImageURL
  • Category path (a derived attribute, produced by our Classifier and compared against the breadcrumb)
  • Title

Rules specific to each field were applied. For example, ImageURL should not be a malformed URL: its scheme should be http or https, and it should point to an image of one of the following types: TIF/TIFF, JPG/JPEG, BMP, GIF, PNG.

Along with the field-specific rules, generic rules were also applied to the five fields above: for example, a string should not be “NA” or “Null”, should not consist only of special characters, and should not contain repeating tokens.
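
The sketch below shows what such field-level rules can look like. The exact rule set and the “repeating tokens” definition (treated here as adjacent duplicate tokens) are assumptions; treat this as illustrative rather than the Indix implementation.

```python
import re
from urllib.parse import urlparse

ALLOWED_IMAGE_EXTENSIONS = (".tif", ".tiff", ".jpg", ".jpeg", ".bmp", ".gif", ".png")

def valid_image_url(url: str) -> bool:
    """ImageURL must be well formed, use http/https, and point to an allowed image type."""
    parsed = urlparse(url or "")
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(ALLOWED_IMAGE_EXTENSIONS)
    )

def passes_generic_string_rules(value: str) -> bool:
    """Generic rules applied to all five fields: not NA/Null, not only special
    characters, and no repeating (adjacent duplicate) tokens."""
    if not value or value.strip().lower() in ("na", "null"):
        return False
    if not re.search(r"[A-Za-z0-9]", value):              # only special characters
        return False
    tokens = value.split()
    if any(a == b for a, b in zip(tokens, tokens[1:])):   # repeating tokens
        return False
    return True

print(valid_image_url("https://example.com/images/shoe.png"))  # True
print(passes_generic_string_rules("Nike Nike Air Zoom"))        # False (repeated token)
```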

Results and Impact

This data quality effort helped us develop benchmarks, and for the first time the engineering teams had a sense of the quality of the data, which they used to constantly improve their algorithms.

Impact: Precision/Recall of fields was up by 5-10%.

Earlier, the field team used to report qualitative metrics based on anecdotal evidence. After we built quantitative data quality metrics, everyone on the field team could back their statements with real numbers and approach prospects with more confidence.



Also published on Medium.
