Data is at the core of the Indix universe, and assuring and continuously improving data quality is one of the most important challenges we tackle every day. This becomes particularly interesting when dealing with datasets that have more than a billion records and, to take the complexity a notch higher, when that data is unstructured.
As was pointed out in a previous post, quality is a critical requirement for turning data into insights and action. In this series of blog posts, I will share a few insights on how we at Indix are dealing with data quality challenges. I will elaborate on the various techniques that we use, but in this first post, let me set some context around the issues that make this such a hard problem to solve.
As is the case with most data-driven products, we gather product information from various sources, including crawling the internet and ingesting feeds. Most of these data sources neither follow a common structure (though there is hope that more sites will adopt schema.org) nor use the same standardized taxonomy to describe their products.
It is quite common for sites to change their structure periodically, which makes it harder for any crawling or parsing system to keep up with the flux. More and more sites are becoming dynamic and serving contextualized product and price information. Above all, such changes to structure and content are not predictable, although there are broad patterns that can be detected. For instance, the majority of sites change their structure before peak holiday seasons.
Building systems that are capable of tracking changes to both the structure and the content of hundreds of millions of pages across the internet in near real time is one of our biggest challenges. In the absence of such a responsive system, the data collected from these pages is bound to become stale and eventually useless.
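To make the idea of tracking structural change concrete, here is a minimal sketch of one possible approach, not a description of the Indix pipeline. It reduces a page to a fingerprint of its tag and class sequence, so that a price update leaves the fingerprint unchanged while a template redesign alters it. The names `structure_fingerprint` and `_TagCollector` are hypothetical, and the choice to include CSS classes in the fingerprint is an assumption made for illustration.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects opening tags (with their CSS classes) in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{'.'.join(sorted(classes.split()))}")


def structure_fingerprint(html: str) -> str:
    """Hash of the page's tag/class sequence, ignoring all text content."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    joined = "/".join(collector.tags)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical snapshots of the same product page over time.
old_page = '<div class="price"><span>9.99</span></div>'
price_changed = '<div class="price"><span>12.49</span></div>'
layout_changed = '<div class="cost"><b>12.49</b></div>'

same_structure = structure_fingerprint(old_page) == structure_fingerprint(price_changed)
new_structure = structure_fingerprint(old_page) != structure_fingerprint(layout_changed)
```

Comparing fingerprints between crawls separates content changes, which a parser can absorb, from structural changes, which may require the parser to be updated. An exact hash like this is brittle against trivial markup changes, so a real system would likely need a fuzzier comparison.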
One person’s definition of “good” is not the same as another’s, so having a consistent definition of data quality, and a calibration system that continuously measures quality against that definition at scale, is a critical component.
There are not many calibration techniques out there that work on a broad spectrum of datasets. This is a topic of ongoing research and is much talked about at conferences around the world. Moreover, there is no one-size-fits-all approach to choosing a calibration technique; it needs to be chosen based on the algorithm being used.
A significant portion of data quality challenges is compounded by bad input data. It is important to have robust checks and cleansing techniques that catch bad inputs early in a data pipeline. Even when relying on a well-structured input feed from a product supplier, it is highly likely that some UPCs or GTINs will not match the product’s title. This could be simple human error, or a deliberately made-up value entered to get past the mandatory UPC/GTIN requirement in the feeds.
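One inexpensive early check of this kind is validating the GTIN check digit. The sketch below is an illustration rather than the check Indix runs, and `is_valid_gtin` is a hypothetical name. It uses the standard GS1 mod-10 rule: moving right to left from the digit next to the check digit, weights alternate 3, 1, 3, and so on, and the check digit brings the weighted sum up to a multiple of 10.

```python
def is_valid_gtin(code: str) -> bool:
    """Return True if code is a GTIN-8/12/13/14 with a correct check digit."""
    digits = code.strip()
    if not (digits.isascii() and digits.isdigit()):
        return False
    if len(digits) not in (8, 12, 13, 14):
        return False

    body, check = digits[:-1], int(digits[-1])
    total = 0
    # Walk right to left; the digit next to the check digit gets weight 3.
    for i, ch in enumerate(reversed(body)):
        weight = 3 if i % 2 == 0 else 1
        total += int(ch) * weight

    return (10 - total % 10) % 10 == check


# Example feed values: the first two are well-formed, the rest are not.
results = {
    "036000291452": is_valid_gtin("036000291452"),    # UPC-A
    "4006381333931": is_valid_gtin("4006381333931"),  # EAN-13
    "036000291453": is_valid_gtin("036000291453"),    # wrong check digit
    "N/A": is_valid_gtin("N/A"),                      # placeholder text
}
```

A check like this only catches malformed or mistyped codes. A well-formed GTIN that belongs to a different product will still pass, so matching the code against the product’s title remains a separate problem.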
Using machine learning algorithms is a proven technique for solving unstructured data challenges. However, choosing the right algorithm or, if required, building a new algorithm to solve a specific problem remains a specialized skill. In addition, designing and running a data flow pipeline that repeatedly makes high-quality predictions in real time, matching the rate at which the data is ingested, is an even more specialized skill.
Code and configuration changes are something an engineering team can control tightly through better configuration management processes and tools. Changes to input data, by contrast, are controlled by the data producers, and they often force us to recompute earlier data processes and predictions. Such recomputes mean that data quality is not permanent and can regress. There is a lot of maturity in automating code quality checks; achieving similar levels of automation for data quality at scale, however, is challenging.
Watch this space for subsequent posts on how we are tackling these challenges.