In my last post, I explained how crawling the web to collect product data is a highly complex engineering problem, and touched on some of the ways we tackle it at Indix. The first post of this series offered a high-level overview of how we gather and structure product data. Today, I want to cover the importance of a unified product record format and some of the issues we face in data quality.
The parser is possibly the most complex pillar of our whole crawl ingestion system: each site has its own product page format, and extracting product information consistently across stores is neither easy nor straightforward. On top of this, Indix maintains a different crawl policy for each site, so there is no one-size-fits-all solution. A site's crawl policy depends on several factors, such as its traffic and how long it takes to respond. Once a product page is crawled, we need a parser specific to that store to extract data from the HTML and map it to the Indix taxonomy.
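In spirit, this per-store layer can be sketched as a registry of parsers that each turn a store's raw HTML into the same set of unified field names. The store name, selectors, and field names below are illustrative, not our production code:

```python
import re

# Registry mapping a store to its dedicated parser (illustrative sketch).
STORE_PARSERS = {}

def parser_for(store):
    """Decorator that registers a parsing function for one store."""
    def register(fn):
        STORE_PARSERS[store] = fn
        return fn
    return register

@parser_for("examplestore.com")
def parse_examplestore(html):
    # Store-specific patterns; the output keys are the unified field names.
    title = re.search(r'<h1 class="product-title">(.*?)</h1>', html, re.S)
    price = re.search(r'data-price="([\d.]+)"', html)
    return {
        "title": title.group(1).strip() if title else None,
        "price": float(price.group(1)) if price else None,
    }

def parse(store, html):
    """Dispatch a crawled page to the parser written for that store."""
    if store not in STORE_PARSERS:
        raise ValueError(f"no parser registered for {store}")
    return STORE_PARSERS[store](html)
```

The point of the sketch is the shape, not the regexes: every store gets its own extraction logic, but all of them emit the same schema.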
In addition to maintaining multiple parsers that collect data from different stores, we also need to ensure that breakages and site changes are handled properly – the target store’s technology is closely tied to our ability to gather data from it. A change in a product page template or tech stack can sharply reduce throughput from that site, and in some cases bring the data collection rate to zero for that store. Monitoring mechanisms need to be in place to catch these issues early so that they can be fixed quickly.
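As a rough illustration of the kind of monitoring involved, a simple check might compare a store's latest daily parse count against its recent average and alert on a sharp drop. The window and threshold below are made-up values, not our actual configuration:

```python
def throughput_alert(success_counts, window=7, drop_threshold=0.5):
    """Flag a store whose latest daily successful-parse count fell sharply
    below its recent average -- a common symptom of a template change.

    success_counts: daily successful-parse counts, oldest first.
    """
    if len(success_counts) < window + 1:
        return False  # not enough history to judge
    recent = success_counts[-(window + 1):-1]   # the `window` days before today
    baseline = sum(recent) / window
    latest = success_counts[-1]
    return baseline > 0 and latest < baseline * drop_threshold
```

A real system would also watch field-level extraction rates (e.g. the fraction of pages yielding a price), since a template change can leave page counts intact while silently nulling out one field.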
To complicate matters further, even within a single store there is sometimes no standard product page format. A large store like Amazon can have multiple templates for displaying product information – the templates for two different categories can be completely different from one another. For example, the gaming category presents information that is much more visual, like flash content and videos, while the toys and games category focuses more on descriptions and specifications. Our parsers have to handle these differences and extract information across all templates.
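One way to cope with multiple templates per store – sketched here with invented template names and selectors – is to try each known template parser in turn until one yields the required fields:

```python
import re

def parse_with_templates(html, template_parsers, required=("title",)):
    """Try each known template parser for a store until one extracts
    all the required fields; record which template matched."""
    for name, parser in template_parsers:
        record = parser(html)
        if all(record.get(field) for field in required):
            record["template"] = name
            return record
    return None  # no known template matched: a candidate for an alert

# Two hypothetical templates the same store might serve:
def media_template(html):
    match = re.search(r'<div id="videoTitle">(.*?)</div>', html)
    return {"title": match.group(1) if match else None}

def standard_template(html):
    match = re.search(r'<h1 id="title">(.*?)</h1>', html)
    return {"title": match.group(1) if match else None}
```

Pages that match no template feed naturally back into the monitoring described above, since they signal a layout we have not seen before.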
What we call the Indix taxonomy is our unified product record format: it defines how data extracted from web pages is organized into a single schema of our own design, covering both product and price fields. A unified product record taxonomy is necessary for two primary reasons: it lets us index the same attribute from different stores under one field name, and it gives us a schema we can deepen as ecommerce itself evolves.
Consider the unique identification number each retailer uses to refer to products in its own store – the SKU, as it is popularly known. We need to collect this identifier from multiple stores and index it under a single field name, so that customers trying to obtain data from a particular store can pull records using the combination of this field and the store reference. Additionally, Indix has iteratively improved the depth of the product taxonomy as ecommerce has evolved, to keep it as up to date as possible.
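To make the idea concrete, here is a hypothetical sketch of a unified record in which each store's native identifier field – whatever that store calls it – is normalized into a single `sku` field. The store names, field names, and schema are all illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductRecord:
    store: str
    sku: str          # the store's own identifier, under one unified name
    title: str
    price: float
    category: str     # node in the unified taxonomy

# Per-store mapping from the store's native field name to the unified field:
SKU_FIELD = {"storeA.com": "itemNumber", "storeB.com": "sku"}

def to_record(store, raw):
    """Map a store-specific raw dict into the unified schema."""
    return ProductRecord(
        store=store,
        sku=str(raw[SKU_FIELD[store]]),
        title=raw["title"],
        price=float(raw["price"]),
        category=raw.get("category", "uncategorized"),
    )
```

With every store's identifier landing in the same field, a `(store, sku)` pair becomes a consistent way to address a product regardless of where it was crawled.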
As with any significantly large data corpus, data quality problems exist within the Indix product data corpus as well. The highly heterogeneous nature of the data sources, volume of data being collected, and the sheer complexity of the whole process means that ensuring the data collected is of high quality is a significant challenge.
Beyond cases where data quality takes a hit because a parser is broken or a template pattern is not handled properly, there is a significant chunk of cases where the source data itself is bad – there is a lot of dirty data out there. The challenge lies in having a robust mechanism to ensure that such bad data does not make its way into the Indix Cloud Catalog. We are constantly building and evolving systems that detect anomalies and outliers to filter it out.
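As one simplified example of this kind of screening, prices collected for what should be comparable products can be checked with a modified z-score based on the median absolute deviation. The 3.5 cutoff is a common rule of thumb for this statistic, not our actual pipeline:

```python
import statistics

def price_outliers(prices, k=3.5):
    """Return prices far from the group median, using the modified
    z-score (based on the median absolute deviation, MAD)."""
    med = statistics.median(prices)
    mad = statistics.median(abs(p - med) for p in prices)
    if mad == 0:
        # All values identical except stragglers: flag anything off-median.
        return [p for p in prices if p != med]
    # 0.6745 scales MAD to be comparable to a standard deviation.
    return [p for p in prices if abs(0.6745 * (p - med) / mad) > k]
```

Median-based statistics are a natural fit here because the dirty values themselves would distort a mean-and-standard-deviation check; in practice such filters are one layer among several before data reaches the catalog.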
To watch the full recording of our webinar on crawling and ingesting product data from the web, please click here.
Also published on Medium.