Unified Product Records and Issues of Data Quality

In my last post, I explained how the crawling infrastructure for collecting product data from the web is a highly complex engineering process, and touched upon some of the ways we tackle these problems at Indix. The first post of this series offered a high-level overview of how we gather and structure product data. Today, I want to cover the importance of a unified product record format and some of the issues that we face in data quality.

Challenges With Data Collection

The parser is possibly the most complex pillar of our whole crawl ingestion system: each site has its own product page format, so extracting product information across stores is neither easy nor straightforward. On top of this, since Indix has a different crawl policy for each site, there's no one-size-fits-all solution. The crawl policy for each site depends on several factors, such as site traffic or the time the site takes to respond. And once a product page is crawled, we need a store-specific parser to extract data from the HTML and map it to the Indix taxonomy.

In addition to maintaining multiple parsers that collect data from different stores, we also need to handle breakages and site changes gracefully – the target store's technology is closely tied to our ability to gather data from it. A change in a product page template or tech stack can adversely impact the throughput from that site, and in some cases bring the data collection rate for that store to zero. Monitoring mechanisms need to be in place to catch these issues early, so that they can be fixed quickly.
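One simple form such monitoring can take is tracking, per store, the fraction of crawled pages that parse successfully over a sliding window, and flagging the store when that yield collapses. The sketch below is a hypothetical illustration of the idea, not Indix's actual monitoring system; class and parameter names are made up.

```python
from collections import deque

class ParseYieldMonitor:
    """Tracks parse success per store over a sliding window and
    flags stores whose yield drops below a threshold."""

    def __init__(self, window=100, alert_threshold=0.5):
        self.window = window
        self.alert_threshold = alert_threshold
        self.results = {}  # store name -> deque of booleans

    def record(self, store, parsed_ok):
        buf = self.results.setdefault(store, deque(maxlen=self.window))
        buf.append(parsed_ok)

    def yield_rate(self, store):
        buf = self.results.get(store)
        return sum(buf) / len(buf) if buf else 1.0

    def needs_attention(self, store):
        buf = self.results.get(store, ())
        # Only alert once a full window of evidence has accumulated.
        return len(buf) == self.window and self.yield_rate(store) < self.alert_threshold
```

A monitor like this catches the "template changed overnight, parser now extracts nothing" failure mode quickly, because the yield for that store drops toward zero within one window of crawls.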

Also, to make matters more complicated, even within a single store, sometimes there is no standard product page format. A large store like Amazon can have multiple templates for displaying product information – the templates for two different categories can be completely different from one another. For example, the gaming category will have information that is much more visual, like flash content and videos. On the other hand, the toys and games category presents information in a different format, focusing more on description and specifications. Our parsers should be able to handle these different formats and extract information across all templates.
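Handling multiple templates within one store usually comes down to keeping a table of per-template extraction rules and dispatching on the template a page uses. The following is a minimal sketch of that pattern using only the Python standard library; the template names and CSS classes are invented for illustration and do not reflect any real store's markup.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Pulls the product title out of a page, given the tag and
    class a particular template uses to mark it up."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self._capturing = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and dict(attrs).get("class") == self.cls:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and self.title is None:
            self.title = data.strip()
            self._capturing = False

# Hypothetical per-template rules: which markup holds the title.
TEMPLATE_RULES = {
    "template_a": ("h1", "product-title"),
    "template_b": ("span", "item-name"),
}

def extract_title(html, template):
    tag, cls = TEMPLATE_RULES[template]
    parser = TitleExtractor(tag, cls)
    parser.feed(html)
    return parser.title
```

Adding support for a new template then means adding one entry to the rules table rather than writing a new parser from scratch.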

The Need for a Unified Product Schema

What we call the Indix taxonomy is our unified product record format: it defines how data extracted from web pages is organized into a schema we developed, covering both product and price fields. A unified product record taxonomy is necessary for two primary reasons:

  • The data collected from the stores constitutes a class of information called gathered fields. Indix algorithms act on top of this data to produce a set of derived attributes. For example, the CategoryNamePath field results from the classification algorithm, which assigns a specific category to a particular product. Similarly, the BrandName field is derived through the brand extraction algorithm, which extracts the brand associated with a product from the gathered fields on its page. For an algorithm like the matcher, which takes multiple fields as inputs and determines whether two products match, the data must be standardized and normalized in a pre-agreed taxonomy so that comparison and matching across stores can be done effectively.
  • Also, when we license the data, it needs to conform to a specific output schema so that customers can act and build insights on top of it. Indix has developed a comprehensive product record taxonomy that it uses to collect, store, and distribute product information. This includes fields like title, attributes, specification text, and manufacturer name, which are product-specific, as well as fields like sale price, availability, and shipping information, which are more store-specific.
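To make the gathered-versus-derived distinction concrete, a heavily simplified product record might look like the sketch below. The field names loosely follow the ones mentioned above, but this is an illustrative toy, not the actual Indix schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProductRecord:
    """A toy unified product record (not the real Indix taxonomy)."""
    # Gathered fields: parsed directly from the product page.
    store: str
    url: str
    title: str
    sale_price: Optional[float] = None
    availability: Optional[str] = None
    specification_text: Optional[str] = None
    # Derived fields: produced by algorithms over the gathered data.
    category_name_path: Optional[str] = None  # from the classifier
    brand_name: Optional[str] = None          # from brand extraction
```

Because every parser emits this one shape, downstream algorithms like the matcher can compare records from different stores field by field instead of dealing with each store's raw page layout.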

Consider the unique identification number each retailer uses to refer to products in its own store – popularly known as the SKU. We need to collect this identifier from multiple stores and index it under a single field name, so that customers trying to obtain data from a particular store can pull products using the combination of this field and the store reference. Additionally, Indix has iteratively deepened its product taxonomy as ecommerce has evolved, to keep it as up to date as possible.
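The normalization step for an identifier like this can be sketched as a per-store mapping from whatever raw field the store exposes to the one canonical field we index. The store domains and field names below are hypothetical.

```python
# Hypothetical mapping from each store's raw identifier field to
# the single canonical "sku" field used in the unified record.
STORE_ID_FIELDS = {
    "examplestore.com": "item_number",
    "othershop.com": "product_code",
}

def normalize_sku(store, raw_record):
    """Copy the store-specific identifier into the canonical field,
    leaving the rest of the record untouched."""
    record = dict(raw_record)
    id_field = STORE_ID_FIELDS.get(store)
    if id_field and id_field in record:
        record["sku"] = record.pop(id_field)
    return record
```

After this step, a query like "SKU X at store Y" works uniformly, regardless of what the store originally called the field.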

Data Quality Challenges

As with any sufficiently large data corpus, data quality problems exist within the Indix product data corpus as well. The highly heterogeneous nature of the data sources, the volume of data being collected, and the sheer complexity of the whole process mean that ensuring the data collected is of high quality is a significant challenge.

On top of cases where data quality takes a hit because parsers break or template patterns are not handled properly, there is a significant chunk of cases where the source data itself is bad. There is a lot of dirty data out there. The challenge lies in having a robust mechanism to ensure that such bad-quality data does not make its way into the Indix Cloud Catalog. We are constantly evolving and building systems to detect anomalies and outliers and filter out bad data.
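As one small example of the kind of outlier filter such a system might include, the modified z-score based on the median absolute deviation flags prices that sit implausibly far from the rest of a product's observed prices. This is a generic statistical sketch, not a description of Indix's actual detection pipeline.

```python
import statistics

def price_outliers(prices, threshold=3.5):
    """Flag prices far from the median using the modified z-score
    (based on the median absolute deviation), which is robust to
    the outliers it is trying to detect."""
    median = statistics.median(prices)
    mad = statistics.median(abs(p - median) for p in prices)
    if mad == 0:
        # All prices identical (or nearly so): nothing to flag.
        return [False] * len(prices)
    return [abs(0.6745 * (p - median) / mad) > threshold for p in prices]
```

A $2,000 price scraped for a product whose other observations cluster around $20 – a common symptom of a parser picking up the wrong number on the page – gets flagged before it reaches the catalog.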

To watch the full recording of our webinar on crawling and ingesting product data from the web, please click here.


