Working With Product Data on the Internet: Part 3

This is the third and last in a series of posts about the challenges of working with product data on the internet.

For the past few weeks, we’ve looked at some of the issues and challenges of dealing with product data on the internet. Finally, we are done with the wailing and are now at the solution-ing. Hurray! In this last post of the series, we’ll examine some of the steps, tools, and techniques for working better with web-based product data. These tips should not only make your life easier but also increase the value you are able to extract from the data you work with.

How Do You Clean Dirty Data?

At Indix, our energies over the past couple of years have been focused on mastering the problems you face today with product data. We deeply understand the complexity and challenges of scaling your product information corpus, the issues caused by non-standardized category and brand taxonomies, and the value you place on precision.

While we cannot claim total victory over the nuances of the internet, we have managed to make a significant dent in the problem. The table below lists the current size of the Indix product information corpus, the depth of our taxonomy, and, above all, our ability to match products across the expanse of the internet regardless of how the sites that host them organize or render them.

Today, thousands of enterprise and freelance developers leverage this big data corpus of product information, both in real time via the Indix API and via our bulk and feed output methods. They use Indix to access, slice, and extract standardized product information, which then passes through their business logic to aid critical decisions. The onus of managing the flux, freshness, and structure of the underlying data from 2,500+ web properties is left to Indix.


Dirty Laundry Needs a Good Wash

If you consider the non-standardized, raw product data on the internet as dirty laundry, then Indix is the laundromat that turns this data into crisp laundry that smells like sea breeze. 🙂


The Indix Dirty Data Washing Machine

Indix collates product data from many sources: web properties as well as data feeds acquired from retailers, brands, and other partners.

This huge load of data goes into the Indix spin cycle (elaborated below), which organizes the data around key pivots such as a standardized brand dictionary and a common category taxonomy across sources. Next, it is refined to drop data that does not meet strict quality guidelines. Finally, the data is structured into a simple-to-understand JSON format and indexed.
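To make that concrete, here is a minimal sketch of what one cleaned, structured product record might look like. The field names and values are illustrative assumptions, not the exact Indix schema.

```python
import json

# A hypothetical product record after cleaning and structuring.
# Field names are illustrative, not the exact Indix schema.
record = {
    "title": "Acme Stainless Steel Water Bottle, 32 oz",
    "brand": "Acme",  # normalized against a brand dictionary
    "categoryPath": "Home & Kitchen > Drinkware > Water Bottles",
    "upc": "012345678905",
    "storeName": "example-store.com",
    "listPrice": 24.99,
    "salePrice": 19.99,
    "availability": "IN_STOCK",
}
print(json.dumps(record, indent=2))
```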

On the other side of the index are three broad data access mechanisms. The first is a real-time API that lets a user search and query the corpus by keyword, category, brand, price, store, UPC, and a host of other facets. The second is a bulk API, with which you can post a batch of query/lookup tasks and be notified when the bulk job completes. Lastly, a feed output mechanism lets you tailor the output to get the rows and fields of interest from any point in the Indix product wash cycle.
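As a rough illustration of the real-time path, the sketch below queries a product search endpoint by keyword and brand. The endpoint URL, parameter names, and response shape here are assumptions for the sake of example, not the actual Indix API contract.

```python
import requests

# Hypothetical real-time product search: the endpoint, parameters, and
# response fields are illustrative placeholders.
resp = requests.get(
    "https://api.example.com/v2/products/search",
    params={"q": "water bottle", "brand": "Acme", "app_key": "YOUR_APP_KEY"},
    timeout=10,
)
resp.raise_for_status()
for product in resp.json().get("products", []):
    print(product["title"], "-", product.get("salePrice"))
```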

A Peek Into the Indix Product Wash Cycle

Over the years, as we ventured deeper and broader into the product information space, we added one layer after another to organize, refine, and structure our ever-growing product information corpus. Below is a zoomed-in view of the product information processing cycle we have built and employ at Indix today.


Discrete Steps in the Spin Cycle and Their Function

Seeder: The seeder generates a list of URLs that contain product and price information.
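A seeder can be as simple as walking a site’s sitemap and keeping the URLs that look like product pages. In the sketch below, the sitemap URL and the "/product/" pattern are illustrative assumptions.

```python
import re
import requests
import xml.etree.ElementTree as ET

# Hypothetical seeder: read a sitemap and keep product-looking URLs.
# The sitemap URL and the "/product/" pattern are illustrative.
SITEMAP = "https://example-store.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

xml = requests.get(SITEMAP, timeout=10).text
root = ET.fromstring(xml)
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
product_urls = [u for u in urls if re.search(r"/product/", u)]
print(f"seeded {len(product_urls)} candidate product URLs")
```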

Crawler: The crawler fetches the HTML page (containing product and price information) from the site and stores it in a file system.
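A minimal crawler sketch along those lines: fetch each seeded URL and persist the raw HTML to the file system, keyed by a hash of the URL. The hard-coded URL list stands in for the seeder’s output.

```python
import hashlib
import pathlib
import requests

# Hypothetical crawler: fetch each seeded URL and persist the raw HTML,
# named by a hash of the URL. The URL list is an illustrative stand-in
# for the seeder's output.
product_urls = ["https://example-store.com/product/123"]

store = pathlib.Path("crawl_store")
store.mkdir(exist_ok=True)

for url in product_urls:
    html = requests.get(url, timeout=10).text
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    (store / name).write_text(html, encoding="utf-8")
```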

Parser: The parser extracts all the relevant information from the HTML pages and stores it in our database. Extracted information includes title, list price, sale price, breadcrumb, UPC, MPN, SKU, availability, shipping information, and product specifications.
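Parsing is inherently site-specific. Here is a sketch using BeautifulSoup; the CSS selectors are made-up placeholders, and a production parser needs its own rules per site (or per site template).

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical parser: the CSS selectors below are placeholders; in
# practice every site (or site template) needs its own extraction rules.
def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    def pick(selector: str):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    return {
        "title": pick("h1.product-title"),
        "listPrice": pick("span.list-price"),
        "salePrice": pick("span.sale-price"),
        "breadcrumb": pick("nav.breadcrumb"),
        "upc": pick("span.upc"),
    }
```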

Classifier: A machine learning based algorithm buckets products into categories in the Indix taxonomy. This way, no matter which source the data is ingested from, it ends up in a standardized category taxonomy.
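One common way to build such a classifier is a bag-of-words model over product titles. A toy sketch with scikit-learn, where the training titles and category labels are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy classifier: bucket product titles into taxonomy categories.
# Training data and category labels are illustrative.
titles = [
    "Acme Stainless Steel Water Bottle 32 oz",
    "Globex Insulated Travel Mug 16 oz",
    "Acme Men's Running Shoes Size 10",
    "Initech Women's Trail Sneakers",
]
categories = ["Drinkware", "Drinkware", "Shoes", "Shoes"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(titles, categories)
print(model.predict(["Hooli Kids' Running Shoes"]))  # likely ['Shoes']
```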

Attribute Extractor: This process extracts deeper attributes from products, including brand, color, material, where manufactured, where designed, etc.
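Attribute extraction often combines dictionaries and patterns. A toy sketch that pulls brand and color out of a title using small lookup sets (both invented for illustration):

```python
import re

# Toy attribute extractor: the brand and color dictionaries are illustrative.
BRANDS = {"acme", "globex", "initech"}
COLORS = {"black", "white", "red", "blue", "silver"}

def extract_attributes(title: str) -> dict:
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return {
        "brand": next((t for t in tokens if t in BRANDS), None),
        "color": next((t for t in tokens if t in COLORS), None),
    }

print(extract_attributes("Acme Water Bottle, Silver, 32 oz"))
# -> {'brand': 'acme', 'color': 'silver'}
```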

Matcher: The matcher uses a series of algorithms that logically reduce the data into smaller chunks. Its intent is to differentiate between variants and match similar product variants across all of the sources that Indix either crawls or acquires via feeds.
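The “reduce into smaller chunks” step is commonly called blocking. The sketch below blocks candidate records by brand and then fuzzy-matches titles within each block; the similarity threshold and fields are illustrative choices, not the actual Indix algorithms.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Toy matcher: block by brand, then pair up records in a block whose
# title similarity clears a threshold. The 0.8 cutoff is illustrative.
def match_products(records, threshold=0.8):
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec.get("brand")].append(rec)  # blocking step
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                a, b = block[i]["title"], block[j]["title"]
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    yield block[i], block[j]
```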

Derived Attributes Generator: The Derived Attributes Generator calculates values such as average minimum price, average maximum price, whether a product is on sale or in stock, and whether the product had a price change.
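A toy version of a few such derived values, assuming a list of per-store offers for a single matched product (the real system aggregates far more, including averages over time):

```python
# Toy derived-attribute generator over per-store offers for one product.
# The offers and field names are illustrative.
offers = [
    {"store": "store-a", "listPrice": 24.99, "salePrice": 19.99},
    {"store": "store-b", "listPrice": 22.00, "salePrice": 22.00},
]

min_price = min(o["salePrice"] for o in offers)
max_price = max(o["salePrice"] for o in offers)
on_sale = any(o["salePrice"] < o["listPrice"] for o in offers)

print({"minPrice": min_price, "maxPrice": max_price, "onSale": on_sale})
```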

Indexer: The Indexer indexes the data so that it can be served through different views; for example, there is a search index and an insights index.
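At its core, indexing builds lookup structures over the cleaned records. A toy inverted index over titles shows the idea:

```python
from collections import defaultdict

# Toy search index: map each title token to the record ids containing it.
records = {1: "acme water bottle", 2: "globex travel mug", 3: "acme mug"}

index = defaultdict(set)
for rec_id, title in records.items():
    for token in title.split():
        index[token].add(rec_id)

print(sorted(index["mug"]))  # -> [2, 3]
```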

Insights Generator: The Insights Generator derives and delivers insights tailored to the application in use.

By leveraging an external source for well-formed, structured, and standardized product information at scale, you can focus your time on avenues more deserving of it, such as building and refining the business logic and analytics that sit on top of this product information.

