The Indix Cloud Catalog is a comprehensive collection of structured product information. Previously, we hosted a webinar on how “dirty” data on the internet really is. Now, I’d like to give you some insight into how we go about cleaning that up and making it usable. The processes of gathering and structuring product data are based on three main pillars.
Ingress: Includes all the methods used to gather data and bring it into the Indix sub-systems. The two most widely used methods for ingress are crawling and feed ingestion.
Process: Data in its raw form is not consumable. Machine learning algorithms and other methods are used to extract brands and different attributes, categorize and match products, and so on.
Egress: This includes the ways to consume data from the Indix Cloud Catalog – real-time API endpoints, bulk API endpoints, feeds, and the Reports App.
Let’s talk more about the ways to get to the meat – Ingress. As mentioned before, we use two primary mechanisms to ingress data.
(i) Crawl Ingestion: Gathering data directly from a store's website.
(ii) Feed Ingestion: Receiving content periodically in an agreed-upon format. Feeds are usually available from aggregators, brands, or, in some cases, retailers.
In very general terms, crawling is a pull mechanism: the onus of navigating, gathering, and structuring the data falls on the collector. Feed ingestion, by contrast, is mostly about mapping the data delivered by the feed provider in the pre-agreed format to the requisite structured format for consumption.
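To make the contrast concrete, feed ingestion is essentially a field-mapping exercise. Here is a minimal sketch in Python; the feed field names and internal schema keys are hypothetical, not an actual Indix or provider schema:

```python
# Hypothetical field map: the feed provider's agreed-upon field names (left)
# to internal product-schema keys (right). Both sides are illustrative.
FEED_FIELD_MAP = {
    "item_title": "title",
    "item_price": "price",
    "gtin": "upc",
    "landing_url": "product_url",
}

def map_feed_record(raw: dict) -> dict:
    """Translate one raw feed record into the internal structured format,
    silently dropping any fields the map does not cover."""
    return {internal: raw[feed]
            for feed, internal in FEED_FIELD_MAP.items()
            if feed in raw}

record = map_feed_record(
    {"item_title": "Acme Mug", "item_price": "9.99", "gtin": "0012345678905"}
)
# record == {"title": "Acme Mug", "price": "9.99", "upc": "0012345678905"}
```

Crawling has no such pre-agreed contract, which is why it needs the heavier machinery described below.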
Let’s get a high-level view of the mechanism and the system-level building blocks that are required to get data in through crawl ingestion. At Indix, there are three building blocks that constitute our ingestion platform.
Scheduler: Extracts the list of all product URLs and listing URLs that contain product information from the stores. The scheduler, as the name implies, does the job of scheduling these URLs based on feedback signals from other systems (such as the matcher), the freshness of data in the API, and coverage gaps for the site, in compliance with our crawl politeness policy.
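The scheduling idea above can be sketched as a priority queue that ranks URLs by feedback signals. The signal names and the weighting below are illustrative assumptions, not Indix's actual scoring formula:

```python
import heapq

class UrlScheduler:
    """Sketch of a crawl scheduler: a priority queue over URLs where
    feedback signals (staleness, coverage gaps) raise a URL's urgency."""

    def __init__(self):
        self._heap = []  # (priority, url); lower value = crawled sooner

    def schedule(self, url, hours_since_last_crawl, coverage_gap):
        # Assumed weighting: staler data and larger coverage gaps both
        # push a URL toward the front of the queue.
        priority = -(hours_since_last_crawl + 10 * coverage_gap)
        heapq.heappush(self._heap, (priority, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1]

scheduler = UrlScheduler()
scheduler.schedule("http://store.example/p/1", hours_since_last_crawl=1, coverage_gap=0)
scheduler.schedule("http://store.example/p/2", hours_since_last_crawl=48, coverage_gap=2)
# next_url() now returns the staler, less-covered /p/2 first
```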
Fetcher: Fetches the product pages (HTML content) corresponding to the scheduled URLs from the target site, with traffic routed through proxy machines purchased from partners. This system also handles the rendering of AJAX-based sites, where product information is fetched dynamically.
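A fetcher that routes traffic across a proxy pool might look like the following sketch; the proxy addresses and the round-robin rotation are assumptions for illustration, and the actual network call is left commented out:

```python
from urllib import request

class Fetcher:
    """Sketch of a fetcher that rotates requests across a proxy pool."""

    def __init__(self, proxies):
        self._proxies = proxies
        self._i = 0

    def _next_proxy(self):
        # Simple round-robin over the pool; real systems would also
        # track proxy health and retire failing machines.
        proxy = self._proxies[self._i % len(self._proxies)]
        self._i += 1
        return proxy

    def fetch(self, url, opener_factory=request.build_opener):
        proxy = self._next_proxy()
        handler = request.ProxyHandler({"http": proxy, "https": proxy})
        opener = opener_factory(handler)
        return opener.open(url).read()

fetcher = Fetcher(["http://203.0.113.10:8080", "http://203.0.113.11:8080"])
# fetcher.fetch("http://store.example/product/123")  # would go via the first proxy
```

Rendering AJAX-based pages needs a headless browser on top of this, since the product data only appears after scripts run.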
Parser: The HTML content from all sources is not in the same format, and this is where the parser comes in. It extracts the data from the HTML and stores it in the Indix product model schema.
The role of the parser is critical – even though standards like schema.org are emerging, they are at a very nascent stage. The data gathered is useful only when it is in a standardized format, so that it can be consumed by downstream systems.
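As one example of what standardized extraction looks like when a site does use schema.org markup, a parser can pull microdata (itemprop attributes) out of a product page. This is a minimal sketch using Python's stdlib HTMLParser; real pages usually need per-site parse rules on top:

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Collect schema.org microdata (itemprop attributes) into a flat dict."""

    def __init__(self):
        super().__init__()
        self.product = {}
        self._current = None  # itemprop whose text content we are awaiting

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            if "content" in attrs:
                # e.g. <meta itemprop="price" content="9.99">
                self.product[attrs["itemprop"]] = attrs["content"]
            else:
                self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current and data.strip():
            self.product[self._current] = data.strip()
            self._current = None

page = '<h1 itemprop="name">Acme Mug</h1><meta itemprop="price" content="9.99">'
parser = MicrodataParser()
parser.feed(page)
# parser.product == {"name": "Acme Mug", "price": "9.99"}
```

Since most sites do not carry such markup, the bulk of the parser's work is site-specific extraction into the common schema.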
Output from the parser is written to two models: the price record and the product record. The product record holds the information that uniquely identifies a particular product, i.e., catalog information. This includes, but is not limited to, the product title, attributes, image URL, UPC, MPN, and specifications, none of which change frequently over time. The price record holds information such as price, availability, shipping information, and ratings, which change frequently and are not characteristics of the product, per se.
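The split between the two models might be sketched as plain records like those below; the field lists are illustrative, not the exact Indix schema:

```python
from dataclasses import dataclass, field

@dataclass
class ProductRecord:
    """Slowly changing catalog information that identifies a product."""
    title: str
    upc: str = ""
    mpn: str = ""
    image_url: str = ""
    attributes: dict = field(default_factory=dict)

@dataclass
class PriceRecord:
    """Frequently changing, per-store offer data, keyed to a product."""
    product_upc: str
    price: float
    availability: str = "in_stock"
    shipping: str = ""
    rating: float = 0.0

catalog = ProductRecord(title="Acme Mug", upc="0012345678905")
offer = PriceRecord(product_upc="0012345678905", price=9.99)
```

Keeping the fast-moving offer data out of the catalog record means a price change does not force a rewrite of the product's identifying information.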
Additionally, one implicit assumption when crawling at scale is that the crawling system adheres to a crawl politeness policy: in other words, the resources of the store selling the products are not hogged just to service the crawling system. In my next post, I will elaborate on this and other challenges in crawling.
If you want to watch a recording of our webinar on gathering and ingesting product data, click here.