In the first two blog posts of this four-part series, I talked about (i) the Indix crawl infrastructure for collecting product data from the web, and (ii) the critical components of product data. These pieces come together to help us build our Cloud Catalog. In this post, I will go a little deeper into why crawling, as an infrastructure for collecting product data from the web, is a highly complex engineering process. I will also touch upon some of the mechanisms that we use at Indix to tackle these challenges.
An ecommerce store, like any hosted website, may limit the number of concurrent requests it can handle so that people browsing the site get a good experience (for example, fast page loads). At the same time, web spiders need to crawl its pages so that they can be indexed by search engines, which is a core part of how online stores are discovered. Typically, online stores publish a robots.txt file with directives that crawlers are expected to follow when crawling the store.
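As a minimal sketch of how a crawler might honor these directives, the example below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example-crawler` agent name, and the `store.example.com` URLs are made up for illustration; this is not a description of the Indix implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary store.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /cart/
Disallow: /checkout/
"""

AGENT = "example-crawler"  # assumed user-agent name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Product pages are not disallowed, cart pages are.
product_allowed = rules.can_fetch(AGENT, "https://store.example.com/products/123")
cart_allowed = rules.can_fetch(AGENT, "https://store.example.com/cart/view")

# Crawl-delay is a non-standard but widely used directive; None if absent.
requested_delay = rules.crawl_delay(AGENT)
```

In a real crawler the file would be fetched from the store (for example with `set_url` and `read`) and refreshed periodically, since stores can change their directives at any time.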
Beyond limiting the total number of requests sent to the target store, a crawling system also needs to space requests out judiciously (a politeness policy) so that bursts of concurrent requests do not degrade service at the target store’s end. In developing a crawl policy, Indix considers multiple signals, such as the site's traffic (number of page views) and average response times, and uses them to configure a sufficient crawl delay and a judicious number of agents.
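One simple way to picture a politeness policy is a per-host gate that hands out fetch slots spaced by a crawl delay. The sketch below is an assumption-laden illustration: the `PolitenessGate` class, its default delay, and the injectable clock are hypothetical names chosen for this example, not part of the Indix system.

```python
import time
from urllib.parse import urlparse


class PolitenessGate:
    """Spaces out requests to the same host by a per-host delay (hypothetical helper)."""

    def __init__(self, default_delay=2.0, clock=time.monotonic):
        self.default_delay = default_delay  # seconds between requests to one host
        self.clock = clock                  # injectable so the gate can be tested
        self.delays = {}                    # host -> custom delay in seconds
        self.next_allowed = {}              # host -> earliest time of the next slot

    def set_delay(self, host, seconds):
        """Override the delay for one store, e.g. from its robots.txt."""
        self.delays[host] = seconds

    def wait_time(self, url):
        """Reserve the next slot for the URL's host; return seconds to wait first."""
        host = urlparse(url).netloc
        now = self.clock()
        slot = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = slot + self.delays.get(host, self.default_delay)
        return slot - now
```

A fetch loop would call `time.sleep(gate.wait_time(url))` before each request. Requests to different hosts do not block each other, while repeated requests to one host are pushed further and further out.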
Not adhering to the crawl policy and politeness delays can, at the extreme, even have legal implications, and at the very least can lead to the site blocking your crawls. So it’s very important to have a robust infrastructure that takes feedback from the site and lets you modify your crawl policy at the store level.
Let’s take a peek into how we tackle some of these issues.
Indix has an adaptive crawl feedback mechanism that uses the signals coming out of previous crawls of a store to dynamically change the settings for subsequent crawls of the same store. Our algorithmic scheduler ensures that response codes and response times from crawls are accounted for when arriving at the optimal settings for the store's subsequent crawls.
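To make the idea of crawl feedback concrete, here is a small sketch that adjusts a store's crawl delay from the response codes and response times of its previous crawl. The function name, the thresholds, and the back-off factors are all assumptions picked for illustration; the actual Indix scheduler is not described in this post.

```python
from statistics import mean


def adjust_delay(current_delay, status_codes, response_times,
                 min_delay=1.0, max_delay=60.0, slow_threshold=2.0):
    """Return the crawl delay (seconds) to use for the next crawl of a store.

    All thresholds here are illustrative assumptions.
    """
    # 429 (Too Many Requests) and 503 (Service Unavailable) suggest overload.
    throttled = sum(1 for code in status_codes if code in (429, 503))
    error_rate = throttled / len(status_codes) if status_codes else 0.0
    avg_time = mean(response_times) if response_times else 0.0

    if error_rate > 0.05 or avg_time > slow_threshold:
        new_delay = current_delay * 2      # back off quickly when the store struggles
    elif error_rate == 0 and avg_time < slow_threshold / 2:
        new_delay = current_delay * 0.9    # speed up slowly when the store is healthy
    else:
        new_delay = current_delay          # otherwise keep the current setting

    # Never go below the politeness floor or above the ceiling.
    return min(max_delay, max(min_delay, new_delay))
```

Backing off fast and speeding up slowly is a deliberate asymmetry in this sketch: overloading a store is far more costly than crawling it a little slower than necessary.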
Keep in mind that the crawls done by Indix also depend on the technology used by the target store. Fetch times differ wildly, ranging from a few milliseconds to multiple seconds in the case of Ajax-based sites. In addition, even within a single store, there may be multiple page templates that need to be handled so that product data from all the pages is extracted and indexed in our Cloud Catalog.
An additional complexity on top of this is that the actual product page data changes frequently (fields like price and availability tend to change more often than others), and this rate of change is not the same across stores. We need scheduling algorithms that cover both ends of the spectrum: (i) not wasting crawls by repeatedly fetching product pages that have not changed, and (ii) crawling the product pages that change frequently often enough that Indix keeps abreast of the changes at the target store’s end. This ensures that the data in the Indix Cloud Catalog is kept as up to date as possible.
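One common way to balance these two goals is to adapt each page's recrawl interval to how often it is observed to change. The sketch below is only an illustration of that general technique: the fingerprint over price and availability, the halving and 1.5x factors, and the one-hour and one-week bounds are assumptions, not the scheduling algorithm Indix uses.

```python
import hashlib


def fingerprint(price, availability):
    """Hash the fields we care about so a change can be detected cheaply."""
    return hashlib.sha256(f"{price}|{availability}".encode("utf-8")).hexdigest()


def next_interval(current_hours, changed, min_hours=1.0, max_hours=168.0):
    """Shrink the recrawl interval if the page changed, otherwise grow it.

    The factors and bounds are illustrative assumptions.
    """
    factor = 0.5 if changed else 1.5
    return min(max_hours, max(min_hours, current_hours * factor))


# Example: a page last crawled on a 24-hour interval.
old = fingerprint("9.99", "in stock")
new = fingerprint("8.99", "in stock")
interval = next_interval(24.0, changed=(old != new))
```

Pages whose price or availability keeps changing drift toward the minimum interval, while static pages drift toward the maximum, so crawl capacity moves to where the data is actually changing.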
Site patterns (for storing and displaying product information) also change frequently, so while building maintenance systems that monitor the success of crawls across these stores, we need to factor in the potential breakages caused by site pattern changes and have mechanisms to heal as fast as possible.
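A very simple form of such monitoring is to track, per store, what fraction of crawled pages still yield the fields we expect, and to flag the store when that fraction drops. The sketch below assumes parsed pages are plain dictionaries and that `title` and `price` are the required fields; both the field list and the 80% threshold are hypothetical.

```python
REQUIRED_FIELDS = ("title", "price")  # assumed minimum for a usable product record


def extraction_success_rate(records):
    """Fraction of parsed pages in which every required field is present and non-empty."""
    if not records:
        return 0.0
    ok = sum(1 for record in records
             if all(record.get(field) for field in REQUIRED_FIELDS))
    return ok / len(records)


def needs_attention(records, threshold=0.8):
    """Flag a store whose extraction success rate fell below the threshold."""
    return extraction_success_rate(records) < threshold


# Example: half of the pages lost a field, as might happen after a template change.
sample = [
    {"title": "Widget A", "price": "9.99"},
    {"title": "Widget B", "price": None},
    {"title": "", "price": "4.50"},
    {"title": "Widget D", "price": "12.00"},
]
broken = needs_attention(sample)
```

A sudden drop for one store, while other stores stay steady, is a strong hint that the store changed its page layout rather than that the crawler itself is failing.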
In the next and final post in the series, we will touch upon how we map the data collected from these stores to a Unified Product Record format, and how we handle data quality issues.
If you want to watch a recording of our webinar on gathering and ingesting product data, click here now.
Also published on Medium.