Here at Indix, we are building a comprehensive database of products. For us, “comprehensive” means three things: the database should contain information on every product in the world, it should hold all the available information on any particular product, and that information should be up to date. Given the scale of the task, the only way to solve this problem is to replace manual collection and curation of data with algorithms and models. This leads us to the subject of this post: what are the most impactful and challenging machine learning problems in this space?
Crawling a large number of pages spread across multiple domains is somewhat of a solved problem when the page content remains fairly static. But crawling retail websites, where prices can change as many as five times a day, while still respecting politeness delays remains an open problem. There are three parts to this where machine learning can make significant contributions.
The first is determining which prices are more likely to change than others, so that a given budget of HTTP requests over a time window can be used to maximize the freshness of information. However, the likelihood of a price change depends on a variety of factors that are constantly shifting: the category of the product, the particular retailer, the day of the week, the time of the year, and movements in competitors’ prices, to name a few. While this multitude of changing factors makes the modeling hard, the most challenging aspect of the problem is that you never have full visibility into price changes. When you crawl a page you can see that the price has changed, but not how many times it changed since your last crawl.
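One standard way to handle this partial visibility, sketched below under the assumption that price changes follow a Poisson process, is the estimator of Cho and Garcia-Molina, which corrects for changes that happen unseen between crawls. The function and its numbers are illustrative, not our production model.

```python
import math

def estimate_change_rate(visits, changes_observed, interval_hours):
    """Estimate a Poisson change rate from incomplete observations.

    We only see *whether* the price changed between two visits, not how
    many times it changed in between. Under a Poisson model, the
    Cho/Garcia-Molina estimator corrects for those unseen changes:
    lambda = -log((n - X + 0.5) / (n + 0.5)) / interval.
    """
    n, x = visits, changes_observed
    return -math.log((n - x + 0.5) / (n + 0.5)) / interval_hours

# A price that looked different on 40 of 50 daily visits actually changes
# more often than the naive 40/50-per-day estimate suggests, because some
# flips (e.g. up and back down) happen entirely between crawls.
rate = estimate_change_rate(visits=50, changes_observed=40, interval_hours=24)
```

The estimated rate can then rank URLs by expected staleness, so the request budget goes to pages most likely to have changed.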
The second is determining whether you are about to crawl content you have already crawled. Retail websites have a link structure that will cause your crawlers to revisit pages with the same content under different URLs. So a crawling strategy that simply follows every link on a page (keeping visited links in a distributed Bloom filter) yields very little new information when crawling under a budget. Of course, you can tell you have already visited a page by looking at its content, but the challenge is to determine this beforehand and skip the request entirely.
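A minimal sketch of that idea: normalize each URL before consulting the Bloom filter, so content-equivalent variants collide on the same key. The list of ignored query parameters here is a hand-picked assumption; in practice it is site-specific and would itself have to be learned.

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl, urlencode

# Parameters that often vary without changing page content. This list is
# hypothetical; real sites each need their own (ideally learned) list.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "ref", "sort"}

def canonicalize(url):
    """Normalize a URL so content-equivalent variants map to one key."""
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k.lower() not in IGNORED_PARAMS)
    return f"{parts.scheme}://{parts.netloc.lower()}{parts.path.rstrip('/')}?{urlencode(query)}"

class BloomFilter:
    """A minimal in-memory Bloom filter; at crawl scale this would be
    sharded and distributed."""
    def __init__(self, size=1 << 20, hashes=5):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

seen = BloomFilter()
seen.add(canonicalize("http://shop.example.com/p/123?color=red&sessionid=abc"))
# The same product page reached with a different session and parameter
# order canonicalizes to the same key, so the request can be skipped:
duplicate = canonicalize("http://shop.example.com/p/123?sessionid=xyz&color=red") in seen
```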
A more abstract way to describe the two problems above is that we want to estimate the value added by crawling a particular URL: predicting which prices are likely to change and predicting whether a page has been crawled recently are both parts of that general problem. There are additional aspects to it as well. For instance, being able to determine whether a link points to a listing page, a product page, or a page with the return policy can have a big impact on how effectively a crawling budget is used. A listing page gives you fresh pricing information on a number of products. A product page gives you pricing information as well as other information on a particular product (the value of that information has to be weighed too, since reviews get added and product details get updated). A page detailing the return policy simply wastes an HTTP request.
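As a toy sketch of the page-type idea, here is a URL classifier with hand-picked token weights standing in for a learned model. In practice the weights would come from something like logistic regression over labeled URLs; the tokens and weights below are made up for illustration.

```python
import re

# Hand-tuned token weights as a stand-in for a learned model; real weights
# would be fitted from labeled URLs and would differ per retailer.
PAGE_TYPE_TOKENS = {
    "product": {"product": 2.0, "dp": 2.0, "item": 1.5, "sku": 1.5},
    "listing": {"category": 2.0, "search": 1.5, "browse": 1.5},
    "other":   {"help": 2.0, "returns": 2.0, "policy": 2.0, "about": 1.5},
}

def classify_url(url):
    """Score a URL's tokens against each page type; highest score wins."""
    tokens = [t.lower() for t in re.split(r"[/\-_.?=&]", url) if t]
    scores = {
        page_type: sum(weights.get(t, 0.0) for t in tokens)
        for page_type, weights in PAGE_TYPE_TOKENS.items()
    }
    return max(scores, key=scores.get)
```

The appeal of classifying from the URL alone is precisely that it spends no HTTP request: the page type is predicted before deciding whether the page is worth fetching.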
Once pages from retail websites have been crawled, the next step is parsing fields like the title, sale price, list price, and description out of the HTML. Product pages on many retail websites are rich in information these days; there can be as many as thirty important pieces of information on a single product page, all of which we want to capture. Conceptually, all of it can be captured if the right CSS selectors are identified for each field of interest. If this task were limited to a small number of retail websites, it would be feasible to write such parsers manually and update them whenever a site's layout changes. Given the large number of retail websites we want to process, however, it is not feasible to generate or maintain such parsers by hand.
This is an area where machine learning can make significant contributions. Predicting which DOM node on a page corresponds to which field of interest (such as the title or price) with a high level of accuracy remains an open problem. Such a model can clearly benefit from visual features (position, size, color) extracted from the rendered HTML page using a headless browser.
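To make the idea concrete, here is a sketch of how a single DOM node, as dumped by a headless browser, might be turned into features for such a classifier. The node schema, field names, and thresholds are assumptions for illustration, not a real rendering pipeline.

```python
import re

def node_features(node):
    """Turn one rendered DOM node into features for a field classifier.

    `node` is assumed to be a dict from a (hypothetical) headless-browser
    dump, with the tag name, visible text, pixel geometry, and CSS class.
    """
    text = node.get("text", "")
    return {
        "tag": node["tag"],
        "has_currency": bool(re.search(r"[$£€]\s?\d", text)),
        "is_numeric_heavy": sum(c.isdigit() for c in text) > len(text) / 2,
        "text_length": len(text),
        "font_size": node.get("font_size", 0),
        "above_fold": node.get("y", 0) < 600,  # assumed viewport height
        "class_hints_price": "price" in node.get("class", "").lower(),
        "class_hints_title": "title" in node.get("class", "").lower(),
    }

features = node_features({"tag": "span", "text": "$24.99", "x": 310, "y": 220,
                          "width": 80, "height": 24, "font_size": 22,
                          "class": "price-tag"})
```

A model trained on such features can generalize across retailers, which is exactly what hand-written per-site CSS selectors cannot do.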
Given the extracted semi-structured information on a product page, the task of product categorization is to identify the type of the product. While this can be seen as a canonical multi-class classification problem, there are three key challenges. The first is that there is a very large number of product types (running into the hundreds of thousands), and the data available for some products may be limited, noisy, or missing entirely; data from online marketplaces in particular tends to be of fairly poor quality. The second is that the taxonomy of products is not well defined (categories may overlap) and is fairly dynamic, with new products and product types being introduced all the time. The challenge for machine learning here is to automatically discover new product types and re-categorize products into more specific types as needed. The third is that diverse sources of information (titles, descriptions, images, related products, reviews, and so on) must be leveraged together to deal with the missing data, noisy data, and class skew that make the problem hard.
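As a deliberately small baseline for title-based categorization, here is a multinomial naive Bayes classifier built from scratch. Real categorization over hundreds of thousands of types needs hierarchical models and far richer features than title tokens; the training data below is invented.

```python
import math
from collections import Counter, defaultdict

class TitleClassifier:
    """Multinomial naive Bayes over title tokens with Laplace smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, titles, labels):
        for title, label in zip(titles, labels):
            self.class_counts[label] += 1
            for token in title.lower().split():
                self.token_counts[label][token] += 1
                self.vocab.add(token)

    def predict(self, title):
        tokens = title.lower().split()
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log prior + smoothed log likelihood of each token
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

clf = TitleClassifier()
clf.fit(["mens running shoes size 10", "leather running shoes",
         "stainless steel chef knife", "paring knife 4 inch blade"],
        ["footwear", "footwear", "kitchen", "kitchen"])
```

Even this toy illustrates the class-skew and missing-data issues: unseen tokens contribute only smoothed mass, so sparse or noisy titles quickly degrade the prediction.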
Given the product type and all the semi-structured information from a product page, the task of attribute extraction is to generate a list of attribute-value pairs specific to that product type. This involves processing a diverse set of data (descriptions, tables, reviews, images, and so on). While many attribute-value pairs can be extracted from the tables on the product page itself, standardizing the keys, the values, and the units of the values is challenging. Some attribute-value pairs need to be derived from descriptions, some from product reviews, and the most nuanced information from the product images. The hardest part of the problem is that there are so many attributes that it is not feasible to define the values of interest for each product type by hand. The challenge for machine learning is to discover and standardize the keys, and then to derive their values (with units) from the diverse set of semi-structured fields.
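A sketch of the unit-standardization step for measurements, assuming a small hand-written unit table. The table itself illustrates the core difficulty: in reality the keys and units must be discovered per product type, not enumerated by hand.

```python
import re

# A tiny alias table mapping surface units to (base unit, conversion factor).
# Hand-written here for illustration; a real system has to learn this.
UNIT_ALIASES = {
    "oz": ("ounce", 1.0), "ounce": ("ounce", 1.0), "ounces": ("ounce", 1.0),
    "lb": ("ounce", 16.0), "lbs": ("ounce", 16.0), "pound": ("ounce", 16.0),
    "cm": ("cm", 1.0), "mm": ("cm", 0.1), "in": ("cm", 2.54), "inch": ("cm", 2.54),
}

def extract_measurements(text):
    """Pull `<number> <unit>` pairs out of free text, normalized to base units."""
    results = []
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", text):
        alias = UNIT_ALIASES.get(unit.lower())
        if alias:
            base_unit, factor = alias
            results.append((round(float(value) * factor, 3), base_unit))
    return results

pairs = extract_measurements("Weight: 1.5 lbs, blade length 4 in")
```

Normalizing to base units is what makes values comparable across retailers, so that "1.5 lbs" on one site and "24 oz" on another resolve to the same attribute value.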
Matching identical products across different retail websites is perhaps the hardest and most impactful problem in this space. The challenges emerge from noisy data, confounding ground truth, and the sheer volume of data. The confounding-ground-truth problem, wherein two products appear identical (a pack of 3 pens vs. a pack of 6, or color variants of the same product), is ubiquitous in matching. As with classification and attribute extraction, different types of data (images, descriptions, reviews, and related products) need to be leveraged to solve this problem well. The other, more important problem is scale: even with an automated way to match a pair of products, it is not feasible to compare every product with every other product. And since new products and retail websites are constantly being added, the matching has to be re-run frequently.
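One common way to avoid the all-pairs comparison is blocking: compute a MinHash signature over character shingles of each title, and only run the expensive pairwise matcher on products whose signatures look similar. A minimal sketch, not our production pipeline:

```python
import hashlib

def shingles(title, k=3):
    """Character k-grams of a whitespace-normalized, lowercased title."""
    t = " ".join(title.lower().split())
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """MinHash signature: for each hash function, the minimum hash over the set."""
    return [
        min(int.from_bytes(hashlib.sha256(f"{i}:{s}".encode()).digest()[:8], "big")
            for s in items)
        for i in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig1 = minhash_signature(shingles("Apple iPhone 7 32GB Black"))
sig2 = minhash_signature(shingles("apple iphone 7 32 gb black"))
sig3 = minhash_signature(shingles("Stainless Steel Chef Knife"))
```

In a real pipeline the signatures would be banded (as in locality-sensitive hashing) so candidate pairs fall out of a grouping step rather than pairwise signature comparison, which keeps the cost near-linear in the number of products.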
At Indix, we have made significant progress on each of these problems, and we continue to push the limits. Automation is a central aspect of our engineering culture, and it requires us to keep pushing the boundaries of applied machine learning.
Also published on Medium.