Machine learning and artificial intelligence are transforming every facet of our lives. At Indix, we use them to make product data more accessible and actionable. Our mission is to gather, structure, and provide access to the world’s product information, so everyone can act on it. We gather, or ingest, product data through crawls and feeds. At present, we have more than 1.5 billion product records in our system across different stores. This data varies in quality and depth. Our AI/ML techniques help us standardize it to a common format, which makes it more actionable. Today, I want to give an overview of how we use these techniques to classify product data.
Now that we have an overview of the pipeline, let’s go deeper into the classification process. The corpus of data we house in our Cloud Catalog is enormous. Annotating the product type correctly is key to structuring this data.
Here are a few common use cases enabled through proper classification:
The primary challenge is that there is no universal taxonomy for structuring product data. Every store has a different product taxonomy with varying levels of depth. Some are generic and some are very specific (in the case of category-specific sites, see example below). Also, on marketplaces, the same product can be part of more than one category. A running shoe can sit in either the Shoes or the Sports & Outdoor category. Sites do this for their own internal reasons, but it makes deriving a standard categorization harder.
Data quality and depth of information vary across products on marketplaces that deal with millions of products. In the example above, a women’s pump shoe is categorized as a running shoe. This is not an isolated case, but a rather common occurrence on marketplaces.
When planning our Cloud Catalog, we explored existing product taxonomies such as GS1, Google Product Taxonomy, and Amazon, and finally decided to build our own. Today, we have 23 top-level categories and 7,000+ leaf-level categories.
As this word cloud shows, Clothing & Accessories is one of the biggest ecommerce categories. By comparison, categories like Video Games have far fewer products. One key thing to note is that the taxonomy is not fixed. As new categories of products emerge, we expand our taxonomy. For example, “tablet” was once not a category in its own right, but today “tablets” is a leaf-level category across stores. As products evolve, we revisit and update our taxonomy about once a year.
Almost every single piece of information available on a product page can be used as a signal to identify product classification.
The above example is an ideal scenario where we have good product content, but unfortunately this is not the norm. The amount and depth of information available for a product vary widely across stores.
Above, you can see what our text classification workflow looks like. Over time, we have built training datasets on the order of a few million records. We pre-process and normalize this data. The pre-processing steps include stemming, stop-word removal, and tf-idf weighting. We then split the data into training and test sets. Our model is a linear SVM trained in a one-vs-rest setup. We have built two levels of classification models.
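To make the workflow above concrete, here is a minimal sketch using scikit-learn. It is an illustration under assumptions, not our production pipeline: the toy titles and category names are made up, the function name train_category_model is hypothetical, stemming is left out (scikit-learn has no built-in stemmer), and only a single level of classification is shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data: (product title, category). Real training sets are far larger.
TOY_DATA = [
    ("Nike Air Zoom running shoes for men", "Shoes"),
    ("Adidas women's running shoes lightweight", "Shoes"),
    ("Leather ankle boots for women", "Shoes"),
    ("Canvas sneakers shoes low top", "Shoes"),
    ("Samsung Galaxy Tab 10 inch tablet 32GB", "Tablets"),
    ("Apple iPad tablet with wifi 64GB", "Tablets"),
    ("Android tablet 8 inch quad core", "Tablets"),
    ("Kids tablet with case 16GB", "Tablets"),
    ("Cotton crew neck t-shirt for men", "Clothing"),
    ("Women's slim fit denim jeans", "Clothing"),
    ("Men's cotton hooded sweatshirt", "Clothing"),
    ("Women's cotton summer dress", "Clothing"),
]


def train_category_model(texts, labels, seed=0):
    """Split the data, fit tf-idf + linear SVM, return (model, x_test, y_test)."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = make_pipeline(
        # Lowercases, drops English stop words, and applies tf-idf weighting.
        TfidfVectorizer(lowercase=True, stop_words="english"),
        # LinearSVC handles multiclass problems one-vs-rest by default.
        LinearSVC(),
    )
    model.fit(x_train, y_train)
    return model, x_test, y_test


texts = [title for title, _ in TOY_DATA]
labels = [category for _, category in TOY_DATA]
model, x_test, y_test = train_category_model(texts, labels)
print(model.predict(["lightweight running shoes", "8 inch android tablet"]))
```

Wrapping the vectorizer and the classifier in one pipeline means the same pre-processing is applied at training and prediction time, which is easy to get wrong when the two steps are kept separate.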
We run the trained model on the test set and compute precision and recall numbers from the confusion matrix. We use scikit-learn for training, and Bottle and Docker for serving the classification predictions. Docker helps us scale the prediction across the corpus.
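As a small illustration of how per-category precision and recall fall out of a confusion matrix, here is a sketch in plain Python. The matrix values and category names are invented for the example, and precision_recall is a hypothetical helper, not a library function.

```python
def precision_recall(confusion, labels):
    """Return {label: (precision, recall)}.

    confusion[i][j] is the number of items whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    scores = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column sum
        actually_k = sum(confusion[k])                     # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        scores[label] = (precision, recall)
    return scores


# Invented numbers, 10 test items per true category.
LABELS = ["Shoes", "Tablets", "Clothing"]
CONFUSION = [
    [8, 0, 2],  # true Shoes
    [0, 9, 1],  # true Tablets
    [2, 1, 7],  # true Clothing
]

for label, (precision, recall) in precision_recall(CONFUSION, LABELS).items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Reading the matrix by column gives precision (of everything predicted as a category, how much was right), and reading it by row gives recall (of everything that truly belongs to a category, how much was found).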
We run prediction both online and offline. Existing products are classified offline, and only new records entering our system are classified online. We use an ensemble of the different predictions to choose the final leaf category for our products. The next key step after classification is to identify and match products sold across different stores. Watch this space for more on the matching story.
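One simple way to combine several predictions into a final leaf category is a weighted vote. The sketch below assumes that each model returns a category together with a confidence score; that assumption, the function name ensemble_category, and the example values are all hypothetical, and other combination schemes are equally possible.

```python
from collections import defaultdict


def ensemble_category(predictions):
    """Pick a final category from (category, weight) pairs, one per model.

    Weights for the same category are summed; the highest total wins.
    Ties are broken alphabetically so the result is deterministic.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    totals = defaultdict(float)
    for category, weight in predictions:
        totals[category] += weight
    return min(totals, key=lambda category: (-totals[category], category))


# Hypothetical outputs from three models for the same product.
votes = [("Running Shoes", 0.6), ("Pumps", 0.9), ("Running Shoes", 0.5)]
print(ensemble_category(votes))
```

Here two moderately confident votes for the same category outweigh a single more confident vote for another, which is the usual reason to prefer an ensemble over any one model's top prediction.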
This content was delivered as part of our webinar on using AI and ML in structuring product information. If you missed it, you can watch the recording here.
Also published on Medium.