Using AI & ML for Product Data Classification at Indix

Machine learning and artificial intelligence are transforming every facet of our lives. At Indix, we use them to make product data more accessible and actionable. Our mission is to gather, structure, and provide access to the world’s product information, so everyone can act on it. We gather, or ingest, product data via crawls and feeds. At present, we have more than 1.5 billion product records in our system across different stores. This gathered data varies in quality and depth. Our AI/ML techniques help us standardize it to a common format, which makes it more actionable. Today, I want to give an overview of how we use these techniques for the classification of product data.

The Indix AI and ML Data Pipeline

The Different Stages in Our Data Pipeline

  • Dedup – Sometimes, a store carries the same product under different URLs. This stage ensures that there is a single reference for a product at a given store.
  • Classify – Every store has its own product taxonomy. Our classification pipeline ensures that all 1.5+ billion product records fall under a single taxonomy.
  • Extract attributes – Product attributes are not always present in a proper spec table. They are available as part of the title, description, and other types of information we capture from the product page. This stage enables us to extract category-specific product attributes.
  • Standardize – Every store we crawl has its own way of representing product attributes and values. In this stage, we standardize attribute keys and values to the Indix product schema.
  • Matcher – Oftentimes, the same product is sold across different stores with different information. This pipeline helps us match similar products across stores.
  • Aggregate – In this stage, we fuse product and price information across stores to generate a product catalog database.
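The stages above can be pictured as a chain of record-transforming functions. The sketch below is a toy illustration, not Indix's actual implementation; the field names and dedup/classify rules are hypothetical stand-ins for much richer logic:

```python
def dedup(records):
    # Keep a single reference per (store, normalized title) pair,
    # so URL variants of the same product collapse into one record.
    seen = {}
    for r in records:
        key = (r["store"], r["title"].lower())
        seen.setdefault(key, r)
    return list(seen.values())

def classify(records):
    # Stand-in for the real classifier: a trivial keyword rule.
    for r in records:
        r["category"] = "Shoes" if "shoe" in r["title"].lower() else "Unknown"
    return records

def run_pipeline(records, stages):
    # Each stage takes and returns a list of product records.
    for stage in stages:
        records = stage(records)
    return records

records = [
    {"store": "storeA", "title": "Running Shoe"},
    {"store": "storeA", "title": "running shoe"},  # same product, different URL
]
result = run_pipeline(records, [dedup, classify])
print(len(result), result[0]["category"])  # 1 Shoes
```

The later stages (extract, standardize, match, aggregate) would slot into the same `stages` list as further record-to-record transforms.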

Product Classification

Now that we have an overview of the pipeline, let’s go deeper into the classification process. The corpus of data we house in our Cloud Catalog is enormous. Annotating the product type correctly is key to structuring this data.

Here are a few common use cases enabled through proper classification:

  • Category navigation for stores: Customers who take product feeds from us use our classification information to build category navigation on top of the data.
  • Search result filtering: If a consumer searches for products by keyword, categorization can be used to enable further filtering.
  • Product catalog: Categorization is critical for defining product attributes. For instance, the attributes of a shoe (upper sole material, width, color, etc.) will be different from the attributes of a laptop (RAM, hard disk capacity, screen size, etc.)
  • Product comparisons: Categorization helps to compare similar products by clubbing them into appropriate groups.
  • Analytics: Generating category-specific reports and trend analysis.

Classification Challenges

The primary challenge: there is no universal taxonomy to structure product data. Every store has a different product taxonomy with varying levels of depth. Some are generic and some are too specific (in case of category-specific sites – see example below). Also, on marketplaces, the same product can be a part of more than one category. A running shoe can be a part of the Shoes or Sports & Outdoor category. Sites do this for their own internal reasons, but it challenges the ability to derive a standard categorization.

The data quality and depth of information vary across products on marketplaces dealing with millions of products. In the example above, a women’s pump shoe is categorized as a running shoe. This is not anecdotal, but rather a common occurrence on marketplaces.

Classification – Taxonomy

When planning our Cloud Catalog, we explored product taxonomies like GS1, Google Product Taxonomy, Amazon etc. and finally decided to build our own. Today, we have 23 top-level categories and 7000+ leaf-level categories.

As this word cloud shows, Clothing & Accessories is one of the biggest ecommerce categories. Compared to Clothing & Accessories, categories like Video Games have far fewer products. One key thing to note is that the taxonomy is not fixed and is subject to change. As new categories of products emerge, we expand our taxonomy. For example, “tablet” was not originally a category in its own right, but today “tablets” is available as a leaf-level category across stores. As products evolve, we revisit and update our taxonomy about once a year.
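One simple way to picture a two-level taxonomy like this is as leaf categories keyed by their full path, with the first path segment being one of the top-level categories. The category names below are illustrative, not the actual Indix taxonomy:

```python
# Hypothetical slice of a product taxonomy: leaf categories keyed by path.
TAXONOMY = {
    "Clothing & Accessories > Shoes > Running Shoes": "running-shoes",
    "Clothing & Accessories > Shoes > Pumps": "pumps",
    "Electronics > Computers > Tablets": "tablets",
}

def top_level(path):
    # The first path segment is the top-level category.
    return path.split(" > ")[0]

print(top_level("Electronics > Computers > Tablets"))  # Electronics
```

Adding a new leaf (as happened with “tablets”) is then just a new path entry, which is one reason a flexible, revisable taxonomy is practical to maintain.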

Classification – Input Signals

Almost every single piece of information available on a product page can be used as a signal to identify product classification.

  • Titles usually have rich information and unique features that help classify the product.
  • Breadcrumbs are the next best source for classification, but they are hard to rely on because many sites don’t have them.
  • Product images are excellent for predicting the product category. The challenge is that they can be rotated or shot from different angles, and getting a training set covering all possible combinations is a hurdle.
  • Descriptions and attributes are also helpful. For instance, an attribute or description mentioning RAM or hard disk size can guide a product toward the right category.
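A text classifier can consume these signals by concatenating whichever ones a product page actually provides into a single feature string. A minimal sketch, with hypothetical field names:

```python
def build_signal_text(product):
    # Concatenate the available text signals; missing fields are skipped,
    # which matters because many pages lack breadcrumbs or attributes.
    parts = [
        product.get("title", ""),
        product.get("breadcrumb", ""),
        product.get("description", ""),
        " ".join(f"{k} {v}" for k, v in product.get("attributes", {}).items()),
    ]
    return " ".join(p for p in parts if p).lower()

p = {"title": "Acer Laptop", "attributes": {"RAM": "8GB"}}
print(build_signal_text(p))  # acer laptop ram 8gb
```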


The above example is an ideal scenario with rich product content, but unfortunately this is not the norm. We see varying degrees of information depth across stores.

Classification – Workflow

Above, you can see what our text classification workflow looks like. Over time, we have built training datasets on the order of a few million records. We pre-process and normalize this data; normalization steps include stemming, stop-word removal, and tf-idf weighting. We then split the data into train and test sets. Our model is a linear SVM trained in a one-vs-rest setup. We have built two levels of classification models:

  • Top-level classifier
  • Leaf-level classifier
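A minimal sketch of this workflow with scikit-learn: tf-idf features with stop-word removal feeding a linear SVM, which scikit-learn trains one-vs-rest by default. The tiny training set here is purely illustrative; as noted above, the real datasets are on the order of millions of records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative (title, leaf category) training pairs.
titles = [
    "mens running shoe lightweight mesh",
    "womens leather pump shoe high heel",
    "gaming laptop 16gb ram 512gb ssd",
    "ultrabook laptop 8gb ram 13 inch screen",
    "wireless gaming mouse rgb",
    "ergonomic optical mouse usb",
]
labels = ["Shoes", "Shoes", "Laptops", "Laptops", "Mice", "Mice"]

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # tf-idf with stop-word removal
    LinearSVC(),                            # linear SVM, one-vs-rest by default
)
model.fit(titles, labels)

print(model.predict(["budget laptop 4gb ram"])[0])  # Laptops
```

In practice this would be done twice, once for the top-level classifier and once for the leaf-level classifier, with a held-out test split used to compute precision/recall.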

Training the model lets us predict on the test set and generate precision/recall numbers from the confusion matrix. We use scikit-learn for learning, and Bottle and Docker for serving classification predictions. Docker helps us scale prediction across the corpus.

We have both online and offline prediction: offline for all existing products, and online only for new records entering our system. We use an ensemble of the different predictions to choose the final leaf category for each product. The next key step after classification is to identify and match products sold across different stores. Watch this space for more on the matching story.
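One common way to combine several predictions into a final leaf category is confidence-weighted voting, sketched below. The weighting scheme is an assumption for illustration, not necessarily the ensemble Indix uses:

```python
from collections import Counter

def ensemble_vote(predictions):
    # predictions: (leaf_category, confidence) pairs from different
    # predictors, e.g. title-based, breadcrumb-based, and image-based.
    weights = Counter()
    for category, confidence in predictions:
        weights[category] += confidence
    # Return the category with the highest total confidence.
    return weights.most_common(1)[0][0]

preds = [("Running Shoes", 0.9), ("Pumps", 0.4), ("Running Shoes", 0.7)]
print(ensemble_vote(preds))  # Running Shoes
```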

This content was delivered as part of our webinar on using AI and ML in structuring product information. If you missed it, you can watch the recording here.



Also published on Medium.
