The Indix Cloud Catalog is the world’s largest collection of programmatically accessible structured product information in the cloud. It covers close to 25 verticals, which translate to approximately 7,500 sub-categories. Every product that we carry in our database gets stamped with information about the “category” it belongs to. This problem of classifying a product into a particular category is important for a variety of use cases – powering search, performing product matching, providing category-specific insights, and so on.
Indix houses products across a wide spectrum of verticals – ranging from apparel, outdoor, furniture, and automotive products, to art-related items and books. The challenges in “stamping” every product we collect with a specific “category” come both from the breadth of the category hierarchy – which we call the taxonomy – and from the stores we collect the data from. It’s important to note that new products keep getting added, as do new types of products. And, to top it all off, each store publishes the same product in different ways.
Broadly, the set of problems to solve are:
– What does the taxonomy look like – who owns and keeps track of it?
– How do we classify a product into (one or more) node(s) in the above taxonomy?
Right from the beginning, we knew this needed machine learning, and that we had to do it at scale – both in the training phase and in the prediction phase. The classifier system at Indix has evolved over the course of the last few years, and we are going to explore various aspects of that evolution through a series of posts here.
Part 1 – Provide an overview of the problem and things that we’ve done until now.
Part 2 – Dive deeper into various parts of the puzzle and provide a block-level peek into the implementation and productionization.
Part 3 – Challenges in productionizing and maintaining large hierarchical taxonomy-based systems in production.
Solving the classification problem via machine learning involves the laborious and expensive process of collecting training data – a hard problem in itself at this scale. Generally, it is done by manual labeling, crowdsourcing, or leveraging an existing labeled corpus.
We’ve taken a hybrid approach of choosing an existing corpus and overlaying some rules that fit our understanding of the domain to come up with a bigger corpus of data for our model building phase.
Initially, to bootstrap our systems, we found writing rules to be the most practical approach. For our problem space and the kind of data we handle, the rules were nothing but regular expressions. A two-member team was responsible for building up this rule base and constantly monitoring and fixing it. This gave us very good results at first, but as we expanded to more categories, we ran into a lot of issues with the approach.
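To make the rule-based phase concrete, here is a minimal sketch of such a classifier. The categories and regular expressions below are illustrative stand-ins, not Indix’s actual rule base:

```python
import re

# Hypothetical rule base: each category maps to a list of regular
# expressions matched against the product title.
RULES = {
    "Apparel > Footwear": [r"\b(sneakers?|running shoes?|boots?)\b"],
    "Electronics > Audio": [r"\b(headphones?|speakers?)\b"],
    "Furniture > Seating": [r"\b(armchair|office chair|sofa)\b"],
}

def classify(title):
    """Return the first category whose rules match the title, else None."""
    text = title.lower()
    for category, patterns in RULES.items():
        for pattern in patterns:
            if re.search(pattern, text):
                return category
    return None

print(classify("Nike Air Zoom running shoes, size 10"))
# Apparel > Footwear
```

Rules like these are cheap to write and easy to debug, but every new category or store quirk means another pattern to author and maintain by hand – which is exactly the ceiling described below.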
Within a few months, we hit the ceiling of what we could get out of the rule-based classifier. With those learnings, we started evaluating next steps. While we had a decent idea of how to approach model building, we were really short of training data, so we spent a few weeks coming up with a practical way to generate reliable training data. Some cheeky automation got us a sizeable volume of it. What this meant, though, was that our existing taxonomy became redundant: to productionize a classifier based on this new training dataset, we would have to introduce a fully new taxonomy. We took the plunge, as it was the right thing to do.
Training dataset size: ~10 million records
Number of labels: ~8,000 leaf-level nodes
Taxonomy by its very nature is hierarchical. One of the key decisions we had to make when building this version was how we wanted to structure the classifier. There were multiple approaches possible:
– Build a single flat classifier that predicts the leaf-level node directly.
– Build a cascade of classifiers, one at every internal node of the taxonomy.
– Build a top-level classifier for the verticals, plus one leaf-level classifier per vertical.
We decided to go with the third approach – a top-level classifier plus per-vertical leaf-level classifiers – as it struck a reasonable balance between time to deliver, model complexity (and maintenance), and debuggability. We used scikit-learn for the model building. The learner was built in a generic way, so we could define the learning strategy as any of the three approaches mentioned above and feed the dataset into it. The system would take care of preparing the dataset for model building and validation, build the models, and generate precision/recall numbers and commonly confused category pairs as reports.
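A minimal sketch of what such a scikit-learn learner might look like for the top-level classifier – TF-IDF features over product titles with a One-vs-Rest wrapper, as described above. The toy titles and labels are illustrative only, and this is one plausible shape of the pipeline, not Indix’s actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins for product titles and their top-level category labels.
titles = [
    "mens running shoes size 10", "leather ankle boots",
    "noise cancelling wireless headphones", "4k ultra hd smart tv",
    "solid oak dining table", "ergonomic office chair",
]
labels = [
    "Apparel", "Apparel",
    "Electronics", "Electronics",
    "Furniture", "Furniture",
]

# One-vs-Rest over TF-IDF features: one binary model per label.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(titles, labels)

print(model.predict(["solid oak dining table with four chairs"])[0])
```

The same pipeline, pointed at a per-vertical slice of the data with leaf-level labels, would serve as one of the leaf classifiers.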
Each model was ~1 gigabyte in size, given that these models were One-vs-Rest. Model size was important to us because we expose our models via a microservices architecture, and our deployment model was to co-locate the machine learning services with the worker nodes of our Hadoop infrastructure to reduce network hops for prediction calls. After a couple of rounds of training and evaluation, we arrived at a respectable F-score (~0.7) for the top-level classifier, though we saw high variance (30%) in the F-scores at the leaves. Compared to the rule-based classifier, this was definitely a step in the right direction, so we went ahead and productionized it.
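Since the setup above involves a top-level classifier and leaf-level models, the prediction path is a two-stage cascade. Here is a minimal sketch of that routing, with the models stubbed out as simple callables (in production each would be a serialized model behind a service):

```python
# Stage 1 stub: pretend the top-level classifier recognized footwear terms.
def top_level_model(title):
    return "Apparel" if "shoe" in title.lower() else "Electronics"

# Stage 2 stubs: one leaf-level classifier per top-level category.
LEAF_MODELS = {
    "Apparel": lambda t: "Apparel > Footwear > Running Shoes",
    "Electronics": lambda t: "Electronics > Audio > Headphones",
}

def predict(title):
    vertical = top_level_model(title)    # stage 1: route to a vertical
    return LEAF_MODELS[vertical](title)  # stage 2: leaf-level prediction

print(predict("trail running shoes"))
# Apparel > Footwear > Running Shoes
```

A nice property of this shape is that a worker only needs the top-level model plus the leaf models it actually routes to, which matters when each model weighs in around a gigabyte.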
We did find a lot of issues with this version of the classifier, and made some fixes and model improvements along the way. However, that’s a story for another time. Keep watching this space, as we will cover these topics in upcoming posts.
Also published on Medium.