In my previous post, we saw how Indix uses artificial intelligence and machine learning algorithms to classify (categorize) products across stores into a single uniform taxonomy. Today, let’s look at how we apply machine learning techniques to match them. Product matching is the process of determining whether products sold across different stores are in fact the same. An example would be identifying an Apple iPhone 6S 32 GB Black Unlocked Phone as the same product across Amazon, Target, and other stores. Product matching enables a number of common use cases downstream.
The fundamental challenge with matching is that there is no single source of truth. There are billions of products, and there is no absolute structure that defines how products should be identified across stores. UPC (Universal Product Code) is useful, but the majority of ecommerce sites do not expose it on their product display pages. This brings us to the next challenge: the depth of product information across stores. Not every store has well-formed titles, UPC or GTIN, MPN (manufacturer part number), and other rich attributes to help with matching. The final key challenge is scale: we have more than 1.5 billion product records, and to identify a match, a given document (product record) needs to be compared against records from all stores.
Consider, for example, matching the Nike Revolution 3 men’s running shoe across Amazon, JCPenney, and Zappos.
The product titles at Amazon and JCPenney are more similar to each other than to the Zappos title, which specifies neither the gender nor the product type. The primary product images across the stores are also different, so even with images it is difficult for a machine to determine that these products are the same. In this case, no individual signal can help us match these records across stores with confidence. We would also need deeper variant-level information like color, size, width, and gender for exact matching. There’s the added challenge of non-standardized attributes – like the color Black being represented as BLK.
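To make the weakness of any single signal concrete, here is a toy token-overlap (Jaccard) similarity between two hypothetical titles. This is only an illustration, not the similarity function used in production:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two product titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Hypothetical titles for the same shoe at two stores.
full_title  = "Nike Revolution 3 Men's Running Shoe Black"
short_title = "Nike Revolution 3"

# A sparse title drags the score down even for a true match.
print(round(jaccard(full_title, short_title), 2))  # 0.43
```

A threshold tuned for well-formed titles would reject this true match, which is why multiple signals (images, attributes, identifiers) have to be combined.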
In the example below, even the product title information is completely different across stores. The first screenshot is from a marketplace where the title is short and the quality of information is very poor. However, using the image and attributes, we can still match these records.
Also consider the scenario above. Neither the product title nor the image is the same between the two stores, making it very difficult to match the products based on the available data. The first screenshot is from a marketplace overloaded with information, and the second is from a category-specific store. One site uses the acronym “CATV” for Cable TV and “comp” for compression.
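One simple way to handle such store-specific abbreviations is a synonym map applied during normalization. The map below is hypothetical; a real pipeline would curate or learn these per category:

```python
# Hypothetical abbreviation-to-canonical-form map (illustrative only).
SYNONYMS = {
    "catv": "cable tv",
    "comp": "compression",
    "blk": "black",
}

def normalize_title(title: str) -> str:
    """Lowercase a title and expand known abbreviations token by token."""
    return " ".join(SYNONYMS.get(tok, tok) for tok in title.lower().split())

print(normalize_title("CATV comp fitting BLK"))  # cable tv compression fitting black
```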
Let’s take a look at how we tackle this issue at Indix. The matcher needs richer, deeper information to determine whether two products match. At Indix, we primarily rely on signals such as the product title, UPC, MPN, images, and structured product attributes.
We also process the title and description text to extract additional product data, like “size: 32 GB” (from the title) and “display size: 5.5 inches” (from the description). These extracted attributes help us in matching as well.
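As a minimal sketch of this kind of attribute extraction, assuming simple hand-written regex patterns (production systems typically use learned extractors rather than a fixed pattern list):

```python
import re

# Hypothetical extraction patterns, one per attribute name.
PATTERNS = {
    "size": re.compile(r"\d+\s*GB", re.IGNORECASE),
    "display size": re.compile(r"\d+(?:\.\d+)?\s*inch(?:es)?", re.IGNORECASE),
}

def extract_attributes(text: str) -> dict:
    """Return the first match for each known attribute pattern."""
    attrs = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(text)
        if m:
            attrs[name] = m.group(0)
    return attrs

text = "Apple iPhone 6S 32 GB Black Unlocked. Features a 5.5 inches Retina display."
print(extract_attributes(text))  # {'size': '32 GB', 'display size': '5.5 inches'}
```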
Due to the lack of training data, we use an unsupervised learning approach to identify matches across stores. Given the scale at which we operate, the key to matching records across stores is to reduce the number of products compared in a given set. One of the first stages in our matching is blocking (or bucketing), where we reduce the number of products we compare across stores; brand and category data are used as signals to bucket the products. Using category-specific thresholds, we then identify potential matches across stores, combining similarities between title, UPC, MPN, images, and product attributes to produce potential match pairs.
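A simplified sketch of the blocking step, assuming toy records keyed only by brand and category (the real system uses more signals and category-specific thresholds):

```python
from collections import defaultdict

# Toy product records; IDs and stores are illustrative.
products = [
    {"id": 1, "store": "Amazon", "brand": "nike",  "category": "shoes"},
    {"id": 2, "store": "Zappos", "brand": "nike",  "category": "shoes"},
    {"id": 3, "store": "Target", "brand": "apple", "category": "phones"},
]

# Bucket by (brand, category) so that only records within a bucket
# are ever compared pairwise, instead of comparing all records.
buckets = defaultdict(list)
for p in products:
    buckets[(p["brand"], p["category"])].append(p["id"])

print(dict(buckets))
```

The payoff is asymptotic: pairwise comparison inside buckets is quadratic in the bucket size, not in the corpus size, which is what makes billions of records tractable.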
Once the potential match pairs are generated, we run union-find to generate match groups. Usually at this stage, all the similar products fall within a given match group. Using offline algorithms, we generate Must Link (ML) and Cannot Link (CL) constraints between the product records: MLs are generated where document similarity is high, while CLs help us eliminate wrong matches. These constraints are generated at a category level, and all of them are used as signals by our fine matcher, which generates the final set of match groups. We primarily use RocksDB and Scalding for our matching, and almost everything happens in the reducer stage. The best part is that we are able to run this entirely in a matter of hours.
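The union-find step over potential match pairs can be sketched as follows, with hypothetical record IDs standing in for real product documents (the ML/CL constraint handling of the fine matcher is omitted):

```python
from collections import defaultdict

class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

# Potential match pairs from the coarse matcher (illustrative IDs).
pairs = [("amazon:1", "target:7"), ("target:7", "zappos:3"), ("amazon:2", "jcp:5")]

uf = UnionFind()
for a, b in pairs:
    uf.union(a, b)

# Collect connected components into match groups.
groups = defaultdict(set)
for record in uf.parent:
    groups[uf.find(record)].add(record)

print(sorted(sorted(g) for g in groups.values()))
```

Transitive links are resolved automatically: amazon:1 and zappos:3 end up in the same group even though they were never directly paired.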
In the case of the matcher, we only have offline predictions, generated on a weekly basis. We measure the quality of the matches using precision and recall metrics. Pairwise match clusters are generated across the corpus and sent for spot checking to estimate precision. Recall is harder to measure because of the lack of a single source of truth; here, we restrict the scope to a controlled set of products and stores and report our numbers with reference to that set.
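For a controlled evaluation set with known ground-truth pairs, pairwise precision and recall can be computed as in this sketch (the pair IDs below are made up):

```python
def precision_recall(predicted_pairs, true_pairs):
    """Pairwise precision/recall; pairs are unordered, so use frozensets."""
    predicted = set(map(frozenset, predicted_pairs))
    truth = set(map(frozenset, true_pairs))
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical predicted matches vs. spot-checked ground truth.
pred = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
gold = [("a1", "b1"), ("a2", "b2"), ("a4", "b4")]

p, r = precision_recall(pred, gold)
print(p, r)  # precision 2/3, recall 2/3
```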
At Indix, we ensure our algorithms are optimized for scale and precision. We have made significant progress on these problems, and we continue to raise the bar on classification and matching accuracy.
This content was delivered as part of our webinar on using AI and ML in structuring product information. If you missed it, you can watch the recording here.
Also published on Medium.