Standardization in general ensures the smooth operation of processes and builds credibility over time. Best practices ensure efficiency and reduce redundancy. Today, the amount of data generated in the world is increasing by the second. As we collect and store this data, it is critical to standardize and normalize it for optimal usage. Otherwise, it can get noisy.
At Indix, we are building the world’s broadest and deepest product database. Our mission is to organize, analyze, and visualize the world’s product information so that everyone can act on it. As commerce becomes ubiquitous, there is an imminent need for a universal product catalog.
The goal is to have a unique identifier for every single product in the world. So whenever a product is looked up by an end user, all the offers and catalog data attached to that product show up in the same place. Think of it as GPS coordinates for a product, if you will.
Offers data refers to all the information related to the sale of a product: price, number of stores where it is sold, promotions, channels, availability, and so on. Catalog data refers to the relatively unchanging data about a product, such as brand, dimensions, attributes, facets, and specifications.
One of the biggest challenges in building such a database is the lack of standardization in product data collected from across the Internet. Standardization is a must for avoiding redundancy and matching products accurately across stores, and its absence hinders many business systems. It is also a very hard problem to solve: all data needs to be converted to a predefined format, which requires domain expertise.
When collecting product data, both the key and the value in an attribute key-value pair need to be standardized. The standardization requirements vary by product category. Here are some examples to better understand this problem:
Take a bottle of shampoo for instance. Different websites may describe the weight of one bottle as 20 oz., 20 ounces, 1.25 lb, 567 gms, or 567 grams. The challenge is to recognize that a) these are all units of weight, and b) they are all the same quantity. Once this is determined, the value can be standardized according to a preset dictionary.
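A minimal sketch of this idea in Python: parse the quantity and unit out of a raw weight string and convert everything to grams. The conversion table and function name here are illustrative assumptions, not Indix's actual dictionary.

```python
import re

# Hypothetical conversion table: grams per unit. A production dictionary
# would cover many more spellings and unit families.
GRAMS_PER_UNIT = {
    "oz": 28.3495, "ounce": 28.3495, "ounces": 28.3495,
    "lb": 453.592, "lbs": 453.592, "pound": 453.592, "pounds": 453.592,
    "g": 1.0, "gm": 1.0, "gms": 1.0, "gram": 1.0, "grams": 1.0,
}

def normalize_weight(raw: str) -> float:
    """Parse a raw weight string like '20 oz.' and return grams."""
    match = re.match(r"\s*([\d.]+)\s*([a-zA-Z]+)", raw)
    if not match:
        raise ValueError(f"Unrecognized weight: {raw!r}")
    quantity, unit = match.groups()
    unit = unit.lower().rstrip(".")
    if unit not in GRAMS_PER_UNIT:
        raise ValueError(f"Unknown unit: {unit!r}")
    return float(quantity) * GRAMS_PER_UNIT[unit]
```

With a normalizer like this, "20 oz.", "1.25 lb", and "567 grams" all land within a gram of each other, so the matcher can treat them as the same quantity.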
Another example of an attribute that requires standardization is size. The same women's pump can be described as a US size 8, a UK size 6, or a European size 38 on different websites. All these sizes need to be standardized so that matching is accurate; once attributes are standardized, there is higher confidence that these different listings are actually the same product.
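Size standardization can be sketched as a lookup that maps every regional size to one canonical form. The choice of the US scale as canonical and these exact mappings are assumptions for illustration only.

```python
# Hypothetical conversion table for women's shoe sizes, keyed by
# (region, size). Real tables vary by brand and would be far larger.
WOMENS_SHOE_SIZES = {
    ("US", "8"): "US 8",
    ("UK", "6"): "US 8",
    ("EU", "38"): "US 8",
    # ...one entry per (region, size) pair seen in crawled data
}

def canonical_size(region: str, size: str) -> str:
    """Map a regional shoe size to its canonical form."""
    return WOMENS_SHOE_SIZES[(region.upper(), str(size))]
```

Once every listing's size is mapped through the same table, a US 8, a UK 6, and a European 38 all compare equal, which is exactly what the matcher needs.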
Along with values, keys also need to be standardized. For example, “color”, “colour” and “Clr” all denote the same property related to a product. Similarly, the value associated with the key also needs to be standardized. So, Blk is the same as Black, and so on.
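Key and value standardization can both be expressed as synonym lookups. A rough sketch, assuming hand-written synonym dictionaries (in practice these would be much larger and likely learned from data):

```python
# Hypothetical synonym dictionaries mapping raw spellings to canonical
# forms. Keys and values are lowercased before lookup.
KEY_SYNONYMS = {"colour": "color", "clr": "color", "color": "color"}
VALUE_SYNONYMS = {
    "color": {"blk": "black", "black": "black", "wht": "white"},
}

def standardize_pair(key: str, value: str) -> tuple[str, str]:
    """Canonicalize an attribute key-value pair, e.g. ('Clr', 'Blk')."""
    k = KEY_SYNONYMS.get(key.strip().lower(), key.strip().lower())
    value_map = VALUE_SYNONYMS.get(k, {})
    v = value_map.get(value.strip().lower(), value.strip().lower())
    return k, v
```

Note that the value dictionary is keyed by the canonical key, since "Blk" only means "black" once we know the attribute is a color.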
Standardization allows data to be used for extracting various insights, such as: does a purple shoe command a higher price premium than a red one? Or, which lamp finish is most popular on the market right now? In other words, it makes assortment intelligence and market intelligence more easily accessible.
This kind of standardization is easy to do manually, but manual standardization is not possible at scale for every product in the online universe. This is where machine learning and product data science come in. When dealing with data at this scale, we need to train machines to do the intelligent work. Sophisticated machine learning algorithms will help achieve this dream, and each and every product in the world will have a unique ID and live in one universal Product Intelligence database.
Also published on Medium.