Analytics at Indix – Part 1

How Our Insights Engine Evolved Over Time

Like all startups, Indix started small. We had simple requirements. Then customers came in and accelerated our evolution. This three-part series (modeled on human evolution from Homo Habilis to Homo Erectus and finally to Homo Sapiens) describes the experiments we ran while developing the analytics platform at Indix over the last two years.


What Do We Do?

At Indix, we are building an infinite product catalog. We collect and process all data about products: information that changes frequently (like price, availability and promotions), and information that doesn’t change (like colour, size and specifications). All this information is surfaced to our customers in different ways (read: apps). Our flagship product intelligence application exposes different insights to our customers (retailers and brands). We also have an API which exposes similar information to third-party developers. These systems are backed by an Analytics store, which is the main topic of discussion here.

About Our Data

After we crawl and obtain the data, our data processing pipelines process and structure it for consumption. The granular data can be viewed as a giant table where each row represents a product item we call a variant. For instance, an iPhone is considered a product, while a 32GB white iPhone is a variant. Each row in the table has three primary dimensions (Brand, Store and Category) and some measures like list price, sale price and availability. Rows also carry other attributes of the product like colour and size. Each entity (variant, brand, store, category, etc.) is assigned an ID that is unique across our entire system. These IDs are assigned by our data processing pipelines.

Here is an example of data for a product which has three variants (color) and is sold in three different stores (S1, S2 and S3)


Stage 1: Homo Habilis (HBase backed Analytics Engine)

The initial requirement was simple. For a given dimension slice, we had to come up with a result that shows the aggregation of measures: pure, traditional OLAP. Our primary dimensions were brand, store and category (hierarchical), and our primary measures were prices (list and sale). Although we had more, we will limit ourselves to these in this post for brevity.
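To make "aggregation of measures for a dimension slice" concrete, here is a minimal in-memory sketch. The `Variant` fields, the sample rows and the chosen aggregates are illustrative assumptions, not our actual schema or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Variant:
    # Hypothetical row layout: three dimensions plus two price measures.
    variant_id: str
    brand_id: int
    store_id: int
    category_id: int
    list_price: float
    sale_price: float


def aggregate_slice(rows, **dims):
    """Aggregate price measures over rows matching every given dimension value."""
    matched = [r for r in rows
               if all(getattr(r, name) == value for name, value in dims.items())]
    if not matched:
        return None
    return {
        "count": len(matched),
        "min_sale_price": min(r.sale_price for r in matched),
        "max_sale_price": max(r.sale_price for r in matched),
        "avg_list_price": mean(r.list_price for r in matched),
    }


# Made-up sample rows, for illustration only.
ROWS = [
    Variant("v1", 300, 100, 10001, 20.0, 18.0),
    Variant("v2", 300, 101, 10001, 22.0, 21.0),
    Variant("v3", 301, 100, 10001, 30.0, 25.0),
]

if __name__ == "__main__":
    print(aggregate_slice(ROWS, brand_id=300))
    print(aggregate_slice(ROWS, brand_id=300, store_id=101))
```

A slice is just a set of dimension values combined with AND; the more dimensions you fix, the fewer rows feed the aggregate.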

We could have started with our good old friend, the RDBMS, with an OLAP system like Mondrian on top of it. However, our data pipelines were already using Hadoop, and HBase fit nicely on top of that. Also, HBase is good at GETs on large distributed tables. There were already some HBase-based OLAP stores (like DataCube, HBase Lattice and olap4cloud), but they were either at an early stage of development, abandoned, or did not support hierarchical dimensions. So, we chose to build our own.


We had two major HBase tables. The input table (called the priceDNA table) held the most granular items and was populated from the output of our data pipeline. Think of this table as a cube where each row is a product variant and every column is some information about that variant. The Z axis is the timeline: the information about that particular variant captured over time. The OLAP table can be thought of as a pre-calculated materialized table created by MapReduce (MR) jobs. It held the derived information for all available combinations of the primary dimensions. This table was called the Insights table.

Unlike an RDBMS, HBase is a key-value store. It is efficient at random GETs: you specify a key and HBase fetches the corresponding value. In the best case, the result of a query is a plain GET. But to support complex query types, we had to resort to HBase scans. HBase offers a battery of filters: create a scan object, add a filter, and HBase returns all the rows that match it. To efficiently pick the value for a complex brand-category-store (BCS) combination, we used two well-known tricks from the HBase world. First, we stored the dimensions as composite keys. Second, we stored multiple combinations of these dimensions. As the above figure shows, we had seven combinations: BCS, CB, SB, CS, B, C and S. The aggregates were calculated for all seven combinations and stored in the Insights table. So our composite key structure looked like this:

  1. bcs { brandId, root categoryId, storeId
  2. cb { root categoryId, brandId
  3. sb { storeId, brandId
  4. cs { root categoryId, storeId
  5. b { brandId
  6. c { root categoryId
  7. s { storeId

Here, '{' is a delimiter character that helps in constructing HBase prefix scans.
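The pre-computation over the seven key types can be sketched as follows. This is a simplified in-memory stand-in for the MR job: the exact key layout (`keytype{id{id...`), the field names and the stored aggregates (a count and a sum) are assumptions made for illustration.

```python
from collections import defaultdict

DELIM = "{"

# Key type -> ordered dimension fields, mirroring the seven combinations above.
KEY_TYPES = {
    "bcs": ("brand_id", "root_category_id", "store_id"),
    "cb": ("root_category_id", "brand_id"),
    "sb": ("store_id", "brand_id"),
    "cs": ("root_category_id", "store_id"),
    "b": ("brand_id",),
    "c": ("root_category_id",),
    "s": ("store_id",),
}


def row_key(key_type, row):
    """Build a composite key such as 'sb{100{300' (assumed layout)."""
    parts = [key_type] + [str(row[field]) for field in KEY_TYPES[key_type]]
    return DELIM.join(parts)


def build_insights(rows):
    """Fold every fact row into all seven pre-aggregated combinations."""
    insights = defaultdict(lambda: {"count": 0, "sale_price_sum": 0.0})
    for row in rows:
        for key_type in KEY_TYPES:
            cell = insights[row_key(key_type, row)]
            cell["count"] += 1
            cell["sale_price_sum"] += row["sale_price"]
    return dict(insights)


# Made-up fact rows, for illustration only.
ROWS = [
    {"brand_id": 300, "root_category_id": 10001, "store_id": 100, "sale_price": 18.0},
    {"brand_id": 300, "root_category_id": 10001, "store_id": 101, "sale_price": 21.0},
]

if __name__ == "__main__":
    for key, cell in sorted(build_insights(ROWS).items()):
        print(key, cell)
```

Every fact row is written seven times, once per key type. That is the storage cost paid so that a query on any combination of dimensions can be answered from a single key or a narrow key range.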

Given a query, we would determine the best combination to look up and construct the scan and filter objects accordingly. If it was a single-dimension query (e.g., brandId = 10, categoryId = 10001 or storeId = 100), we would always pick key type 5, 6 or 7 (as above) respectively. If it was a complex query involving more than one dimension (e.g., (storeId = 100 OR storeId = 101) AND brandId = 300), we would add a PrefixFilter, set start and stop keys, add an appropriate mask for a FuzzyRowFilter, and aggregate the results obtained from the scanner.
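The key-selection step can be sketched without HBase. The code below picks the key type whose dimensions match the query, expands OR-ed values into candidate keys, and emulates a start/stop-key scan over a sorted list of keys. The function names and the key layout are hypothetical, and the sketch does not attempt to reproduce PrefixFilter or FuzzyRowFilter themselves.

```python
import bisect
from itertools import product

DELIM = "{"

# Same seven key types as in the list above (field names are assumptions).
KEY_TYPES = {
    "bcs": ("brand_id", "root_category_id", "store_id"),
    "cb": ("root_category_id", "brand_id"),
    "sb": ("store_id", "brand_id"),
    "cs": ("root_category_id", "store_id"),
    "b": ("brand_id",),
    "c": ("root_category_id",),
    "s": ("store_id",),
}


def pick_key_type(query):
    """Return the key type whose dimensions exactly match the queried ones."""
    wanted = set(query)
    for key_type, fields in KEY_TYPES.items():
        if set(fields) == wanted:
            return key_type
    raise ValueError(f"no key type covers dimensions {sorted(wanted)}")


def candidate_keys(query):
    """query maps dimension -> accepted values (OR within a dimension, AND across)."""
    key_type = pick_key_type(query)
    fields = KEY_TYPES[key_type]
    combos = product(*(query[field] for field in fields))
    return [DELIM.join([key_type, *map(str, combo)]) for combo in combos]


def prefix_scan(sorted_keys, prefix):
    """Emulate a scan with start key = prefix and a stop key just past the prefix."""
    stop_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    start = bisect.bisect_left(sorted_keys, prefix)
    stop = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[start:stop]


if __name__ == "__main__":
    # (storeId = 100 OR storeId = 101) AND brandId = 300
    print(candidate_keys({"store_id": [100, 101], "brand_id": [300]}))
    keys = sorted(["sb{100{300", "sb{100{301", "sb{101{300", "s{100"])
    print(prefix_scan(keys, "sb{100{"))
```

Python compares strings by code point while HBase compares raw bytes; for ASCII keys like these, the ordering is the same. The stop key is what lets the scan terminate early instead of reading to the end of the table.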

The category dimension is hierarchical, e.g., Electronics->Telephones->Mobile Phones. Expressing this in a key-value store is tricky. The category ID in the composite key above is the root category. The leaf categories under this root were pushed into columns: basically, we prepended the category tree to the column names. We did this because the next fastest way to identify a value (after the key) is to have it as part of a column name. We used ColumnPrefixFilter along with the above filters when we wanted to pick values for categories other than the root category.
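A small sketch of the column-naming idea, with a plain dict standing in for one HBase row. The '/' separator, the measure name and the values are made up for illustration; the real column layout may well have differed.

```python
SEP = "/"


def column_name(category_path, measure):
    """Prepend the category path to the measure, e.g. 'electronics/telephones/avg_sale_price'."""
    return SEP.join(map(str, category_path)) + SEP + measure


def columns_with_prefix(row, category_path):
    """Emulate a column-prefix filter: keep columns at or below the given category."""
    prefix = SEP.join(map(str, category_path)) + SEP
    return {name: value for name, value in row.items() if name.startswith(prefix)}


# One row keyed by root category; sub-categories live in the column names.
ROW = {
    column_name(["electronics"], "avg_sale_price"): 210.0,
    column_name(["electronics", "telephones"], "avg_sale_price"): 180.0,
    column_name(["electronics", "telephones", "mobile_phones"], "avg_sale_price"): 350.0,
    column_name(["electronics", "televisions"], "avg_sale_price"): 600.0,
}

if __name__ == "__main__":
    for name, value in columns_with_prefix(ROW, ["electronics", "telephones"]).items():
        print(name, value)
```

Ending the prefix with the separator keeps a lookup for "telephones" from also matching "televisions" or any other category that merely shares leading characters.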

What We Learned

  • HBase related:
    • Pre-split regions and load the data using HFiles for fast loading times.
    • Set in-memory = true for the HOT Insights table.
    • Use stop keys in scans so that they terminate early.
    • Disable the HBase write cache when running MR jobs (our initial version did not use HFile-based loads).
    • If possible, disable major compactions and trigger them manually when you expect traffic to be low.
    • Avoid common problems and performance issues by reading the HBase book.
  • Deployment related:
    • Use compression (we used LZO) for the Insights table that holds not-so-hot data.
    • Do not run ZooKeeper on region servers. If possible, run it on separate machines.
    • Use separate Insights tables for HOT data (latest day) and NOT-so-hot data (last X days) so that they can be tuned differently.
    • If you are using a common ZooKeeper quorum for multiple services (HBase, Solr, etc.), make sure the timeout values are sane for all of them.
    • Run the HDFS DataNode and the HRegionServer on the same node and enable short-circuit reads.
  • Others:
    • Always have a command line utility that takes your query, uses the same scanner logic and prints the result in your expected format (HBase stores everything as bytes).
    • HBase had a bug where data was not deleted from HDFS when a column family was deleted. We submitted a patch.
    • Don’t shy away from the latest version of HBase. Look at the release notes and make your decision.

Time to Evolve from Homo Habilis to Homo Erectus

Everything was fine until, one fine day, we got our first customer. They started giving us valuable feedback and more work. Requirements started to evolve, and they went like this:

  1. Display the actual products for a given BCS combination.
  2. Sort the products in that combination based on a derived attribute.
  3. Ability to create their own product groups (really sparse dimensions, similar to a Spotify playlist).
  4. User configurable derived attributes.

Basically, all these requirements boiled down to one simple thing: no pre-computes. We had to calculate all the derived attributes on the fly based on the dimensions queried. That meant we could no longer build an Insights table and depend on it. We had to find a new way to scan the actual fact table and compute aggregates on the fly, in under a second.

The knee-jerk reaction was to see if we could use HBase to do that. To give you a sense of the scale: if a user opens a “store page” in our app for a store with X million products, we basically have to run a full table scan over all of those rows in HBase and aggregate the values. And we needed to do nine such aggregations to render a single page in under a second. Getting all the records out of HBase and aggregating on the client side was out of the question. We tried co-processor based aggregation to see if we could achieve on-the-fly aggregations. We used the default AggregationClient co-processor which comes with HBase, along with some custom column interpreters like DoubleColumnInterpreter (similar to LongColumnInterpreter). But it didn’t cut it (even with in-memory = true and the slab cache enabled for the insight table), and we quickly realized that we should step back and rethink the process. Another problem we tried to solve with co-processors was getting the top N items (based on a derived attribute) for a given combination of dimensions. That, too, was very expensive, since it meant running a full scan on HBase tables.
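To see why top N on a derived attribute forces a full scan, consider this minimal sketch. The derived attribute (discount percentage) and the rows are hypothetical; the point is only that the attribute has to be computed for every row before any row can be ranked.

```python
import heapq


def discount_pct(row):
    """Hypothetical derived attribute: percentage discount off the list price."""
    return (row["list_price"] - row["sale_price"]) / row["list_price"] * 100.0


def top_n(rows, n, derived=discount_pct):
    """Return the n rows with the largest derived value; touches every row once."""
    return heapq.nlargest(n, rows, key=derived)


# Made-up rows, for illustration only.
ROWS = [
    {"variant_id": "v1", "list_price": 100.0, "sale_price": 90.0},
    {"variant_id": "v2", "list_price": 50.0, "sale_price": 25.0},
    {"variant_id": "v3", "list_price": 80.0, "sale_price": 60.0},
]

if __name__ == "__main__":
    for row in top_n(ROWS, 2):
        print(row["variant_id"], discount_pct(row))
```

The heap keeps memory bounded at n entries, but the work is still proportional to the number of rows in the slice, which is what made this approach too slow at our scale.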
