Metrics and Tracking for Data Systems

How do developers verify that a piece of code is doing what they think it should be doing? The correct answer would be to write a proof accompanying the code. The problem, though, is that as the code changes, the proof becomes outdated. And, to top it off, proofs aren’t easy to write and aren’t automatically executable by a computer. So what we really need is some sort of executable proof. We strike a balance and write what we call tests. More specifically, unit tests.

Unit tests are an expression of intent. They aren’t as foolproof as a machine-verifiable proof, but they are good enough for most settings. The good thing about unit tests is that they never silently go out of sync with the code; rather, they tell you when they do. Every time the code changes, it forces you to change the test, or vice versa. This gives us a safety net for the original author’s intent.

Identifying the Issue

When we build “systems”, they are generally composed of multiple interconnected modules or components. Unit tests alone aren’t effective enough to validate the functionality of such systems. The problem gets harder when the behavior of these systems changes as the data exchanged between them changes. Most practices for building better, more reliable software focus on ensuring that components which integrate and work as a single system have “integration” tests.

What we often fail to appreciate is that integration tests only cover the contracts, and they become painful to manage as components grow in number or change frequently. What we need is a way to assert, within individual modules, that the assumptions around the data hold true beyond their boundaries. For example, in our setting here at Indix, since we are dealing with product information, “price is never zero” could be an assumption for one module. However, a zero price might be perfectly valid for the component that produces that information – in case you are wondering, there are products priced at $0.00 on the “internet.”

Our system deals with a lot of data, and that data usually comes from different sources with varied levels of quality. The assumptions we have about the data don’t hold true across all components. What we would really like to have is “unit tests” for the entire system from a data perspective: some of the confidence and safety for “data” that unit tests provide for “code”.

How do we manage to do that?

Stats

The moment we hear stats, most of us start thinking about response times, latency, system performance, CPU usage, and so on. We need those, but we also need more information about the output produced by the various modules. We collect lots and lots of stats in every module, and those stats let us check the assumptions we have about the data. Every time we make algorithmic progress, run a data cleanup, or update heuristics, these before-and-after stats tell us whether we are heading in the right direction.

Most of our data pipeline is built on the MapReduce framework and runs as either micro or whole batches. As we were learning the ropes, we wrote stats jobs that ran on the output of each job and produced numbers we used as a gate for promoting code and data artifacts. This was easy when we had only a few individual modules, and most of the stats we ran just counted this or that. We were really re-inventing these stats jobs for every module we had. As a first step, we wanted to generalize them a bit so they could be used across various modules and input/output data formats.
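To make that concrete, here is a minimal sketch, in Scala, of the kind of hand-rolled counting pass we kept re-writing, and of how before/after numbers can gate a promotion. The record and field names (Product, zeroPriced, okToPromote) are hypothetical and only for illustration, not our actual schema or jobs.

```scala
// A sketch of a per-module counting stats pass (hypothetical field names).
case class Product(title: String, price: Double)

case class OutputStats(total: Long, zeroPriced: Long, missingTitle: Long)

def computeStats(products: Iterator[Product]): OutputStats =
  products.foldLeft(OutputStats(0L, 0L, 0L)) { (s, p) =>
    OutputStats(
      s.total + 1,
      s.zeroPriced + (if (p.price == 0.0) 1 else 0),
      s.missingTitle + (if (p.title.isEmpty) 1 else 0)
    )
  }

// A "before vs. after" check used to decide whether to promote a new
// code + data artifact: here, don't let the zero-price fraction regress.
def okToPromote(before: OutputStats, after: OutputStats): Boolean =
  after.zeroPriced.toDouble / after.total <=
    before.zeroPriced.toDouble / before.total
```

Every module ended up with a slightly different copy of something like this, which is exactly the duplication the next section tries to abstract away.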

Algebra

Our attempt at generalizing the various stats jobs into a library led us to the algebraic notion of monoids. Though it’s a pretty simple algebraic structure, it expresses most of our stats needs.

Before we go ahead and see how monoids solved our problems, let’s look at what monoids are. To be precise, they are categories with a single object in them; that’s the categorical notion. The algebraic notion is even simpler and more straightforward to understand.

Monoids are a set of elements with a binary operation that satisfies a set of properties: closure, associativity, and identity. Hang on, I’ll get to these shortly; a small code sketch also follows the list below.

Here’s a running example: colored hexagons (yay!!!), with the binary operation “+” denoting “mixing the colors”.

1. Closure – If you take two objects from the set and apply the binary operation to them, the result has to be an element of the same set.

In other words, if I take any two colored hexagons and mix them, I definitely get another colored hexagon and not, say, a colored circle. Here, the set is the universe of colored hexagons.

2. Associative – How you group the applications of the binary operation doesn’t matter, as long as the sequence of elements is kept intact: (a + b) + c = a + (b + c). Mixing red with green first and then adding blue gives the same hexagon as mixing red into the mix of green and blue.

3. Identity – There has to be an element in the set which, when combined with any other element using the binary operation, gives back that other element unchanged.

There has to be a transparent hexagon in our colored-hexagon universe to make it a monoid: mixing any hexagon with the transparent one leaves its color as it was.

4. Commutative – The two objects passed to the binary operation can be swapped and I still get the same result.

This property isn’t necessary for a monoid, but it does hold in most of the cases we care about.

It doesn’t matter whether I mix red with green or green with red; I get the same yellow hexagon either way.
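Here is the promised sketch: a minimal monoid abstraction in Scala. The names (Monoid, empty, combine, StringConcat) are assumed for this post, not our actual library. Closure is captured by the types; associativity and identity are laws the instance has to respect.

```scala
// A minimal monoid abstraction (illustrative names, not Indix's library).
// Closure: combine takes two As and returns an A.
trait Monoid[A] {
  def empty: A                 // identity element: combine(empty, a) == a
  def combine(x: A, y: A): A   // binary operation; must be associative
}

// A concrete instance: string concatenation, with "" as the identity.
object StringConcat extends Monoid[String] {
  val empty = ""
  def combine(x: String, y: String) = x + y
}
```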

Why does this whole thing matter? Something that’s a given with most of our modules is the scale of data. We deal with really large volumes, so crunching the required stats quickly is very important. Monoids allow us to parallelize the operations, which is essential when running on large amounts of data. The parallelization is enabled by the associative property: it lets you partition the data any way you want, compute the result for each part, and then combine the parts together.
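A sketch of that partition-and-combine idea, using the Monoid trait from above (the chunk size and the foldMap name are arbitrary choices for illustration):

```scala
// Partition the data, fold each partition independently (potentially on
// different machines or threads), then combine the partial results.
// Associativity guarantees the answer doesn't depend on how we partitioned.
def foldMap[A, B](data: Seq[A])(f: A => B)(m: Monoid[B]): B = {
  val partials = data
    .grouped(1000)                                           // arbitrary chunk size
    .map(chunk => chunk.map(f).foldLeft(m.empty)(m.combine)) // fold each chunk
  partials.foldLeft(m.empty)(m.combine)                      // combine the chunks
}
```

In a MapReduce setting, the per-chunk fold is what mappers and combiners do, and the final combine is the reduce; the same shape works either way.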

There is an even simpler cousin of the monoid which is just like a monoid except it doesn’t need an identity. It’s called a semi-group.

Some More Relevant Examples

Integers form monoids under both sum and product: the binary operation is addition with 0 as the identity, or multiplication with 1 as the identity.

If we consider integers, binary operations like minimum or maximum form a semi-group but not a monoid in the purist sense. However, we can take Int.MaxValue and Int.MinValue as their respective identities to make them monoids. In general, we can turn semi-groups into monoids by adding a “nominal” identity element.
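For illustration, here are those integer instances written against the Monoid trait sketched earlier (again, assumed names rather than our library):

```scala
object IntSum extends Monoid[Int] {
  val empty = 0                                   // 0 is the identity for +
  def combine(x: Int, y: Int) = x + y
}

object IntProduct extends Monoid[Int] {
  val empty = 1                                   // 1 is the identity for *
  def combine(x: Int, y: Int) = x * y
}

// Min and max are semi-groups on Int; adding a "nominal" identity
// (Int.MaxValue for min, Int.MinValue for max) turns them into monoids.
object IntMin extends Monoid[Int] {
  val empty = Int.MaxValue
  def combine(x: Int, y: Int) = math.min(x, y)
}

object IntMax extends Monoid[Int] {
  val empty = Int.MinValue
  def combine(x: Int, y: Int) = math.max(x, y)
}
```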

Now that we’ve seen some examples, what do you think of averages? Do they form a semi-group? If we have to operate on the resulting values of the average operation, they don’t. Why? Exercise: check whether avg(a, avg(b, c)) equals avg(avg(a, b), c). So how do you express an average as a semi-group? Hint: an average is computed from two components; see if those can be expressed as a semi-group.
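If you want to check your answer (spoiler ahead), one common way, assumed here purely for illustration, is to carry the average around as a (sum, count) pair and divide only at the very end:

```scala
// Represent the average as a (sum, count) pair, which combines associatively.
case class Avg(sum: Double, count: Long) {
  def value: Double = if (count == 0) 0.0 else sum / count
}

object AvgMonoid extends Monoid[Avg] {
  val empty = Avg(0.0, 0L)   // the identity: "no observations yet"
  def combine(x: Avg, y: Avg) = Avg(x.sum + y.sum, x.count + y.count)
}
```

Combining pairs is just two sums, so it is associative; averaging the already-averaged values is not, which is why the naive version fails the exercise.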

Back to our story. Armed with this monoid abstraction, we looked again at many of our jobs that churned out stats. To make the computations simpler and more composable, all we had to do was write monoids for those stats and let developers use them for their needs. These implementations were done keeping the details of the MR frameworks in mind, so as to leverage them. The output from these stats jobs gets pushed to Influx, which helps us with historical trends and also helps us understand and identify anomalies.
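As a sketch of what “composable” buys you: several of the monoids above can be bundled into one composite stat and computed in a single pass. The ProductStats fields below are hypothetical; libraries such as Twitter’s Algebird offer similar ready-made monoid building blocks, though we are not claiming that is what is used here.

```scala
// A composite stats monoid: each field is itself a monoid (sum, sum, max),
// so the whole record combines field by field.
case class ProductStats(records: Long, zeroPriced: Long, maxPrice: Double)

object ProductStatsMonoid extends Monoid[ProductStats] {
  // maxPrice identity of 0.0 assumes prices are never negative.
  val empty = ProductStats(0L, 0L, 0.0)
  def combine(x: ProductStats, y: ProductStats) =
    ProductStats(
      x.records + y.records,
      x.zeroPriced + y.zeroPriced,
      math.max(x.maxPrice, y.maxPrice)
    )
}

// Per-record "map" step; plug into foldMap from the earlier sketch,
// or into the map/combine/reduce phases of an MR job.
def toStats(price: Double): ProductStats =
  ProductStats(1L, if (price == 0.0) 1L else 0L, price)
```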

So, what next? About a year ago, we started moving some of our batch systems to real-time streaming systems. One of the challenges with this, apart from the core tech stack, was the metrics and monitoring infrastructure around it. When “live” systems fail or deteriorate, we need ways to be notified about it “soon.” Compared to batch jobs, which clearly call out when they fail, a streaming system’s performance can degrade silently without being noticed.

To tackle this problem, we wanted to have stats “now” rather than “later”, even if they were only approximate. We built a real-time stats collection framework, in a generic way, for some of the domain-level metrics that we collect from individual systems. We will talk about its high-level features, design, and deployment details in an upcoming blog post.
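We will leave the details to that post. As a purely speculative sketch of the general idea (our assumption, not the actual framework), the same monoid machinery lends itself to “stats now”: keep a running aggregate and fold each event in as it arrives.

```scala
// A running, monoid-backed stat that can be read at any moment.
class RunningStat[A](m: Monoid[A]) {
  @volatile private var current: A = m.empty

  def record(event: A): Unit = synchronized {
    current = m.combine(current, event)   // fold the new event into the aggregate
  }

  def snapshot: A = current               // the "now" value
}

// Usage: a live count of zero-priced products seen so far.
// val zeroPriced = new RunningStat(IntSum)
// zeroPriced.record(1)
```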



