Monday, January 22, 2018

Dropwizard Metrics - considered harmful?

I admit, the title sounds like clickbait. Dropwizard Metrics is a pretty nice library. What is actually harmful are your assumptions about its functionality. Are you familiar with Dropwizard's reservoir implementations? If not, let me start with an example:

First, I generate a data set with a lot of values representing normal application behaviour (from 0 to 100), some standard peaks (1000) and a few really high peaks (10000). After that, I use a Timer to read the percentiles (the only reasonable statistical property when it comes to performance) and compare the results across several runs.
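A minimal sketch of such a measurement might look like this (the exact sample counts are my assumption, picked so that they add up to the roughly 110 problematic samples discussed below):

```java
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

public class ReservoirDemo {

    public static void main(String[] args) {
        List<Long> dataset = dataset();

        // Measure the exact same dataset three times.
        for (int run = 1; run <= 3; run++) {
            Timer timer = new Timer(); // default: ExponentiallyDecayingReservoir
            dataset.forEach(value -> timer.update(value, TimeUnit.NANOSECONDS));

            Snapshot snapshot = timer.getSnapshot();
            System.out.printf("run %d: p75=%.0f p99=%.0f p999=%.0f max=%d%n",
                    run,
                    snapshot.get75thPercentile(),
                    snapshot.get99thPercentile(),
                    snapshot.get999thPercentile(),
                    snapshot.getMax());
        }
    }

    private static List<Long> dataset() {
        Random random = new Random();
        List<Long> values = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            values.add((long) random.nextInt(101)); // normal behaviour: 0..100
        }
        for (int i = 0; i < 100; i++) {
            values.add(1_000L);                     // standard peaks
        }
        for (int i = 0; i < 10; i++) {
            values.add(10_000L);                    // really high peaks
        }
        Collections.shuffle(values, random);
        return values;
    }
}
```

Every run prints a different set of percentiles for exactly the same input.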
Crazy, right? The same data set and different results each time! Why is Dropwizard lying to me about my data?
In fact, it is working as expected. By default, Timer uses an ExponentiallyDecayingReservoir - a fancy algorithm designed to “produce a statistically representative sampling reservoir”. In practice, it keeps a random subset of the data and calculates the percentiles (and other properties) from that subset alone. Why is this problematic? Check the results from the second iteration: there are no signs of performance problems, even though 110 samples clearly show that something is wrong with the application. Maybe that is acceptable and can be ignored, but I prefer to make that decision consciously and to be aware of this behaviour. Okay, so I just need to change the Reservoir implementation, right? Yes - but currently every bundled implementation can ignore or “hide” some samples.
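Swapping the reservoir is just a constructor argument (a sketch; the registry and metric name are made up for illustration), but each of the bundled implementations still keeps only a bounded sample:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.SlidingWindowReservoir;
import com.codahale.metrics.Timer;

public class CustomReservoirDemo {

    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // A Timer with an explicit reservoir instead of the default
        // ExponentiallyDecayingReservoir. A SlidingWindowReservoir keeps only
        // the last 1028 measurements, so older samples are still dropped;
        // UniformReservoir and SlidingTimeWindowReservoir have analogous limits.
        Timer timer = new Timer(new SlidingWindowReservoir(1028));
        registry.register("requests", timer);
    }
}
```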

Calculating percentiles seems like a trivial task. The real challenge starts when you need to support a dynamic range of values with high precision and a small memory footprint. As usual in IT, you cannot get all three properties at the same time. Fortunately, there is a tool that makes an acceptable compromise here: HdrHistogram is my weapon of choice when it comes to performance measurements. You can read an excellent post about it here.
After filling an HdrHistogram with the same dataset, the percentiles are always the same (for the same dataset), and no sample is ever ignored.
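A minimal sketch of the same experiment with HdrHistogram (the range and precision below are assumptions, not the settings used in the post):

```java
import org.HdrHistogram.Histogram;

import java.util.Random;

public class HdrDemo {

    public static void main(String[] args) {
        // Track values up to 3,600,000 with 3 significant decimal digits
        // of precision (assumed configuration).
        Histogram histogram = new Histogram(3_600_000L, 3);

        Random random = new Random();
        for (int i = 0; i < 10_000; i++) {
            histogram.recordValue(random.nextInt(101)); // normal behaviour: 0..100
        }
        for (int i = 0; i < 100; i++) {
            histogram.recordValue(1_000);               // standard peaks
        }
        for (int i = 0; i < 10; i++) {
            histogram.recordValue(10_000);              // really high peaks
        }

        // Every single value is recorded, so for the same dataset these
        // numbers never change between runs.
        System.out.printf("p99=%d p99.9=%d max=%d%n",
                histogram.getValueAtPercentile(99.0),
                histogram.getValueAtPercentile(99.9),
                histogram.getMaxValue());
    }
}
```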

There is a cost to this approach, and it is accuracy. The maximum time comes out as 16383 instead of 10000 because of the dynamic range: the precision scales with the recorded values, so the buckets get wider as the values grow. In most cases I am interested in knowing that the application had a hiccup; its actual extent may be an approximation.
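Where a number like 16383 can come from: HdrHistogram keeps relative precision, so its buckets widen as values grow, and a recorded value is reported at its bucket's resolution. A sketch that reproduces this rounding, assuming a very low-precision histogram (the post does not show its configuration):

```java
import org.HdrHistogram.Histogram;

public class PrecisionDemo {

    public static void main(String[] args) {
        // 0 significant digits: bucket widths double as values grow, so 10000
        // falls into the 8192..16383 bucket and is reported at its upper edge.
        // (Assumed configuration, chosen to reproduce the 16383 above.)
        Histogram coarse = new Histogram(3_600_000L, 0);
        coarse.recordValue(10_000);
        System.out.println(coarse.getMaxValue()); // 16383

        // With 3 significant digits the same value is reported almost exactly.
        Histogram precise = new Histogram(3_600_000L, 3);
        precise.recordValue(10_000);
        System.out.println(precise.getMaxValue()); // 10007 - within 0.1% of 10000
    }
}
```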
