Hi. 👋
I'm a Software Engineer with Datadog, based in Munich, Germany. 🥨 Ex-Skyscanner.
If you found any of my work useful, you might want to consider buying me a coffee.
I occasionally write engineering blog posts. Sometimes my colleagues blog about our work, too.
How we improved query performance for Skyscanner's OpenTSDB cluster, and enabled queries that were previously impossible to serve, by reducing the resolution of historic data.
👉 Roll up to speed up: Improving OpenTSDB query performance
Annette blogged about phantom alerts that our alerting solution Bosun would fire every so often, paging on-call engineers, only to turn out to be false every time. The alert condition that triggered the alert would recover on the next evaluation, only split seconds later. Subsequent investigation, including resubmitting the exact same query, wouldn't show any sign of a problem, let alone the alert condition being met.
It had been annoying us for two years, but it happened infrequently enough that any investigation efforts were regularly abandoned without meaningful results, and the cause stayed hidden for years. It was mysterious and interesting enough to still blog about it, though. Also, we really wanted to sleep comfortably again without the threat of a false alert waking us up. The blog post describes the problem and, in an addendum, how I finally found the root cause.
TL;DR - Expand here to show the root cause if you don't like exciting stories
Our initial suspicion of a bug in Bosun turned out to be incorrect. When our time series database OpenTSDB serves a query, it uses 8 scanners to fetch all the required data from HBase asynchronously and then merges their results before returning the response to the client.
The scanners write their results to a map. The data structure used to generate the keys for these results, however, wasn't thread-safe, and in a rare race condition it could return the same key for two scanners, which meant that one overwrote the other's results. Bosun then received incomplete data and the alert went into an unknown state, paging the on-call engineer.
The unspectacular fix can be seen in OpenTSDB/opentsdb#1754.
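For readers who prefer code to prose, here is a minimal Python sketch of this class of bug. It is not the OpenTSDB code (which is Java) and not the actual fix from the pull request: `UnsafeKeyGenerator`, `SafeKeyGenerator`, and both merge functions are hypothetical names made up for illustration. The first half replays the unlucky interleaving deterministically, so that two scanners end up with the same key; the second half shows one common remedy, handing out keys under a lock.

```python
import threading


class UnsafeKeyGenerator:
    """Hypothetical key source whose read and update are separate steps."""

    def __init__(self):
        self._next = 0

    def read(self):
        return self._next

    def commit(self, seen):
        self._next = seen + 1


def merge_with_forced_race(scanner_results):
    """Replay the bad interleaving: every scanner reads before any commits."""
    keys = UnsafeKeyGenerator()
    merged = {}
    seen = [keys.read() for _ in scanner_results]  # all scanners see the same key
    for key, rows in zip(seen, scanner_results):
        keys.commit(key)
        merged[key] = rows  # same key, so a later scanner overwrites an earlier one
    return merged


class SafeKeyGenerator:
    """Hypothetical remedy: read and update happen atomically under a lock."""

    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def next_key(self):
        with self._lock:
            key = self._next
            self._next += 1
            return key


def merge_with_threads(scanner_results):
    """Run one thread per scanner; every scanner gets its own key."""
    keys = SafeKeyGenerator()
    merged = {}

    def scan(rows):
        merged[keys.next_key()] = rows

    threads = [threading.Thread(target=scan, args=(rows,)) for rows in scanner_results]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return merged
```

With two scanners, `merge_with_forced_race([["a"], ["b"]])` returns a map with a single entry, which is the kind of incomplete result that would leave an alerting system without data. With the locked generator, eight scanners produce eight entries.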
👉 The problem that wasn’t there — and the Bosun alerts that were