There is a vast market for consumer data (which is more than just browsing history), with everything from credit agencies like Equifax, to data brokers like Acxiom, to traditional analytics companies like Nielsen and Comscore, to top tech companies like Google, Adobe, Oracle, Salesforce, and others that offer marketing clouds with DMPs (data management platforms). It's a massive industry.
A few comments on some things I'm seeing in these... comments.
- What about Cassandra/PostgreSQL/Redis?
One of the implications of "99% of data is never read" is that it's incredibly wasteful to keep it all in memory. You might be assuming you'd just let the data expire eventually, but I'm not; expiry is a secondary concern next to storage.
Once you start to involve the disk (PostgreSQL and Cassandra), you run into locality issues that these databases weren't really designed for.
For a more concrete description, let's say I have 2000 machines. Our app is Python, so they run 16 dockerized processes each, with each container reporting 10 simple system metrics every 10 seconds. These metrics are locally aggregated (like statsd) into separate mean, max, min, median, and 90th-percentile series. I've not instrumented my app yet, and that's already 160k writes/sec on average; if our containers all thunder at us at once, it's 1.6M (a quick arithmetic sketch follows the list below). And frankly this is the "simple" case, as:
* we've made some concessions on the resolution
* we only have dense series
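To make that arithmetic easy to check, here's a quick sketch; all the numbers are the hypothetical ones from this comment, not measurements from any real deployment:

    # Back-of-the-envelope write rate for the scenario above.
    machines = 2000
    containers_per_machine = 16   # dockerized Python processes
    metrics_per_container = 10    # simple system metrics
    aggregations = 5              # mean, max, min, median, 90th percentile
    flush_interval_s = 10

    points_per_flush = (machines * containers_per_machine
                        * metrics_per_container * aggregations)
    print(points_per_flush)                     # 1,600,000 points per 10s flush
    print(points_per_flush / flush_interval_s)  # 160,000 writes/sec on average
    # If every container flushes in the same second (the thundering herd),
    # all 1.6M writes land at once.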
Anyone who has used Graphite at scale knows this is a really painful scenario, but these numbers are not particularly big; anywhere you could shave off an order of magnitude, there are a few other places you could add one.
I'm also assuming we are materializing each of these points into its own timeseries, but that's more or less a necessity. It gets back to the locality issues: if we want to see the "top 10 system load per node", it's imperative that we aren't wasting IO cycles on series that _aren't_ system load; we need all the IO we've got to handle the reads.
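To make the locality point a little more concrete, here's a toy sketch of what "one timeseries per metric/aggregation/tag set" looks like as storage keys. The key format is invented for illustration and isn't any particular TSDB's schema:

    # Hypothetical series keys: one series per (metric, aggregation, tag set).
    # A "top 10 system load per node" query only has to scan the system.load
    # series; the other series these containers emit cost it no IO at all
    # when each series is stored separately.
    keys = [
        "system.load.mean{host=web-0042,container=07}",
        "system.load.p90{host=web-0042,container=07}",
        "system.cpu.usage.mean{host=web-0042,container=07}",  # untouched
        "system.mem.rss.max{host=web-0042,container=07}",     # untouched
    ]
    wanted = [k for k in keys if k.startswith("system.load.")]
    print(wanted)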
(As a side point, this is why people in the cloud are adopting Go so quickly; it's proven to be easy to write AND read, and also to shave an order of magnitude off several of the dimensions above, e.g. "we can use 1 process per box instead of 16 containers, and we can get by with 1000 machines instead of 2000." Having to write your own linked list or whatever doesn't register in that calculus.)
- 1s resolution isn't dense:
No, not always. It's hard to please everyone with these things. In my world, 1s is good (but not great), 10s seems to be more or less universally accepted, and much sparser (5m, 30m, 6h) is not actually uncommon. At the other end of the spectrum, you can be instrumenting at huge cardinality but very sparsely (think per-user or per-session), perhaps with only a few points per timeseries, and the whole of what I've described above gets flipped on its head and a new reality emerges. For what I've described, I quite like the Prometheus approach, but for my very specific use case, one-file-per-metric only amortizes the filesystem block overhead over very long timeframes; too long.
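To illustrate that overhead point with made-up but plausible numbers (a 4 KiB filesystem allocation unit, 16 bytes per uncompressed point):

    # Rough sketch of the block-overhead problem for sparse series.
    # All numbers are illustrative assumptions, not benchmarks.
    block_size = 4096       # typical filesystem allocation unit, bytes
    bytes_per_point = 16    # uncompressed {timestamp, float64} pair
    sparse_points = 8       # e.g. a per-session series with a handful of points

    used = sparse_points * bytes_per_point
    print(f"{used} useful bytes in a {block_size}-byte file "
          f"({used / block_size:.1%} utilization)")
    # A dense 10s-resolution series fills its first block after ~256 points
    # (about 43 minutes of data); a series with only a few points never gets
    # close, so one-file-per-metric only pays off over long timeframes.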
- Why are all TSDB over-engineered?
I hope some of the above has made explicit some of the difficulties in collecting this data and actually making it readable. So far I've only discussed the problem of "associate this {timestamp,value} pair with the correct series in storage"; there are also the following problems:
* you can't query 3 months of 1s-resolution data in a reasonable amount of time, so you need to do rollups, but the aggregations we want aren't all associative, so you have to compute a rollup per aggregator or else you lose accuracy in a huge way (e.g. if you do an avg rollup across min data, you flatten out your mins, which you don't want; see the sketch after this list); this means adding ANOTHER few dimensions to your storage system (time interval, aggregator)
* eventually you have to expire some of this junk, or move it to cold storage; this is a process, and processes require testing, vigilance, monitoring, development, etc.
* you need an actual query system that takes something useful and readable by a human ("AVG system.cpu.usage WHERE env=production AND role=db-master") and determines which series actually fall into those categories for the time interval you're querying. Any holistic system that _doesn't_ do this is an evolutionary dead end; eventually, something like Prometheus or Influx will replace it.
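Here's a minimal sketch of the rollup point from the first bullet, with toy numbers:

    # Why rollups have to be aggregation-aware. Toy data, not real metrics.
    # Five 1-minute "min" values for one series:
    minute_mins = [12.0, 3.0, 15.0, 14.0, 13.0]

    avg_rollup = sum(minute_mins) / len(minute_mins)  # 11.4: the 3.0 dip vanishes
    min_rollup = min(minute_mins)                     # 3.0: keeps what "min" means
    print(avg_rollup, min_rollup)
    # Averaging min data hides exactly the excursions the min series exists
    # to capture, so each aggregator generally needs its own rollup series,
    # i.e. (time interval, aggregator) become extra storage dimensions.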
These are the minimum requirements once you "solve" storage, which is always a tricky thing to claim to have done. If you get here, you've reached what decent SaaS offerings did 4 years ago and what very expensive proprietary systems handled 10 years ago.
- What about Prometheus/InfluxDB/Kdb+/et al.?
Kdb+ is very expensive, its publicly available documentation is difficult, and its source is unintelligible. It is basically from a different planet than the one I'm from. Even recently, when I encounter people from, say, the C# world and tell them I work with Python and Go, they ignore Go and say "Wow, there are like no jobs for Python", which I find utterly bewildering. Of course, I never encounter any jobs using C#, either. That is how little some of these spheres overlap sometimes. Someone from the finance world is going to have to come in and reproduce the genius of K and Q for us mortals, in a language we understand.
As for Prometheus and InfluxDB, I follow these more closely and have a better understanding of how they operate. I think that they are both doing really valuable work in this space.
From a storage perspective, I think the Prometheus approach is closer to the one I need for my particular challenges than the InfluxDB one is, and in fact it looks a bit like things we've already had (see also Catena, Parquet, et al.). For most people, storage actually isn't important so long as it's fast enough.
And this is kind of the point of my article. There's starting to be a bit of convergence among a few open-source TSDBs, and I've tried to highlight some issues in those approaches and suggest that there's room for improvement. I have my own ideas about what these improvements might look like, based on my work at Datadog, and once they're proven (or even disproven) they'll be published somehow.