> For us, knowing immediately who should have had data on the last scrape but didn't respond is the value.
Maybe I don't understand your use case well, but with tools like Riemann, you can detect stalled metrics (per host or service), who didn't send the data on time, etc.
> What mistakes are you referring to?
Besides scaling issues and the appeal of a simpler architecture, in Zabbix's case there were issues with predictability: it was hard to know when the server would actually start to pull the metrics (different metrics could have different cadences), especially when the main Zabbix service was reloaded, had connection issues, or was saturated with stuck threads because some agents took longer to respond than others. This is not Zabbix-specific but a common challenge whenever a central place has to go around, query things, and wait for responses.
> you can detect stalled metrics (per host or service), who didn't send the data on time, etc
I guess the difference here is that we leverage service discovery in Prometheus for this instead of having to externally build an authoritative list of who/what should have pushed metrics.
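To make that concrete, here is a minimal sketch of what "service discovery as the authoritative list" looks like in a Prometheus scrape config, assuming EC2 service discovery; the `monitor` tag name is illustrative, not anything Prometheus mandates:

```yaml
scrape_configs:
  - job_name: tagged-hosts
    ec2_sd_configs:
      - region: us-east-1
    relabel_configs:
      # Keep only instances carrying the (hypothetical) "monitor" tag.
      # Everything discovery returns with this tag *must* answer scrapes;
      # nothing outside discovery needs to be tracked separately.
      - source_labels: [__meta_ec2_tag_monitor]
        regex: "true"
        action: keep
```

The point is that the "who should be reporting" list is derived from the same source of truth that provisions the hosts, rather than maintained by hand.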
> <...> and wait for a response.
As opposed to waiting for $thing to push metrics to you?
I guess I'm not convinced that one architecture is obviously better? There might be downsides to a particular implementation, but generally both work, and external constraints will dictate which you use. E.g., if you're required to ship metrics to multiple places, pushing to Graphite and Datadog becomes easier.
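As a sketch of why push makes multi-destination shipping easy: Graphite's Carbon daemon accepts a plaintext line protocol (`metric.path value unix_timestamp`) on TCP port 2003, so fanning the same batch out to several backends is just another connection. The hostname below is a placeholder, not a real endpoint:

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol:
    '<metric.path> <value> <unix_timestamp>\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

def push_metrics(lines, host="graphite.example.com", port=2003):
    """Send a batch of pre-formatted lines over one TCP connection.

    Shipping to a second destination (another Graphite, a Datadog
    forwarder, etc.) is just a second call with a different host.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode())
```

The pull-side equivalent (pointing several scrapers at the same target) works too, but each scraper then needs its own discovery and access to the target, which is what makes push feel easier here.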
Anything that _should_ be scraped is tagged a certain way and anything that doesn't respond to a scrape gets flagged. After a few flags, an operator is paged. When $thing is destroyed or re-provisioned, different tags lead to a different set of $things to scrape metrics from.
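The flag-then-page flow above maps directly onto a Prometheus alerting rule on the built-in `up` metric, which Prometheus sets to 0 for any discovered target that fails a scrape. The group name, duration, and labels here are illustrative:

```yaml
groups:
  - name: scrape-liveness
    rules:
      - alert: TargetScrapeFailed
        # up{...} is 0 when a discovered target didn't respond to a scrape.
        expr: up == 0
        # "After a few flags": only fire once the target has been failing
        # for several consecutive scrape intervals.
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} should have been scraped but didn't respond"
```

When a $thing is destroyed or re-tagged, service discovery stops returning it, its `up` series goes stale, and the alert resolves without anyone maintaining the list by hand.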
For us, knowing immediately who should have had data on the last scrape but didn't respond is the value. What mistakes are you referring to?
(I am genuinely curious, not baiting you into an argument!)