> That comes out to an estimate of ~30k machines with ~2M total cores and this reverse proxy being responsible for ~20% of CPU time on average for every single machine (40k CPU cores from the article divided by 2M total cores across the network).
I think you wanted to say 2% of CPU time on average and not 20%, no?
> In fact, the server that's running the proxy is also the server that's handling the inbound request as well as the Workers runtime, Cloudflare tunnels, firewall product, image optimizations, DDOS protections, etc etc etc & any other internal software they have to run to maintain their network (observability, keeping track of routing graphs, etc).
Sure, they have a plethora of software that they need to run, but that wasn't my point. My point was rather that I cannot imagine that a single VM that runs the low-latency service such as the proxy is also running any type of heavier weight services. Thus the argument about freeing up CPU cycles for other software, when in reality there probably isn't any, didn't make much sense to me.
This would be a legitimate argument if there were a problem to begin with, such as "we noticed that our 64-core 128G VM that runs the proxy is becoming saturated and thus the tail latency of our XYZ service running on the same VM is starting to degrade". Without a problem statement, "freeing up cycles for other software" is a purely theoretical goal that is difficult, if not impossible, to quantify.
> And yes, it's a micro optimization but clearly helpful at scale when you add up many teams doing micro optimizations.
Perhaps in companies such as Google, where a dedicated team develops fundamental low-level data structures and/or runtimes that are then used by thousands of other internal services. I believe that's actually one of the reasons why Google Benchmark was designed. In the majority of other companies, micro-optimizations are only a very good brain-gymnastics exercise, I think.
> For example, here's a sqlite blog post [4] explaining how micro optimizations that don't even show up in real world systems (unlike this one which does) over an extended period of time yielded an overall 50% improvement.
The hypothesis of
"100s of micro-optimizations add up to an overall 50% improvement"
is as likely as, or mathematically speaking (due to Amdahl's law) less likely than, the hypothesis of
"there were a few changes, but we don't know exactly which changes contributed the majority of the 50% improvement"
I intentionally say hypothesis because that's what it is. There's no evidence to support either of the claims. If I wanted to support my time spent working on the micro-optimizations at the XYZ company, this is also how I would frame it. This is not to say that I think that the author of sqlite was trying to do exactly that but only that the reasoning outlined in the post is not convincing to me.
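For reference, Amdahl's law says that speeding up a fraction p of the runtime by a factor s gives an overall speedup of 1 / ((1 - p) + p / s). A minimal sketch follows; the function names and the example numbers are mine and purely illustrative, not measurements from sqlite or anyone else.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of runtime is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p: float) -> float:
    """Upper bound on the overall speedup as s grows without limit (p < 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Hypothetical case: a hot spot that is 2% of runtime, made 5x faster.
    print(f"overall speedup: {amdahl_speedup(0.02, 5.0):.4f}x")
    print(f"best possible:   {amdahl_limit(0.02):.4f}x")
```

The point of the bound is that the untouched part of the runtime caps the overall gain, however fast the optimized part becomes.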
That said, reducing the CPU cycles for a certain workload by ~30% is a noble and quantifiable result, especially if you're a database company hosting thousands of database instances, since this directly translates to more $$$ due to better hardware utilization.
However, the trie-hard micro-optimization was made in software that accounts for 2% of their total CPU time (if your ballpark estimates are somewhat close). Since the micro-optimization cut CPU utilization from 1.71% to 0.34% of that 2%, the fleet-wide share of CPU time went from 0.0342% (before the change) to 0.0068% (after the change).
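For what it's worth, the arithmetic can be sketched like this, taking the ballpark figures quoted above as given (40k proxy cores, ~2M fleet cores, and the 1.71% and 0.34% shares); the variable and function names are mine. Note that the products come out as fractions, so 0.000342 corresponds to 0.0342% when written as a percentage.

```python
# Ballpark inputs taken from the discussion above; they are estimates, not measurements.
PROXY_CORES = 40_000        # cores attributed to the proxy
FLEET_CORES = 2_000_000     # estimated total cores across the network

# Share of fleet CPU time spent in the proxy, as a fraction.
proxy_share = PROXY_CORES / FLEET_CORES


def fleet_share(share_of_proxy: float, proxy_share_of_fleet: float) -> float:
    """Fraction of fleet-wide CPU time used by a function inside the proxy."""
    return share_of_proxy * proxy_share_of_fleet


before = fleet_share(0.0171, proxy_share)  # 1.71% of the proxy's CPU time
after = fleet_share(0.0034, proxy_share)   # 0.34% of the proxy's CPU time

if __name__ == "__main__":
    print(f"proxy share of fleet: {proxy_share:.2%}")
    print(f"before: fraction {before:.6f} = {before:.4%}")
    print(f"after:  fraction {after:.6f} = {after:.4%}")
```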
> My point was rather that I cannot imagine that a single VM that runs the low-latency service such as the proxy is also running any type of heavier weight services
I encourage you to imagine harder, then. Nearly all of the software runs on bare metal and, as I said, it’s all replicated on each machine. So the same server runs the reverse proxy, the CDN, the Workers runtime and Cloudflare tunnels, because the anycast routing they use intentionally means that every inbound request hits the nearest server, which generally processes the entire request right on that machine. Unimog [1] distributes requests within a data center and Plurimog [2] distributes load between data centers. I’ll repeat: every server runs the proxy along with every other internal and external product, and there are no VMs or containers*.
I’m not going to engage with the rest of your hypothesis that it’s not worth the effort until services are falling over or it’s the root cause for some tail latency since the team has indicated it clearly is worth their time and value to Cloudflare.
* Technically there are VMs, but they’re not worth thinking about for this specific discussion, since the VM is closer to being just like any other application. It’s not virtualizing the Cloudflare software stack, just providing a layer of isolation.