This hits pretty hard, but resonates as true in many orgs.
Trying to see where else we have the same dynamic, CI came to mind.
CI is often pretty slow, and I've had jobs where it takes 20 min for the full results (lint, compile, tests, etc.) to come out. It was a pain point raised to management, we complained about the productive hours lost to it, and the answer was to trim down tests and split the code base, instead of "just" paying for faster CI instances.
I'd expect other orgs to make more sensible choices, but in general getting budget for tooling feels hard.
The problem is that most people don't know any better. Right now I'm working on getting a greater-than-90-minute build down to ~10 minutes using distcc, ccache, and mold. It's painful to hear management ask "when is it going to be done and we can move on?" because that really telegraphs they don't value the increased quality of life this decreased cycle time will deliver to the developers and users.
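To give a concrete flavour of what that work looks like, here's a minimal sketch assuming a CMake-based C++ build; the host names, core counts, and build directory are placeholders, not my actual setup. ccache answers repeat compiles from its cache and falls through to distcc on a miss so the compile runs on the farm, and mold replaces the default linker (this needs a toolchain recent enough to accept -fuse-ld=mold):

    # Sketch only: hosts and paths are made up.
    export CCACHE_PREFIX=distcc            # on a cache miss, ccache hands the job to distcc
    export DISTCC_HOSTS="localhost/8 buildhost1/16 buildhost2/16"
    cmake -S . -B build \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
      -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \
      -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold"
    cmake --build build --parallel "$(distcc -j)"

Once something like this is in place, the elapsed time mostly tracks what actually changed, rather than recompiling and relinking the world every time.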
I'm reminded of the cartoon with the caveman pulling a cart with square wheels, and a guy offering him round wheels, only to be told "No thanks, we are too busy."
If you were using a service to run those builds, reduced build time would translate directly to lower costs.
Even if you are using a wasteful, low-utilization capital plant to run your builds, more efficient use of those resources should at least show up as lower energy bills.
The only way you get lower energy bills is by switching things off, not by running the same machines at less than full whack, but as I said, people don't know any better than the mess they're in.
the other side of this is that once you have the faster instances, the pressure will be off to do anything to improve build efficiency, and everyone will add more to the build till it's 20 minutes again.
Getting build times down is a difficult project because it's a cross-functional task.
Like everything where you're involving a bunch of teams with no free space on their roadmaps, you have to be able to demonstrate that your team has done everything it can, and now the pressure needs to fall on the other teams. Like have a wiki or presentation to show how you have already trimmed the tests as much as possible, and the codebase is as small as possible, and you're not making 100 unnecessary docker layers, etc.
That's your ammunition in the fight to get something like resizing the build instances onto the other teams' roadmaps and paid for out of their budgets.
I currently wait twenty minutes for CI to package a Go program that can be built on a laptop in ten seconds, because the "CI pipeline" (and every time I have to say that, I choke on the words a little) involves twenty dependent steps, each of which installs Ubuntu. And this is a large, valuable company with thousands of engineers. I think having a development environment where the speed of the linker is even remotely relevant is a problem faced by a few special companies.
Sounds like it already is containerized, and that's the problem? A lot of places have CI setups that don't cache things anymore for various reasons. A while back there was a new YC startup announced whose product was essentially a SaaS for building Docker containers where they cache things for you.
Another cause of this problem is when CI workers are spun up and down on demand in the cloud, so VMs are constantly being set up from scratch.
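If the pipeline is Docker-based, one mitigation is to park the build cache somewhere that outlives the worker, such as a registry, so a freshly provisioned VM can still get cache hits. A rough sketch with BuildKit; the registry and image names are placeholders:

    # Sketch only: registry/image refs are made up.
    docker buildx build \
      --cache-from type=registry,ref=registry.example.com/app:buildcache \
      --cache-to   type=registry,ref=registry.example.com/app:buildcache,mode=max \
      -t registry.example.com/app:ci --push .

The first build on a brand-new worker then pulls cached layers over the network instead of redoing every step from scratch.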
It's for the same reasons you can't just inline everything in a program, but instead break things into functions, modules, etc. Complex CI systems are, in my experience, built up from components that are reused across multiple project builds and need to be encapsulated in a way that lets them be plugged into any of those. Each component is containerized and caches as much as possible at each step, but that is still a lot of overhead.
> the pressure will be off to do anything to improve build efficiency, and everyone will add more to the build till it's 20 minutes again.
To rephrase this in a way that was not your intention: "developer laziness would only increase if the company paid for the beefier CI".
And I think that's often the crux of the discussion: at the base of it there's a perceived neglect and lack of effort, and it's only once all possible efforts have obviously been made that money should be allocated.
That's where it feels less like a cut-and-dried ROI calculation, and there's a moral judgement aspect that is baked in but not often properly surfaced.
20 minutes is nothing. Some orgs have CI times where it takes hours or even days. And I'm pretty sure that all orgs go through the same set of tradeoffs and decisions, it's not unique to yours.
I was on the other side of that a few years ago as a tech lead/manager, and of course the team complained about CI speed because everyone always does in every software company. We staffed a team to work on build time improvements. It was the sensible choice. Why not just throw money at it? Well, because:
a. We'd already done that, more than once.
b. The tests weren't parallelized over multiple machines in a single test run anyway.
c. There was a lot of low hanging fruit to make things faster.
d. Developer time is not in fact infinitely expensive compared to cloud costs.
CI can easily turn into a furnace that burns infinite amounts of cash if you let it. The devs who set it up want to use the cloud because that's hip and easy and you can get new hardware without doing any planning, but cloud hardware comes at a high premium over the actual hardware. Optimizing build times feels non-productive compared to working on features or bugs. Also, test times just expand to fill available capacity. Why optimize, even a little bit, when you aren't paying for the hardware yourself? Better to close a ticket and move on to the next, which is visible to managers and will look impressive to them. CI time, in contrast, is a commons, and that leads to tragedy: everyone complains, but nobody fixes things because the incentives are misaligned.
There are often some quick wins. For non-urgent, relatively stable workloads like CI it makes more sense to use dedicated hardware, since the instant scaling of the cloud isn't that important, but to a lot of devs (especially younger ones?) that looks like obstructionist conservatism. They'd rather add lots of complexity to try to auto-scale workers, shut them down at night, etc. Maybe dedicated machines are coming back into fashion now that the cloud premium is getting to absurd levels; I see more talk about Hetzner than I used to. At my new company the CI server runs on Hetzner plus our own hardware, and that works fine. It also has the advantage that the hardware itself is a lot faster, because it's not being overcommitted by the cloud vendor, so builds and tests just magically get faster and (just as importantly) performance gets more predictable.
In other cases, enabling caching and fixing whatever bugs that reveals can also be a big win. Again, it can make sense, especially if this work can be assigned to junior devs.