Does your company make Docker images? They start at somewhere around 1GB apiece. If, say, you use them for testing, you will have to create roughly one image per commit in your repository. 1K commits per project -- that's about the output of a small (<5 person) team working for under a year. We are already at 1TB. Now, these images generate logs when run... Some of them may generate entire test datasets... And we haven't even touched the data we actually want to store yet!
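Just to make the back-of-the-envelope math explicit (the numbers are the rough assumptions from above, nothing measured):

```
# Rough estimate of per-project image storage.
# All numbers are assumptions taken from the paragraph above.
image_size_gb = 1           # ~1 GB per Docker image
commits_per_project = 1000  # ~1K commits: a <5 person team, under a year
images_per_commit = 1       # one test image built per commit

total_gb = image_size_gb * commits_per_project * images_per_commit
print(f"~{total_gb / 1024:.1f} TB of images per project")  # ~1.0 TB

# And that's before run logs, generated test datasets, or the data
# you actually wanted to store in the first place.
```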
Now, consider this:
You need to store your Docker images somewhere, right? The registry. This registry needs backups every now and then, otherwise... well, one day you'll lose data, perhaps the entire registry full of images.
So, you might think incremental snapshots are the answer, but then there's a problem with incremental snapshots: what if one of them is damaged? So, you'd want independent snapshots... And then you want to store them in something that's resilient to failure and can have its parts replaced on demand -- something like RAID5 -- and that also adds something like 30% to your storage.
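Roughly how that stacks up, assuming, say, four full snapshots retained and a 4-disk RAID5 group (both numbers made up for illustration):

```
# Crude estimate of what "independent snapshots + parity" costs on top
# of the live registry. Retention policy and disk count are assumptions.
registry_tb = 1.0          # live registry size from the example above
independent_snapshots = 4  # keep, say, 4 full (non-incremental) snapshots
raid_disks = 4             # RAID5: one disk's worth of parity per group

raw_data_tb = registry_tb * (1 + independent_snapshots)
parity_overhead = 1 / (raid_disks - 1)   # ~33% extra on a 4-disk RAID5
total_raw_tb = raw_data_tb * (1 + parity_overhead)

print(f"data + snapshots: {raw_data_tb:.1f} TB")   # 5.0 TB
print(f"with parity:      {total_raw_tb:.1f} TB")  # ~6.7 TB
```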
Another thing to consider: what if whatever you want to store isn't composed of independent images that can be reproduced on demand... what if it's, say, a Web server with a session? Well, then you might need consistency groups that store your container's state together with the state of the database and whatever else it is using across, potentially, multiple machines... Oy vey! Now we are looking into buying an enterprise-level storage system from somebody like NetApp... We are probably spending a six-figure sum yearly on just this single Web server with a database sharded across three servers... And we still don't have geographical redundancy for contingencies like a datacenter being hit by a tsunami, etc.
If you only have a 5-person team of SDEs, then your company is among the "small businesses" that might not be able to store everything like I mentioned.
Also, if you aren't deduping Docker images by layers, you're doing it very very wrong.
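To see why layer dedup is such a big deal: registries store layers as content-addressed blobs, so if every per-commit image is just a thin layer on top of a shared base, the base gets stored once. Toy numbers here, just to show the ratio:

```
# Toy illustration of layer-level dedup in a registry.
# Sizes are made up; the point is the ratio, not the absolute numbers.
base_layers_mb = 900  # OS + runtime + dependencies, shared by every image
app_layer_mb = 50     # the part that actually changes per commit
commits = 1000

naive_gb = commits * (base_layers_mb + app_layer_mb) / 1024    # no dedup
deduped_gb = (base_layers_mb + commits * app_layer_mb) / 1024  # base stored once

print(f"no dedup:   ~{naive_gb:.0f} GB")    # ~928 GB
print(f"with dedup: ~{deduped_gb:.0f} GB")  # ~50 GB
```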
Where did I say that the company has only 5 people? I gave it as a unit of measurement. Typically, hierarchies in software companies are built from small-ish teams; in my experience, 5 would be the average team size. While there's usually some way to aggregate resources across teams, largely, the resources a team needs are independent of other teams'.
In other words, if one team uses X resources, then, roughly, two teams will use 2X resources. But resources per programmer don't make sense, in the same way man-hours don't (because you end up with fractional people).
And the truth is that even though large businesses have the potential to save through aggregation, developing this ability is yet another cost they need to pay for the whole time the business is running. And the more efficient they want this aggregation to be, the more they need to invest into it. Which means that large businesses are bound to be more wasteful than small businesses.
So, to give a more concrete example of why resource waste tends to take this form:
for N teams, where resource use would be X per team, the cumulative resource use is (N + e)X, where e is small but is a function of N -- perhaps something like log(N).
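A quick numeric sketch of that model (the log(N) form of the overhead is just my guess from above, not measured data):

```
import math

# Cumulative use for N teams under the "(N + e) * X" model, with e ~ log(N).
# X and the log(N) overhead term are assumptions from the comment above.
X = 1.0  # resource use of a single team, in arbitrary units

for n in (1, 2, 10, 100, 1000):
    e = math.log(n)        # aggregation overhead, assumed to grow like log(N)
    total = (n + e) * X
    print(f"N={n:5d}  total={total:8.1f}  extra vs. N*X: {e * X:5.1f} team-equivalents")
```

The absolute overhead keeps growing with N; it never drops to zero, it just grows slower than the number of teams.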
Suppose your company's artifacts are PyPI-style Python packages (hopefully, wheels). For a very small company, you might get away with just having a Git repo with your code and not even producing a wheel: just do code checkouts, add the current source to PYTHONPATH, and you are good to go.
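A minimal sketch of that "no packaging at all" stage; the path and package name are hypothetical:

```
# Point the interpreter at the checkout instead of building a wheel.
# The path and "ourproject" package name are made up for illustration.
import sys
from pathlib import Path

REPO_CHECKOUT = Path.home() / "src" / "ourproject"  # plain `git clone` location
sys.path.insert(0, str(REPO_CHECKOUT))              # same effect as PYTHONPATH=...

import ourproject  # importable with no wheel ever built
```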
Suppose your company now grows a second team that needs to use the code written by the first team. Well, now they need to agree on their artifacts -- the format, the location -- so that the two teams can independently create and access artifacts. The packages are relatively small and don't require much auditing of external dependencies, though, so they simply pay for a small cloud instance to store their packages (e.g. use GitHub releases, or foundries, or whatever it's called).
Suppose your company grows to ten or so teams. Now you start getting conflicting requirements between individual artifacts of different teams; you can no longer use the simple foundries provided by services like GitHub to store your artifacts, because you discover you also need patched versions of third-party dependencies, and you have increased audit requirements. So you buy a service like Artifactory, where you store and mirror a whole lot of Python packages you don't develop yourself.
And as you grow bigger, you'll realize that solutions like Artifactory also have their limits. At some huge scale, you'll get into the business of running your own datacenters, and, perhaps, even your own power plants to power those datacenters. At each level of this growth path, you'll have to adjust your infrastructure to generate less waste than if you had kept the methodology of the previous level plus whatever infrastructure is needed to bridge between the groups. However, each such step never completely eliminates the price of aggregation; it simply attempts to make it manageable.
----
In other words: the price of servicing 1GB of storage is lower for Google than it would be for a small startup, but Google also has a lot more storage to manage per team than a small startup does. This is why someone like Google is in a good position to sell storage-management services to small shops, for example.