Does your company make Docker images? They start at somewhere around 1GB apiece. If, say, you use them for testing, you will have to create roughly one image per commit in your repository. 1K commits per project -- that's about the output of a small (<5 person) team working for under a year. We are already at 1TB. Now, these images generate logs when run... Some of them may generate entire test datasets... And we haven't even touched the data we actually want to store yet!
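Just to make the back-of-the-envelope math explicit (the numbers are the rough assumptions from above, nothing measured):

```
# Rough estimate of per-project image storage.
# All numbers are assumptions taken from the paragraph above.
image_size_gb = 1           # ~1 GB per Docker image
commits_per_project = 1000  # ~1K commits: a <5 person team, under a year
images_per_commit = 1       # one test image built per commit

total_gb = image_size_gb * commits_per_project * images_per_commit
print(f"~{total_gb / 1024:.1f} TB of images per project")  # ~1.0 TB

# And that's before run logs, generated test datasets, or the data
# you actually wanted to store in the first place.
```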
Now, consider this:
You need to store your Docker images somewhere, right? The registry. This registry needs backups every now and then, otherwise... well, one day you'll lose data, perhaps the entire registry full of images.
So, you might think incremental snapshots are the answer, but then there's a problem with incremental snapshots: what if one of them is damaged? So, you'd want independent snapshots... And then you want to store them in something that's resilient to failure and can have its parts replaced on demand -- something like RAID5 -- and that also adds something like 30% to your storage.
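Roughly how that stacks up, assuming, say, four full snapshots retained and a 4-disk RAID5 group (both numbers made up for illustration):

```
# Crude estimate of what "independent snapshots + parity" costs on top
# of the live registry. Retention policy and disk count are assumptions.
registry_tb = 1.0          # live registry size from the example above
independent_snapshots = 4  # keep, say, 4 full (non-incremental) snapshots
raid_disks = 4             # RAID5: one disk's worth of parity per group

raw_data_tb = registry_tb * (1 + independent_snapshots)
parity_overhead = 1 / (raid_disks - 1)   # ~33% extra on a 4-disk RAID5
total_raw_tb = raw_data_tb * (1 + parity_overhead)

print(f"data + snapshots: {raw_data_tb:.1f} TB")   # 5.0 TB
print(f"with parity:      {total_raw_tb:.1f} TB")  # ~6.7 TB
```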
Another thing to consider: what if whatever you want to store isn't composed of independent images that can be reproduced on demand... what if it's, say, a Web server with a session? Well, then you might need consistency groups that store your container's state together with the state of the database and whatever else it is using across, potentially, multiple machines... Oy vey! Now we are looking into buying an enterprise-level storage system from somebody like NetApp... We are probably spending a six-figure sum yearly on just this single Web server with a database sharded across three servers... And we still don't have geographical redundancy for contingencies like a datacenter being hit by a tsunami, etc.
If you only have a 5-person team of SDEs, then your company is among the "small businesses" that might not be able to store everything like I mentioned.
Also, if you aren't deduping Docker images by layers, you're doing it very very wrong.
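To see why layer dedup is such a big deal: registries store layers as content-addressed blobs, so if every per-commit image is just a thin layer on top of a shared base, the base gets stored once. Toy numbers here, just to show the ratio:

```
# Toy illustration of layer-level dedup in a registry.
# Sizes are made up; the point is the ratio, not the absolute numbers.
base_layers_mb = 900  # OS + runtime + dependencies, shared by every image
app_layer_mb = 50     # the part that actually changes per commit
commits = 1000

naive_gb = commits * (base_layers_mb + app_layer_mb) / 1024    # no dedup
deduped_gb = (base_layers_mb + commits * app_layer_mb) / 1024  # base stored once

print(f"no dedup:   ~{naive_gb:.0f} GB")    # ~928 GB
print(f"with dedup: ~{deduped_gb:.0f} GB")  # ~50 GB
```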
Where did I say that the company has only 5 people? I gave it as a unit of measurement. Typically, hierarchies in software companies are built from small-ish teams; in my experience, 5 would be the average team size. While there's usually some way to aggregate resources across teams, largely, the resources a team needs are independent of other teams'.
In other words, if one team uses X resources, then, roughly, two teams will use 2X resources. But resources per programmer don't make sense, in the same way man-hours don't (because you end up with fractional people).
And the truth is that even though large businesses have the potential to save through aggregation, developing this ability is yet another cost they need to pay for the whole time the business is running. And the more efficient they want this aggregation to be, the more they need to invest into it. Which means that large businesses are bound to be more wasteful than small businesses.
So, to give a more concrete example of why resource waste tends to take this form:
for N teams, where resource use would be X per team, the cumulative resource use is (N + e)X, where e is small but is a function of N -- perhaps something like log(N).
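A quick numeric sketch of that model (the log(N) form of the overhead is just my guess from above, not measured data):

```
import math

# Cumulative use for N teams under the "(N + e) * X" model, with e ~ log(N).
# X and the log(N) overhead term are assumptions from the comment above.
X = 1.0  # resource use of a single team, in arbitrary units

for n in (1, 2, 10, 100, 1000):
    e = math.log(n)        # aggregation overhead, assumed to grow like log(N)
    total = (n + e) * X
    print(f"N={n:5d}  total={total:8.1f}  extra vs. N*X: {e * X:5.1f} team-equivalents")
```

The absolute overhead keeps growing with N; it never drops to zero, it just grows slower than the number of teams.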
Suppose your company's artifacts are PyPI-style Python packages (hopefully, wheels). For a very small company, you might get away with just having a Git repo with your code and not even producing a wheel: just do code checkouts, add the current source to PYTHONPATH, and you are good to go.
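A minimal sketch of that "no packaging at all" stage; the path and package name are hypothetical:

```
# Point the interpreter at the checkout instead of building a wheel.
# The path and "ourproject" package name are made up for illustration.
import sys
from pathlib import Path

REPO_CHECKOUT = Path.home() / "src" / "ourproject"  # plain `git clone` location
sys.path.insert(0, str(REPO_CHECKOUT))              # same effect as PYTHONPATH=...

import ourproject  # importable with no wheel ever built
```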
Suppose your company now grows a second team that needs to use the code written by the first team. Well, now they need to agree on their artifacts -- the format, the location -- so that the two teams can independently create and access artifacts. The packages are relatively small and don't require much auditing of external dependencies, though, so they simply pay for a small cloud instance to store their packages (e.g. use GitHub releases, or foundries, or whatever it's called).
Suppose your company grows to ten or so teams. Now you start getting conflicting requirements between individual artifacts of different teams; you can no longer use the simple foundries provided by services like GitHub to store your artifacts, because you discover you also need patched versions of third-party dependencies, and you have increased audit requirements. So you buy a service like Artifactory, where you store and mirror a whole lot of Python packages you don't develop yourself.
And as you grow bigger, you'll realize that solutions like Artifactory also have their limits. At some huge scale, you'll get into the business of running your own datacenters, and, perhaps, even your own power plants to power those datacenters. At each level of this growth path, you'll have to adjust your infrastructure to generate less waste than if you had kept the methodology of the previous level plus whatever infrastructure is needed to bridge between the groups. However, each such step never completely eliminates the price of aggregation; it simply attempts to make it manageable.
----
In other words: the price of servicing 1GB of storage is lower for Google than it would be for a small startup, but Google also has a lot more storage to manage per team than a small startup does. This is why someone like Google is in a good position to sell storage-management services to small shops, for example.