OR there wasn't backpressure on a cascading failover, so as services failed, traffic failed over onto systems that were already more and more overloaded
OR there WAS backpressure and it was the luck of the draw whether you were queued into an error page or got good data (a rough sketch of that kind of load shedding follows below)
OR the autoscaling couldn't keep up with the onsale window. This used to happen in ticketing a lot. Ticketmaster has a talk somewhere about pre-warming capacity and server caches in anticipation of big onsales, because the time it took to autoscale was just too long.
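To make the backpressure idea concrete, here's a minimal load-shedding sketch in Python (names and numbers like MAX_QUEUE_DEPTH are invented for illustration; this isn't anyone's actual ticketing or Amazon code): a bounded queue rejects new work with a fast error once it fills, so overload becomes error pages for the unlucky instead of a pile-up that cascades downstream.

    # Minimal load-shedding sketch (assumed names and numbers, not any real service's code).
    # The bounded queue is the backpressure: when it's full, fail fast with an error page
    # instead of letting requests pile up and topple downstream services.
    import queue
    import threading
    import time

    MAX_QUEUE_DEPTH = 100                # assumption: sized to what workers can drain in time
    requests = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

    def handle(request):
        try:
            requests.put_nowait(request)     # reject immediately if the queue is full
            return "202 queued"              # you got a slot - the "good data" path
        except queue.Full:
            return "503 try again later"     # fast, cheap error instead of a slow collapse

    def worker():
        while True:
            requests.get()                   # take the next queued request
            time.sleep(0.05)                 # stand-in for the real (slow) backend work
            requests.task_done()

    threading.Thread(target=worker, daemon=True).start()

    for i in range(150):
        print(handle(f"request-{i}"))        # early requests queue, later ones mostly get shed

Without that bound you get the first case: every retry lands on an already-overloaded service and the failure cascades. With it you get the second: whether you see an error page really is just whether your request arrived before the queue filled.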
The systems at Amazon are too large to leverage autoscaling. When I was there (in Marketplace) it was generally estimated that Amazon used about 80X the capacity of their entire AWS public cloud.
I find that implausible. I know they run a huge infrastructure, but why would they be using 80X the capacity of their AWS public cloud? Thousands of companies, if not millions at this stage, use AWS, and some of them are not insignificant in size.
> Amazon used about 80X the capacity of their entire AWS public cloud
which is probably closer to "the capacity that is available via AWS is a tiny, tiny fraction of their overall computing power... therefore adding it back in when things are falling over doesn't actually solve any problems."
I think they are using the term 'capacity' to mean 'spare capacity'. I.e. that Amazon's entire compute usage is 80x the spare capacity, so scaling even a small amount would consume any spare capacity in AWS. Still, it seems hard to believe.
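On that reading the arithmetic at least hangs together, even if the 80x figure itself is unverifiable. A toy back-of-envelope, with numbers invented purely to illustrate the ratio (not real Amazon/AWS figures):

    # Toy numbers only - the point is the ratio, not the absolute values.
    aws_spare_capacity = 1.0          # all idle headroom in the AWS public cloud
    amazon_retail_usage = 80.0        # Amazon's own compute, per the 80x claim
    extra_headroom = aws_spare_capacity / amazon_retail_usage
    print(f"Absorbing all of AWS's spare capacity adds ~{extra_headroom:.2%} headroom")
    # ~1.25% - at that ratio, autoscaling into the public cloud barely moves the needle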
Two possibilities:
1. He misspoke and meant "80%" of the AWS capacity, which I agree seems implausible.
2. Amazon does not run on AWS because Amazon's own footprint is 80x all of AWS's infrastructure. This also seems implausible, given that Netflix alone runs on AWS. In fact, there's an article out there that said AWS exceeded Amazon's own capacity within a quarter!
I still don't understand what that has to do with autoscaling exactly
I didn’t understand it either. I suspect it’s the semantic difference between ‘public cloud’ and ‘infrastructure’, though I don’t know what that difference really is.