There was a time when AWS was truly innovative, but it’s long since transformed into Amazon’s cash cow and is behaving as such.
Innovation has ground to a halt, with mostly just meh “hey, us too” launches. Pricing and design patterns feel increasingly focused on locking you in. AWS folks tell me that internally they talk a lot about making sure things are “sticky” with customers. The best engineering talent no longer wants to work there, and it shows, especially in places like AI, where AWS has just released wave after wave of discombobulated nonsense.
As a core “rent-a-server” offering with a few add-on services there’s still a lot of utility, but AWS is gradually becoming a boring baseline utility with a ton of distracting, half-baked stuff jammed on top. Most companies I talk to are no longer focused on a single cloud and are increasingly bringing a lot of workloads back on-prem or into colos. Not everything, but for a lot of stuff that just makes more sense and is a heck of a lot cheaper.
The chips business in Annapurna is probably the most interesting thing, and it plays to AWS’s strength in boring, low-level infrastructure. Nearly everything AWS tries to do beyond chips and rent-a-server plays is a hot mess.
AWS isn’t going away, but its future looks a lot less exciting and inspiring than the story that got us to this point.
AWS’s US-East 1 continues to be the Achilles heel of the Internet.
And while, yes, building across multiple regions and AZs is a thing, AWS has had a string of incidents where us-east-1 problems had broader impacts, which makes things far less redundant and resilient than AWS implies.
The idea that AWS's services are fully regionalized or isolated has always been a myth.
All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.
And IAM is not a wholly independent software stack: it relies on DynamoDB and a few other services, which in turn have a circular dependency on IAM.
During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.
Anyone who thinks one cloud provider will provide them full resilience is fooling themselves. You need multicloud for true high availability.
But then you want to use the same stack across providers, and all the proprietary technologies (even those hidden from you by things like Terraform) suddenly lose their luster.
What people usually mean is “resilience up to a reasonable level of risk and cost”.
Multi-cloud simply isn’t cost-beneficial for 99.9% of problems.
And for a lot of businesses who talk about risk, saying “we followed AWS best practices but AWS went down” is an acceptable answer to the question of liability.
If you are in a position where AWS going down is a reasonable risk, then you’re already in a specialised enough domain to have engineers who understand how to deliver HA across different vendors.
Yes, I'm certainly aware of the other partitions. That's why I said all the public cloud regions outside China.
Yeah, "govcloud" is technically available to the public, although there are other partitions reserved for government use that are not, and the naming is a big hairy mess. Many service teams don't have any US-citizens-in-the-USA working for them, and they cannot in any way adequately support these regions.
My on-call experience improved significantly when I moved from the US to Canada, and I got taken off the (extremely thin!) list of engineers eligible to ssh into RDS instances in Govcloud. There were so few USA-citizen-in-USA engineers that I had been getting tickets for services and instances in Govcloud about which I had only the very thinnest knowledge… and then I was limited in my ability to consult with others who were actually experts. The customers in Govcloud paid a premium to be there, I got paged for a bunch of tickets which I was ill-prepared to handle, and it was generally a bad experience for everyone.
Working with the airgapped secret/top-secret partitions was even worse. You would get paged incessantly and then someone who was cleared for access but knew almost nothing about the service in question would have to go to a SCIF in the DC area, and you would exchange screenshots and text instructions with a turnaround time of hours or days.
They’re talking about the backbone and what goes on behind the scenes. There have been issues with services in other regions when us-east-1 has issues.
Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.
It's basically what leads to extended downtime almost every time. There are just some things in the stack that are still single points of failure, and when they fail it's a mess.
Sometimes the circular dependencies get almost cartoonishly silly.
Like, "One of the two guys who has the physical keys to the server cage in us-east-1 is on vacation. The other one can't get into his apartment because his smart lock runs into the AWS cloud. So he hires a locksmith, but the locksmith takes an extra two hours to do the job because his reference documents for this model of lock live on an S3 bucket."
We had a pair of machines, and some bright spark set them up to mount each other's NFS shares. After a power outage: "Holy mother of chicken-and-egg NFS hangs, Batman!"
That was a weird job, but fun. It was a local machine room for a warehouse that originally held the IBM mainframe; it still held its successor, the Multiprise 3000, which has the claim to fame of being the smallest mainframe IBM ever sold. But now the room was also full of decades of artisanally crafted Unix servers with Pick databases. The Pick dev team had done most of the system architecture. The best way to understand it is that, for them, Pick is the operating system; Unix is a necessary annoyance they put up with only because nobody has made Pick hardware for 20 years. And it was NFS mounts everywhere: somebody had figured out a trick where they could NFS-mount a remote machine and have the local Pick system reach in and scrounge through the remote system's data, but strictly read-only. Pick got grumpy when writing to NFS, to say nothing of how the other database would feel about having its data messed with. Thus the circular mount.
Still, that was not the worst thing I saw. I liked the one system with an SMB mount. "Why is this one SMB?" "Well, Pick complains when you try to write to an NFS mount, but its NFS detection code doesn't trip on SMB mounts." ... Sighs. "Um... I'm no Pick expert, but you know why it doesn't like remote mounts, right? SMB doesn't change that. Do you happen to get a lot of corrupt indexes on this machine?"
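A minimal sketch of what that circular mount looks like, assuming plain /etc/fstab entries (hostnames and paths are made up):

    # /etc/fstab on host-a (exports /data/a, mounts host-b's share)
    host-b:/data/b   /mnt/b   nfs   defaults   0 0

    # /etc/fstab on host-b (exports /data/b, mounts host-a's share)
    host-a:/data/a   /mnt/a   nfs   defaults   0 0

    # After a simultaneous power outage, each machine's boot blocks waiting
    # for the other's NFS server to come up: the chicken-and-egg hang.
    # Options like "bg" (retry the mount in the background), or on modern
    # systems "nofail,x-systemd.automount", let boot proceed and defer the
    # mount instead.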
"yes, how did you know"
Oh, yeah, re-exporting NFS mounts via SMB was very much a thing in the early 2000s - something to do with their different approaches to flock() vs fcntl() handling. If you ran into locking issues with NFS, re-exporting via SMB was standard advice.
At some point, the behaviour changed and locks started conflicting. IIRC, we hit it when upgrading to Debian Etch and took the time to unwind the system and make pure NFS work properly for us. Plenty of people took the opposite approach and fiddled with the config to make locking a no-op on SMB. I know of at least one web hosting company that ended up having to restore a year's worth of customer uploads from backups as a result...
> Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.
There was one (later denied) report that a 'guy with an angle grinder' was involved in gaining access to the server cage.
Why would such a critical server even be accessible with only one set of keys?
I’ve always thought mission-critical stuff needs two independent key holders, with keyholes placed far enough apart to make it impossible for one person to reach both.
I don’t know how it is in the datacentre industry, but that’s certainly how it’s done in other industries for anything truly mission-critical and easily tampered with.
I guess it shows very few care enough to pay enough to make that a reasonable upgrade.
They're not actually accessible with 'only one set of keys' in my experience.
You actually have to present your photo ID at the site entry gatehouse, then again to the building entry guard (who will also check you have a work permit and a site-specific safety induction) then you swipe a badge at a turnstile to get from reception into the stairwell, then swipe your badge at a door to get into the relevant floor, then swipe your badge and key in a code to enter the room with the cages then you use the key.
A circular dependency and a single point of failure are not the same thing. If I have a single point of failure and it is down, I fix it and things work again. If I have a circular dependency, there is no obvious way to fix anything that is broken any longer.
Amazon's equivalent of this (sort of) was the "andon cord."
Not only was the physical metaphor that led to this name never properly explained (it's basically an "emergency stop" string in a Toyota factory), but the actual use of this mechanism was so heavily discouraged that I never saw it used in 4+ years at Amazon, except once, very performatively, by a VP who had already been paged awake at 2am or something like that.
In my experience, a lot of the AWS engineers live in continuous fear of screwing up by using the huge array of extremely powerful, dangerous, poorly-explained, and ever-changing tools that they have access to.
> The idea that AWS's services are fully regionalized or isolated has always been a myth.
This is highly misleading. It's true that there's a handful of global AWS services - but only their control planes operate from a single region (e.g. us-east-1). Their data planes are regionally isolated or globally distributed.[1]
The only time you'd normally use a service control plane is to deploy changes, e.g. when you create, read, update or delete service resources or update configuration during a change window.
Workloads should be designed for "static stability", as recommended by AWS.[2] A statically stable workload only depends upon the data planes of the services it uses at runtime. Statically stable workloads are designed to continue operating as normal even if there's a service event impairing one or more control planes (including for global services).
> During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.
This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region [3]. The IAM data plane, which enforces access control, is also regional.
If the IAM control plane is impaired, you might not be able to create new IAM roles (a control plane operation) - but you can continue generating and using temporary credentials for existing IAM roles (data plane operations) within the region your workload is running in. This allows statically stable workloads to continue using IAM without interruption.
"Global AWS services still follow the conventional AWS design pattern of separating the control plane and data plane in order to achieve static stability. The significant difference for most global services is that their control plane is hosted in a single AWS Region, while their data plane is globally distributed."
"...eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads."
You're right of course to distinguish the control plane and data plane, and it sounds like you know more about this than I do for IAM.
I disagree, though, that my post was "highly misleading" despite this omission.
As a practical matter, some services fail to achieve the "static stability" you describe, in terms of not depending on other services’ control planes.
And also, many on-call ops and firefighting tasks (to say nothing of canaries and other automated tests) depend on other services’ control planes.
And above all, many AWS engineers (myself very much included even after years there) don't have a clear understanding of the boundaries of other services’ control planes. https://news.ycombinator.com/item?id=48078254
> > During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.
> This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region.
I didn't mention STS in the comment to which you're responding. The service that I worked on the most, RDS, required ssh'ing into live instances to solve basically all non-trivial problems (I'd guess 80% of the tickets I saw actually resolved required it). And I have no idea how STS was involved in generating the ephemeral Midway-signed ssh keys required for it… but whenever there were us-east-1 IAM outages we'd have big problems opening new sessions, while less-capable web-console-based ops tools with long-lived credentials would keep working.
People say this, but this was just a single AZ. In the last 3 years of running my startup mostly out of us-east-1, we've only had one regional outage, and even that was partial, with most instances unaffected.
And honestly, everybody else's stuff is in us-east-1, so at least your failures are correlated with your customers' lol.
Not OP, but I do single-region us-east-1 for a few reasons:
1. The severity and frequency of us-east-1 outages are vastly overstated. It's fine. These us-east-1 outages almost never affect us. This one didn't; not even our instances in the affected AZ. Only that recent IAM outage affected us a little bit, and it affected every other region, too, since IAM's control plane is centrally hosted in us-east-1. Everybody's uptime depends on us-east-1.
2. We're physically close to us-east-1 and have Direct Connect. We're 1 millisecond away from us-east-1. It would be silly to connect into us-east-1 and then hop over to another region, taking a latency hit and paying cross-region data transfer cost on all traffic. That would only make sense if we were in both regions, and that is not worth the cost given #1. If we only have a single region, it has to be us-east-1.
3. us-east-1 gets new features first. New AWS features are relevant to us with shocking regularity, and we get them as soon as they're announced.
4. OP is right about the safety in numbers. Our service isn't life-or-death; nobody will die if we're down, so it's just a matter of whether they're upset. When there is a us-east-1 outage, it's headline news and I can link the news report to anyone who asks. That genuinely absolves us every time. When we're down, everybody else is down, too.
Sometimes you need capacity and you have to choose where the capacity is not where you would like it to be. Unfortunately, the days of cloud bursting, and thinking of the cloud as an unlimited resource where you can spin up and spin down machines at will is vanishing. Power availability and supply chain lead times combined with unprecedented demand are the reason for this. That's why you see all the hyperscalers recently reporting on their "backlog" in their earnings reports.
It wasn’t even all of a single AZ. None of my resources in use1-az4 had any issues. The most annoying thing was the 20 notifications we got saying “it’s not all fixed yet” every hour.
In fantasy magic dream land loads are distributed evenly across different cloud providers.
A single point of failure doesn't exist.
It worked out with my first girlfriend. The twins are fluent in English and Korean. They know not to depend only on AWS when deploying a large-scale service.
Healthcare in the US is affordable.
All types of magical stuff exist here.
But no. It's another day. AWS US-East 1 can take down most of the internet.
Yes, they have. It just costs a shit ton of money and is extremely difficult to get the suits to sign off on TWO full 'cloud services' bills. It generally doubles your cost and workload and increases your uptime by a couple hours/year, assuming you don't have bugs in your deployment stack that affect one cloud or the other.
It's basically a wash for almost all organizations for twice the cost and effort.
Also, these things don't go down THAT often... well, AWS, not some others. More uptime than you probably had before. Even the stock market takes a few days off every decade. Just ask W.
Not really. Your clients can round-robin to connection points across providers and move write heads upon connection. If you worry about hard-coding, you can reduce the surface to a per-context first minimum contact point.
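A rough sketch of that idea, with hypothetical endpoints; a real implementation also needs proper health checks, per-provider timeouts, and a story for replication lag between providers:

    import random
    import urllib.request

    # Hypothetical write endpoints hosted with different providers.
    ENDPOINTS = [
        "https://api.aws.example.com",
        "https://api.gcp.example.com",
        "https://api.onprem.example.com",
    ]

    def pick_provider(path="/healthz", timeout=2):
        """Try providers in random order; return the first one that answers."""
        for base in random.sample(ENDPOINTS, len(ENDPOINTS)):
            try:
                urllib.request.urlopen(base + path, timeout=timeout)
                return base  # use this provider for subsequent writes
            except OSError:
                continue  # provider unreachable, try the next one
        raise RuntimeError("no provider reachable")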
Yes, you're right, but in my experience the boundary between the data plane and the control plane is not always clear, and especially unclear on these foundational and basic services.
There were enough "surprisingly control-plane" IAM operations in the AWS services that I dealt with, so we had to exercise extreme caution during outages.
Even if I were the stupidest and least curious engineer around (and I was far from it), that's basically irrelevant to what you're scolding me for here…
As part of a team with both software development and operational responsibilities, like most teams at AWS, I had to deal not only with the consequences of my own imperfect knowledge, but also with the imperfect knowledge of my coworkers past and present.
Anecdotally (well, more "second-hand I heard that..."), it sounds like there were some knock-on effects on us-east-2 as a result of people migrating over from us-east-1. So, yeah... kinda hilarious how the multiple-region/AZ thing is so plainly a façade, yet we all seem to collectively believe in it as an article of faith in the Cloud Religion... or whatever.
One of the SRE tricks is to reserve your capacity so when the cloud runs out of capacity you're still covered. It's expensive, but you don't want to get stuck without a server when the on-demand dries up.
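For EC2, that trick looks roughly like an On-Demand Capacity Reservation. A boto3 sketch; the instance type, AZ, and count are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Reserve capacity ahead of time so instances can still launch when
    # on-demand capacity in the AZ is exhausted. You pay for the
    # reservation whether or not the instances are actually running.
    resp = ec2.create_capacity_reservation(
        InstanceType="m5.xlarge",        # placeholder
        InstancePlatform="Linux/UNIX",
        AvailabilityZone="us-east-1a",   # placeholder
        InstanceCount=4,                 # placeholder
        EndDateType="unlimited",
    )
    print(resp["CapacityReservation"]["CapacityReservationId"])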
It really is failing more, and it’s well known amongst industry experts. It’s the oldest, largest, and most utilized region of AWS.
I’ve heard people say that the underlying physical infrastructure is older, but I think that’s a bit of speculation, although reasonable. The current outage is attributed to a “thermal event”, which does indeed suggest underlying physical hardware.
It’s also the most complex region for AWS themselves, as it’s the control plane for many of their global services.
Most of the other regions are fairly stable. Ohio (us-east-2) is a great choice if you're just starting out. Not sure about ca-central-1, but I've never heard anything bad about it.
It wasn't heavily utilized when I worked at AWS, which was until 2024.
If your customers are clustered in Toronto and Montreal, it probably makes a lot of sense to use ca-central-1. If you've got a lot of customers in Western Canada, us-west-2 is gonna have better network latency.
Other than a couple regions that had problems with their local network infrastructure (sa-east-1 was like that), there's little or nothing to differentiate the regions in terms of physical infrastructure and architecture.
> building across multiple regions and AZs is a thing
If you do this for resiliency, be prepared to pay the capacity tax (2 regions means 2x capacity, 3 regions means 1.5x), have the machines already running in a multi-region setup (don't expect to be able to spin up instances or even get capacity during an outage), and ready to deal with the added complexity of multi-region hosting.
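The capacity tax follows from needing to absorb a full region's load when one region drops out: with N active regions, each must be provisioned for 1/(N-1) of total traffic, so aggregate capacity is N/(N-1) times what a single region would need. A quick illustration:

    def overprovision_factor(regions: int) -> float:
        """Total capacity needed (as a multiple of steady-state load)
        to survive the loss of any one region."""
        return regions / (regions - 1)

    for n in (2, 3, 4):
        print(n, "regions ->", f"{overprovision_factor(n):.2f}x capacity")
    # 2 regions -> 2.00x, 3 regions -> 1.50x, 4 regions -> 1.33x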
There’s all kinds of fun pitfalls with multi-AZ. Like, you can create RDS subnet groups across multiple AZs, but then you can’t remove an AZ. Which really sucks when your core database covers all 5 us-east-1 AZs and randomly can’t fail over because you picked an instance type that use1-az4 can’t host.
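One way to catch that mismatch ahead of time is to check which AZs actually offer the instance class you've picked and compare against the AZs in your DB subnet group. A rough boto3 sketch; the engine and instance class are placeholders:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # List the AZs that can host this engine / instance-class combination.
    pages = rds.get_paginator("describe_orderable_db_instance_options").paginate(
        Engine="postgres",                 # placeholder
        DBInstanceClass="db.r6g.2xlarge",  # placeholder
    )
    azs = set()
    for page in pages:
        for opt in page["OrderableDBInstanceOptions"]:
            azs.update(az["Name"] for az in opt["AvailabilityZones"])
    print(sorted(azs))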
I've always been impressed by Amazon's ability to present the shittiest experience possible and imply the blame is with things like isolation that they don't really provide.
I know 4 people who’ve worked for them, two on the same team, one who’s moved around various offices over a long period of time, and one who used to work for me but left to go there (and with the offer they were made I absolutely don’t blame them - we’d been hit hard by COVID and were in the midst of a salary freeze).
The first three of those people seem to have got on well there and mostly enjoyed it. The fourth had a miserable 8 months. Their manager was based in a different office with a 5+ hour time difference and was a complete nightmare, and they left of their own accord to another job on about the same money but without all the extra hours, stress, and terrible management.
I’m guessing Amazon is like other big companies: the quality of your experience will depend to a very large extent on your manager.
I've worked with some Amazon fanboys who'd rave about being "bar raisers" and other assorted nonsense, trying to impress Amazon-derived "leadership principles" upon much smaller organizations. It left a very bad taste in my mouth.
Well, FAANG is now MANGO, and yes, Amazon dropped out of the top tier of tech companies per these acronyms. There are a few other acronyms out there gaining popularity which also now exclude Amazon as a top-tier tech company.
Amazon is successful on the boring utility stuff (logistics, building data centers) but is broadly seen as unable to execute on higher value add things, which keeps it out of the top tier. The AI misses really highlighted that.
Isn’t AWS a massive value add operation on top of what is otherwise just rental servers?
It isn’t sexy, but they are selling their proprietary technology to just about everyone. That and their market cap puts them in league with the rest of the big boys.
Seems no CEO simply wants to say the company is underperforming, we hired too many people, and now we’re resetting. But it’s clear that’s what happened with nearly all of these layoffs.
None of these announcements provide any convincing evidence that AI is anything more than a convenient distraction from the real reasons for the layoffs.
Vague overtones on AI savings without any hard evidence that’s happening, while ignoring obvious evidence that the company over-hired and is now underperforming relative to what would be needed to justify all that headcount.
Nobody believes these narratives at this point, and CEOs would garner a lot more respect if they simply said:
“I screwed up, we hired too many people and fell short of performance targets. I own that. This resets us back on track.”
I’d have a lot more respect and the whole thing would be a hell of a lot easier if each person just picked up after themselves and there was no herculean cleanup required
There are some improvements on coding and speed of developers, but more broadly in the enterprise AI is just producing a lot of slop that folks are getting fed up with.
AI content has a look and feel people sense immediately.
It’s amazing to see how quickly things shifted from “wow this is so cool, AI is going to change everything” to folks calling out “you lazy bum, this just looks like some slop you threw together with AI… let’s get some real thinking please.”
We are firmly heading into “trough of disillusionment” territory on the hype cycle.
Not really… the rules were heavily influenced by big tech in a way that basically exempts many devices. For example iPhone 15 onwards already meets the defined standard and thus doesn’t need a user replaceable battery.
So the headline is misleading. Removable batteries aren’t mandatory. They’re only mandatory if the battery fails to meet certain performance standards.
Based on other comments, there are apparently two separate sets of rules. One of them has exceptions and applies specifically to phones and tablets, and one of them doesn't have any exceptions and applies more generally.
This whole prediction market space seems like a 2026 version of ball and cup game betting. Most of the people participating fail to understand that they (and their pointless bets up for harvesting) are the product here.
My intuition is that it is a lot like sports betting: many laypeople bet for what they hope will happen, rather than trying to beat the market earnestly.
The winners, as you point out, are the house and those with insider knowledge.
Most airports have some form of this; it’s just not public or promoted, and is sort of an “if you know, you know” thing for special cases.
For example the airline can give you a “gate only” pass. Essentially you need somebody to sponsor you to be on the other side of the security gates. In this case the airport itself is openly offering to sponsor folks.