At AWS, the hierarchy of service priorities is crystal clear: Security, Durability, and Availability. In that order. Durability, the assurance that data will not be lost, is a cornerstone of trust, only surpassed by security. Availability, while important, can vary. Different customers have different needs. But security and durability? They're about trust. Lose that, and it's game over. In this regard, InfluxDB has unfortunately dropped the ball.
Deprecation of services is a common occurrence at AWS and many other tech companies. But it's never taken lightly. A mandatory step in this process is analyzing usage logs. We need to ensure customers have transitioned to the alternative. If they haven't, we reach out and try to understand why. The idea of simply "nuking" customer data without a viable alternative is unthinkable.
The InfluxDB incident brings to light the ongoing debate around soft vs. hard deletion. It's unacceptable for a hard delete to be the first step in any deprecation process. A clear escalation process is necessary: notify the customer, wait for explicit acknowledgement, disable their APIs for a short period, extend this period if necessary, soft delete for a certain period, notify again, and only then consider a hard delete.
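To make the shape of that ladder concrete, here's a minimal sketch in Python; the stage names and dwell times are my own illustrative assumptions, not any provider's actual policy:

```python
from enum import Enum
from datetime import timedelta

class DeprecationStage(Enum):
    NOTIFY = 1
    AWAIT_ACK = 2
    APIS_DISABLED = 3
    SOFT_DELETED = 4
    FINAL_NOTICE = 5
    HARD_DELETED = 6

# Minimum time to spend in each stage before escalating (illustrative values).
MIN_DWELL = {
    DeprecationStage.NOTIFY: timedelta(days=30),
    DeprecationStage.AWAIT_ACK: timedelta(days=30),
    DeprecationStage.APIS_DISABLED: timedelta(days=14),  # extend if the customer screams
    DeprecationStage.SOFT_DELETED: timedelta(days=90),   # data still recoverable here
    DeprecationStage.FINAL_NOTICE: timedelta(days=30),
}

def next_stage(stage: DeprecationStage, time_in_stage: timedelta, acked: bool) -> DeprecationStage:
    """Advance one step only when the dwell time has elapsed; never skip straight to hard delete."""
    if stage is DeprecationStage.HARD_DELETED:
        return stage
    if stage is DeprecationStage.AWAIT_ACK and not acked:
        return stage  # wait for explicit acknowledgement
    if time_in_stage < MIN_DWELL[stage]:
        return stage
    return DeprecationStage(stage.value + 1)
```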
The so-called ["scream test"](https://www.v-wiki.net/scream-test-meaning/) is not a viable strategy for a cloud service provider. Proactive communication and customer engagement are key.
This incident is a wake-up call. It underscores the importance of data durability and effective, respectful customer communication in cloud services and platform teams. Communication is more than three cover-your-ass emails; it's caring about your customers.
> Security, Durability, and Availability. In that order.
The ordering of security and durability very much depends on the needs of the customer.
Some data is vastly more valuable to malicious actors than it is to you, e.g. ephemeral private keys. If lost, you can simply replace them, but if (unknowingly) stolen, the result can be disastrous.
Other data is vastly more valuable to you than to malicious actors, e.g. photos of sentimental events.
> At AWS, the hierarchy of service priorities is crystal clear: Security, Durability, and Availability. In that order. Durability, the assurance that data will not be lost, is a cornerstone of trust, only surpassed by security. Availability, while important, can vary. Different customers have different needs. But security and durability? They're about trust. Lose that, and it's game over. In this regard, InfluxDB has unfortunately dropped the ball.
Interestingly, this is also how I'd allocate tasks to new admins. Like, sure, I'd rather have my load balancers running, but they are stateless and redeploy in a minute. The amount of damage you can do there in less critical environments is entirely acceptable for teaching experiences. Databases or filestores though? Oh boy. I'd rather have someone shadow for a bit first, because those are annoying to fix and failures will always cause some unrecoverable loss, even with everything we do to guard against it. Hourly incremental backups still lose up to 59 minutes of data if things go wrong.
> The InfluxDB incident brings to light the ongoing debate around soft vs. hard deletion. It's unacceptable for a hard delete to be the first step in any deprecation process. A clear escalation process is necessary: notify the customer, wait for explicit acknowledgement, disable their APIs for a short period, extend this period if necessary, soft delete for a certain period, notify again, and only then consider a hard delete.
Agreed. At work, I'm pushing for two processes: First, we need a process for deprecating a service and migrating customers to better services. This happens entirely at the product management and development level. Here you need to consider the value provided to the customer, how to provide it differently (better), and whether to fire customers if necessary. And afterwards, you need a well-controlled process to migrate customers to the new services, ideally supported by customer support or consultants. No one likes change, so at least make their change an improvement and not entirely annoying.
And then, if a system or an environment is not needed anymore, leadership can trigger a second process to actually remove the service. I maintain, however, that this second process is entirely operational, between support, operations and account management. It's their job to validate that the system is load-free (I like the electrician's term here), or that we're willing to accept dropping that load. And even then, if we just see a bunch of health checks on the systems by customers, you always do a scream test at that point and shut it down for a week, or cut DNS or such. And only then do you drop it.
It's very, very careful, I'm aware. But it's happened 3-4 times already that a large customer suddenly was like "Oh no, we forgot thingy X and now things are on fire and peeps internally are sharpening knives for the meeting, do anything!" And you'd be surprised how much goodwill and trust you can get as a vendor by being able to bring back that thing in a few minutes. Even if you then have to burn it to turn up the heat and get them off of that service, since it'll be around forever otherwise.
While HTTP/3 does provide some improvements, it's clear that further optimizations are needed, particularly for users in Europe and Asia.
One potential solution that hasn't been discussed much here is leveraging AWS managed services like AWS CloudFront or AWS Global Accelerator. It's worth noting that Dropbox's website already uses AWS CloudFront, so they are already leveraging AWS in some capacity. Based on my cost calculations, using AWS CloudFront would cost around $40k a month, while AWS Global Accelerator would be around $22k a month.
As of August 22, 2022, AWS CloudFront supports terminating HTTP/3 in a Point of Presence (POP), which could potentially help with the latency issues Dropbox is facing. AWS Global Accelerator, on the other hand, is designed to improve the performance of applications by terminating UDP/TCP as close to users as possible and then routing user traffic through the AWS global network infrastructure. This could help reduce latency by ensuring that user traffic is routed through the most optimal path, even if the user is located far from a Dropbox data center.
It's hard to estimate the potential latency reduction from using e.g. AWS Global Accelerator, especially at higher percentiles. However, using https://speedtest.globalaccelerator.aws/, and assuming symmetry, my connections to Asia show 35-40% lower latency.
Of course, there are trade-offs to consider when using managed services like AWS CloudFront and AWS Global Accelerator. While they can provide significant performance improvements, they also come with additional costs and potential vendor lock-in. However, given the scale of Dropbox's operations and the importance of providing a fast, reliable search experience for their users, it may be worth exploring these options further.
---
Cost estimates
Assumptions:
1. Dropbox's peak traffic is 1,500 queries per second (QPS).
2. Average data transfer per query is 100 KB.
3. 50% of the traffic comes from North America, 25% from Europe, and 15% from Asia (the remaining 10% comes from other regions; for pricing purposes it's folded into Asia's figures, so Asia is treated as 25%).
---
1) AWS CloudFront cost estimation:
Data transfer:
- North America: 1,500 QPS * 0.5 * 100 KB * 60 seconds * 60 minutes * 24 hours * 30 days = 194.4 TB
- Europe: 1,500 QPS * 0.25 * 100 KB * 60 seconds * 60 minutes * 24 hours * 30 days = 97.2 TB
- Asia: 1,500 QPS * 0.25 * 100 KB * 60 seconds * 60 minutes * 24 hours * 30 days = 97.2 TB
Data transfer cost:
- North America: 194.4 TB * $0.085/GB = $16,524
- Europe: 97.2 TB * $0.085/GB = $8,262
- Asia: 97.2 TB * $0.120/GB = $11,664
Total data transfer cost: $16,524 + $8,262 + $11,664 = $36,450
HTTP requests:
Total requests: 1,500 QPS * 60 seconds * 60 minutes * 24 hours * 30 days = 3,888,000,000
Using the updated AWS CloudFront pricing (as of May 22, 2023):
- HTTP requests cost: 3,888,000,000 * $0.0075/10,000 = $2,916
Total estimated monthly cost for AWS CloudFront: $36,450 (data transfer) + $2,916 (HTTP requests) = $39,366, roughly $40k
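If you want to reproduce the arithmetic, here's a rough Python sketch; all inputs are just the assumptions above, so the output is only as good as those inputs:

```python
# Rough sketch of the CloudFront estimate above (traffic split and prices as assumed).
QPS = 1_500
KB_PER_QUERY = 100
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

def monthly_tb(share: float) -> float:
    """Monthly transfer for a regional traffic share, in (decimal) TB."""
    kb = QPS * share * KB_PER_QUERY * SECONDS_PER_MONTH
    return kb / 1e9  # KB -> TB

na_tb, eu_tb, asia_tb = monthly_tb(0.50), monthly_tb(0.25), monthly_tb(0.25)

transfer_cost = (na_tb * 1_000 * 0.085 +    # $/GB, North America
                 eu_tb * 1_000 * 0.085 +    # $/GB, Europe
                 asia_tb * 1_000 * 0.120)   # $/GB, Asia

requests = QPS * SECONDS_PER_MONTH
request_cost = requests / 10_000 * 0.0075   # $ per 10k requests

print(na_tb, eu_tb, asia_tb)                 # 194.4, 97.2, 97.2 TB
print(round(transfer_cost))                  # ~36,450
print(round(request_cost))                   # ~2,916
print(round(transfer_cost + request_cost))   # ~39,366
```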
---
2) AWS Global Accelerator cost estimation:
Data transfer:
- Total data transfer: 194.4 TB (NA) + 97.2 TB (EU) + 97.2 TB (Asia) = 388.8 TB
Using the updated AWS Global Accelerator pricing (as of May 22, 2023):
- Data transfer cost (averaged across regions): 388.8 TB * $0.035/GB = $13,608
- (Also need to add EC2 egress cost: 388.8 TB * $0.02/GB = $7,776)
- Total data transfer cost = $21,384
Accelerator:
- Assuming 1 accelerator with 2 endpoints (1 for HTTP/2 and 1 for HTTP/3)
- Accelerator cost: 1 * $18/accelerator/day * 30 days = $540
Total estimated monthly cost for AWS Global Accelerator: $21,384 (data transfer) + $540 (accelerator) = $21,924, roughly $22k
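And the same sketch for the Global Accelerator side, with the same caveat that the prices and the accelerator fee are the assumptions above:

```python
# Rough sketch of the Global Accelerator estimate above.
TOTAL_TB = 194.4 + 97.2 + 97.2             # 388.8 TB across all regions
ga_transfer = TOTAL_TB * 1_000 * 0.035     # accelerator data transfer premium, $/GB (averaged)
ec2_egress  = TOTAL_TB * 1_000 * 0.020     # EC2 egress behind the accelerator, $/GB
accelerator = 1 * 18 * 30                  # fixed fee per accelerator, as assumed above

print(round(ga_transfer))                              # ~13,608
print(round(ec2_egress))                               # ~7,776
print(round(ga_transfer + ec2_egress + accelerator))   # ~21,924
```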
Former Dropbox employee, just correcting one assumption:
> Dropbox's peak traffic is 1,500 queries per second (QPS).
I can't speak to search QPS directly, but most individual serving hosts for file sync/retrieval were receiving tens of thousands of QPS. The overall edge QPS peaked at several hundred thousand QPS every day across all the hosts. So I'd guess that even just search is an order of magnitude higher than 1,500 :)
When I worked at AWS there was a similar scenario in eu-west-2. There was a fire in one of the availability zones (AZs). The fire suppression system kicked in and flooded the data center up to ankle or knee height. All the racks were powered off and the building was evacuated for hours (I don't remember the duration of the evacuation) until the water was pumped out.
But for the service team I worked for, our AZ-evacuation story wasn't great at the time, and it took us tens of minutes to manually move out of the AZ; at least there wasn't a customer-visible availability impact. Once we did, it was just monitoring and babysitting until we got the word to move back in, I think 1-2 days later.
If you operate on AWS you work with the assumption that an AZ is a failure domain, and can die at any time. Surprisingly many service teams at AWS still operate services that don't handle AZ failure that well (at the time). But if you operate services in the cloud you have to know what the failure domain is.
> Surprisingly many service teams at AWS still operate services that don't handle AZ failure that well (at the time)
Ouch, hopefully none of the major services? I recently had to look into this for work (for disaster recovery preparation) and it seemed like ECS, Lambda, S3, DynamoDB and Aurora Serverless (and probably CloudWatch and IAM) all said they handled availability zone failures transparently enough.
I’m familiar with Lambda and DynamoDB. When I left in 2022 they both had strong automated or semi-automated AZ evacuation stories.
I’m not that familiar with S3, but I never noticed any concerns with S3 during an AZ outage. I’m not at all familiar with Aurora Serverless or ECS.
For all AWS services you can always ask AWS Support pointed, specific questions about availability. They usually defer to the service team and they’ll give you their perspective.
Also keep in mind that AWS teams differentiate between the availability of their control and data planes. During an AZ outage you may struggle to create/delete resources until an AZ evacuation is completed internally, but already-created resources should always meet the public SLA. That's why, especially for DR, I recommend active-active or active-"pilot light": have everything created in all AZs/regions up front so your DR plan doesn't depend on creating resources.
Okay good to know - Lambda seemed to suggest it could handle an availability zone going down without any trouble.
ECS Fargate's default is to distribute task instances across availability zones too, but I assume if you use EC2 it might not be as straightforward.
And that makes sense - I remember during the last outage that affected me it was a compute rather than data failure and the running stuff continued fine, just nothing new was getting created.
1) mean (for throughput estimates),
2) p50 (typical customer / typical load),
3) p90 / p99 / p99.99 (whatever you think your tail is),
4) p100 (the max, always useful to see and know; maybe p0 too).
Or throw in a histogram or kernel density estimate; sometimes there are really interesting patterns.
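To be concrete about what I mean, a minimal Python sketch (the lognormal numbers are made-up stand-ins for real measurements):

```python
# Summary stats I'd want reported for a latency sample.
import random
import statistics

latencies_ms = [random.lognormvariate(3.0, 0.6) for _ in range(100_000)]  # fake data

def percentile(data, p):
    """Nearest-rank percentile, p in [0, 100]."""
    s = sorted(data)
    idx = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[idx]

print("mean   ", statistics.fmean(latencies_ms))
print("p50    ", percentile(latencies_ms, 50))
print("p90    ", percentile(latencies_ms, 90))
print("p99    ", percentile(latencies_ms, 99))
print("p99.99 ", percentile(latencies_ms, 99.99))
print("p0/p100", min(latencies_ms), max(latencies_ms))
```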
I’ve never seen a technical blog give such traffic or latency details though.
Edit: reading other comments, please do not read this as diminishing this blog post or Discord; it's great, clear writing and an impressive solution to an interesting problem.
Unless you're happy in a slightly different way than expected. Oh and we aren't going to bother documenting what we mean by happy, and it may be different for different functions, and we may change it at any time as a side effect of normal code maintenance.
Congratulations! I did the exact same thing; made an app to help name my first child.
I've slowly developed a method which incorporates 1) culture, 2) popularity in different countries, and 3) pronunciation of names, and attempts to recommend names based on names you like. It kind of works; it's taken a lot of tuning to make it output something sensible. It's specifically designed to attempt to combine cultures together, which is a top request I identified from customers.
I've been working on releasing the app for a while. If you're interested in helping me test it before its release this month please feel free to sign up here:
Amazon may use keys from a variety of sources, but the ones I’ve seen were packaged differently from anything I’ve seen from an actual Yubikey.
If only people were allowed to bring their own Yubikey for U2F and OTP, then they wouldn’t have to wait on whatever official procurement processes are in place from their approved suppliers.
I've got a couple of branded ones from Amazon IT -- a big one and a tiny one. Currently they're giving out some other one, but I think you can bring your own too.
“To identify the location of your resources relative to your accounts, you must use the AZ ID, which is a unique and consistent identifier for an Availability Zone. For example, use1-az1 is an AZ ID for the us-east-1 Region and it is the same location in every AWS account”
So each separate AWS account will have a different AZ name that maps to use1-az1, but use1-az1 is a region-wide constant.
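For example, you can see your own account's mapping with a few lines of boto3 (assuming boto3 is installed and you have credentials for the account in question):

```python
# Print this account's AZ-name -> AZ-ID mapping for us-east-1.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
for az in ec2.describe_availability_zones()["AvailabilityZones"]:
    # ZoneName (e.g. us-east-1a) is randomized per account; ZoneId (e.g. use1-az1) is not.
    print(az["ZoneName"], "->", az["ZoneId"])
```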
I haven’t read the paper. But if the researchers were bringing in children regularly for testing involving eye tracking equipment, why not also test their ability to manipulate physical objects, recognize faces, and track faces? That all seems like low hanging fruit.