Basecamp Outage Post-mortem (signalvnoise.com)
42 points by sudhirj on Nov 12, 2018 | hide | past | favorite | 18 comments


Short story: storing IDs in an integer column instead of a bigint.

Now would be a good time to check your own IDs and get the max value of each. Sure is easier to fix this ahead of time than when your app is down.
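A quick back-of-envelope version of that check, as a Python sketch (the table names and max-ID values below are hypothetical stand-ins for the output of `SELECT MAX(id) FROM <table>`): compare each table's current max ID against the signed 32-bit ceiling of 2,147,483,647.

```python
# Signed 32-bit ceiling used by typical INT primary key columns.
INT_MAX = 2**31 - 1  # 2,147,483,647

# Hypothetical per-table results of `SELECT MAX(id) FROM <table>`.
max_ids = {
    "users": 48_000_000,
    "events": 1_900_000_000,   # uncomfortably close to the ceiling
    "comments": 310_000_000,
}

for table, max_id in sorted(max_ids.items(), key=lambda kv: -kv[1]):
    used = max_id / INT_MAX
    print(f"{table:10s} {max_id:>13,d}  {used:6.1%} of INT space used")
```

Running something like this periodically (or alerting on it) is exactly the "fix it ahead of time" the comment suggests.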


Consider using GUIDs/UUIDs instead of integers. GUIDs are friendlier for replication, fake/mock data, and other reasons. The only situations where I'd recommend against GUIDs are when total row size is a concern (GUIDs are usually larger on disk than integers) or when some legacy requirement forces integers (e.g., an integration with a third party).

Another thing I've done is use GUIDs for the primary key, then have an identity/sequence column called something like Sequence, which retains the sequencing capabilities of a regular integer identity column.
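That pattern can be sketched in a few lines of Python with sqlite3 as a stand-in (table and column names are made up; here SQLite's implicit rowid plays the role of the auto-incrementing Sequence column a real schema would declare):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# GUID as the primary key; SQLite's implicit rowid stands in for the
# identity/sequence column a real engine would provide natively.
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, note TEXT)")

for note in ("first", "second", "third"):
    conn.execute("INSERT INTO orders (id, note) VALUES (?, ?)",
                 (str(uuid.uuid4()), note))

# The sequence column preserves insertion order even though the
# GUID keys themselves are random.
rows = conn.execute(
    "SELECT rowid, note FROM orders ORDER BY rowid").fetchall()
print(rows)  # [(1, 'first'), (2, 'second'), (3, 'third')]
```

The design choice: the GUID is what you expose externally, while the sequence column keeps cheap "in what order did these arrive" queries.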


The flip side is that GUIDs take up a ton of space and are full of entropy, which makes them inefficient to work with: they're not cache-friendly and not sortable. Certain database engines (e.g., InnoDB on MySQL) also limit the length of primary keys for performance reasons, which can disqualify GUIDs. With clustering, short keys are better because they reduce the IOPS required to fetch data.
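The "not sortable" point is easy to see with Python's uuid module: random v4 GUIDs carry no ordering, so a B-tree index on them takes inserts at effectively random positions instead of appending at the end.

```python
import uuid

# Generate 50 GUIDs in insertion order.
ids = [str(uuid.uuid4()) for _ in range(50)]

# Their lexicographic order has nothing to do with insertion order,
# so an index on this column sees writes scattered across its pages.
print(sorted(ids) == ids)  # almost surely False
```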

See https://www.percona.com/blog/2006/10/03/long-primary-key-for... for an interesting discussion of the topic.

In general, it's unwise to make recommendations about what data types to use, especially for primary keys, without thorough research on their implications.


I found https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-ca... has a good rundown of the pros and cons.


Use Twitter snowflake IDs. Twitter pulled their implementation because it was out of date, but Sony maintains one:

https://github.com/sony/sonyflake
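The core idea is to pack a timestamp, a machine ID, and a per-machine sequence into one 64-bit integer, so IDs are unique across machines yet still sort roughly by creation time. A sketch of the original Twitter-style 41/10/12 bit split (Sonyflake itself uses a different layout: 39 time bits, 8 sequence bits, 16 machine-ID bits):

```python
import time

EPOCH_MS = 1_288_834_974_657  # Twitter's custom epoch (Nov 2010)

def snowflake(machine_id, sequence, now_ms=None):
    """Pack a 64-bit ID: 41 bits of time, 10 bits machine, 12 bits sequence."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return ((now_ms - EPOCH_MS) << 22) | (machine_id << 12) | sequence

# IDs generated later sort after IDs generated earlier -- the property
# random GUIDs lack.
a = snowflake(machine_id=1, sequence=0, now_ms=1_541_980_800_000)
b = snowflake(machine_id=1, sequence=1, now_ms=1_541_980_800_000)
c = snowflake(machine_id=1, sequence=0, now_ms=1_541_980_800_001)
print(a < b < c)  # True
```

You get GUID-like distributed generation while keeping the short, sortable, cache-friendly keys the parent comment argues for.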


The vast majority of projects will never reach two-billion-row tables (ID > 0), but it should be considered when planning and regularly revisited later. BIGINT from the start uses twice as much space; if it's not needed, it's just a waste. You have to think about memory, indexing, and future scans. Hundreds of millions of rows times 4 extra bytes is a lot.
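The arithmetic is worth doing explicitly (row and index counts below are illustrative): the extra 4 bytes per row show up once in the table and again in every secondary index, since with a clustered engine like InnoDB each secondary index repeats the primary key.

```python
rows = 500_000_000            # hypothetical table size
extra_per_row = 8 - 4         # BIGINT (8 bytes) vs INT (4 bytes)
secondary_indexes = 3         # each secondary index repeats the PK

extra_bytes = rows * extra_per_row * (1 + secondary_indexes)
print(f"{extra_bytes / 1024**3:.1f} GiB of extra space")  # ~7.5 GiB
```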


Great outage ticker by DHH on the longest Basecamp “outage” in 10 years! It was only 5 hours. I know that feels long for users, and it is. However, it wasn't even a complete outage; it brought the app into a read-only mode. And 10 years is a really long time to go with 99.998% uptime.

Everything about this is amazing to me. Rails and good small dev teams still rock, probably more so now than ever!


99.998% sounds great, but the 99% here not so much: https://basecamp.com/3/uptime It would be better if it showed a more precise number.


What do you mean? Your link is showing "Basecamp 3 has been up 99.978% of the time since launch." as the main header. That's pretty precise(;


The text starts with 99%, and the .978 is added with JavaScript after page load. If you turn off JS you'll only see 99.


We also ran into this issue a few months ago. Our problem is our data model for the table in question is a mess, with literally hundreds of foreign keys pointing at that ID column (so all of those tables would need to be altered as well). Migrating all of those would have resulted in an unacceptably long downtime (i.e. multiple days). We started down that path, but quickly realized it wasn't a viable option.

So, plan B... It's a signed integer... The auto increment started at 1 and counted up from there... We literally had 50% of the ID space still available to us. So now we've started at -1 and are counting our way down from there. Obviously we're adding rows a lot faster now than we were at the beginning of time, but it has bought us more than enough time to fix the issue properly.

It took a bit of cleverness to get those negative IDs (MySQL won't let you set the auto increment start point to a value less than the current max ID), but the end result is that we got back up and running in a couple hours, as opposed to the multiple days we initially thought we were looking at.
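The countdown can be sketched application-side (Python with sqlite3 as a stand-in; the commenter's actual MySQL workaround isn't spelled out, so the explicit negative-ID insert here is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, name TEXT)")
# Pretend the positive half of the signed-int space is nearly exhausted.
conn.execute("INSERT INTO things VALUES (2147483646, 'old row')")

def insert_counting_down(name):
    # Find the lowest negative ID so far (0 if none exist yet) and
    # allocate one below it: -1, then -2, then -3, ...
    (lowest,) = conn.execute(
        "SELECT COALESCE(MIN(id), 0) FROM things WHERE id < 0").fetchone()
    new_id = lowest - 1
    conn.execute("INSERT INTO things VALUES (?, ?)", (new_id, name))
    return new_id

new_ids = [insert_counting_down(n) for n in ("a", "b", "c")]
print(new_ids)  # [-1, -2, -3]
```

In production this allocation would need to be atomic (a single transaction, or a dedicated sequence table), which the sketch skips.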


Stop-the-world DB migrations like this suck. I read the other day that Postgres can create indexes concurrently; can any DB engine do bigger changes like the one in TFA concurrently? That would be cool.


What does data verification in a large rails project like that entail?


> We should have known better. We should have done our due diligence when this improvement was made to the framework two years ago. I accept full responsibility for failing to heed that warning, and by extension for causing the multi-hour outage today. I’m really, really sorry

I don't know why, but seeing "I’m really, really sorry" in a Post-mortem doesn't feel right.


> I don't know why, but seeing "I’m really, really sorry" in a Post-mortem doesn't feel right.

Maybe your comment just reads wrong (context is tough in text format), but I wish you would reconsider this idea. Here is a business owner accepting full responsibility and doing his best to provide some insight into what went wrong and why it took so long to get back online. It reads to me like someone who cares deeply about his customers and wants to project the sense that they are his number one concern. It's honestly refreshing, and I wish it were a more common thing. I'm not a BC customer, but I'll be giving them a try next time I am in the market. They clearly care.


>I don't know why, but seeing "I’m really, really sorry" in a Post-mortem doesn't feel right.

Is there anything he could have said that would appease everyone? I think he is really stuck, and at least he took responsibility for what happened.

That seems rare these days, and even if his wording doesn't feel right, his words and the detailed explanation of their actions deserve a lot of respect.


To me, it projects a deep sense of ownership and responsibility, which is right on.


It’s not a post-mortem. The second paragraph mentions one will follow soon.



