Basecamp Outage Post-mortem (signalvnoise.com)
42 points by sudhirj on Nov 12, 2018 | hide | past | favorite | 18 comments


Short story: storing IDs in an integer column instead of a bigint.

Now would be a good time to check your own IDs and get the max value of each. Sure is easier to fix this ahead of time than when your app is down.
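A quick back-of-envelope version of that check, as a Python sketch (the table names and max-ID values below are hypothetical stand-ins for the output of `SELECT MAX(id) FROM <table>`): compare each table's current max ID against the signed 32-bit ceiling of 2,147,483,647.

```python
# Signed 32-bit ceiling used by typical INT primary key columns.
INT_MAX = 2**31 - 1  # 2,147,483,647

# Hypothetical per-table results of `SELECT MAX(id) FROM <table>`.
max_ids = {
    "users": 48_000_000,
    "events": 1_900_000_000,   # uncomfortably close to the ceiling
    "comments": 310_000_000,
}

for table, max_id in sorted(max_ids.items(), key=lambda kv: -kv[1]):
    used = max_id / INT_MAX
    print(f"{table:10s} {max_id:>13,d}  {used:6.1%} of INT space used")
```

Running something like this periodically (or alerting on it) is exactly the "fix it ahead of time" the comment suggests.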


Consider using GUIDs/UUIDs instead of integers. GUIDs are friendlier for replication, fake/mock data, and other reasons. The only situations where I'd recommend against GUIDs are when total row size is a concern (GUIDs are usually larger on disk than integers) or when some legacy requirement forces integers (e.g., an integration with a third party).

Another thing I've done is use GUIDs for the primary key, then have an identity/sequence column called something like Sequence, which retains the sequencing capabilities of a regular integer identity column.
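That pattern can be sketched in a few lines of Python with sqlite3 as a stand-in (table and column names are made up; here SQLite's implicit rowid plays the role of the auto-incrementing Sequence column a real schema would declare):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# GUID as the primary key; SQLite's implicit rowid stands in for the
# identity/sequence column a real engine would provide natively.
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, note TEXT)")

for note in ("first", "second", "third"):
    conn.execute("INSERT INTO orders (id, note) VALUES (?, ?)",
                 (str(uuid.uuid4()), note))

# The sequence column preserves insertion order even though the
# GUID keys themselves are random.
rows = conn.execute(
    "SELECT rowid, note FROM orders ORDER BY rowid").fetchall()
print(rows)  # [(1, 'first'), (2, 'second'), (3, 'third')]
```

The design choice: the GUID is what you expose externally, while the sequence column keeps cheap "in what order did these arrive" queries.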


The flip side is that GUIDs take up a ton of space and are full of entropy, which makes them inefficient to work with: they're not cache-friendly and not sortable. Certain database engines (e.g., InnoDB on MySQL) also limit the length of primary keys for performance reasons, which can disqualify GUIDs. With clustering, short keys are better because they reduce the IOPS required to fetch data.
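The "not sortable" point is easy to see with Python's uuid module: random v4 GUIDs carry no ordering, so a B-tree index on them takes inserts at effectively random positions instead of appending at the end.

```python
import uuid

# Generate 50 GUIDs in insertion order.
ids = [str(uuid.uuid4()) for _ in range(50)]

# Their lexicographic order has nothing to do with insertion order,
# so an index on this column sees writes scattered across its pages.
print(sorted(ids) == ids)  # almost surely False
```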

See https://www.percona.com/blog/2006/10/03/long-primary-key-for... for an interesting discussion of the topic.

In general, it's unwise to make recommendations about what data types to use, especially for primary keys, without thorough research on their implications.


I found https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-ca... has a good rundown of the pros and cons.


Use Twitter snowflake IDs. Twitter pulled their implementation because it was out of date, but Sony maintains one:

https://github.com/sony/sonyflake
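The core idea is to pack a timestamp, a machine ID, and a per-machine sequence into one 64-bit integer, so IDs are unique across machines yet still sort roughly by creation time. A sketch of the original Twitter-style 41/10/12 bit split (Sonyflake itself uses a different layout: 39 time bits, 8 sequence bits, 16 machine-ID bits):

```python
import time

EPOCH_MS = 1_288_834_974_657  # Twitter's custom epoch (Nov 2010)

def snowflake(machine_id, sequence, now_ms=None):
    """Pack a 64-bit ID: 41 bits of time, 10 bits machine, 12 bits sequence."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return ((now_ms - EPOCH_MS) << 22) | (machine_id << 12) | sequence

# IDs generated later sort after IDs generated earlier -- the property
# random GUIDs lack.
a = snowflake(machine_id=1, sequence=0, now_ms=1_541_980_800_000)
b = snowflake(machine_id=1, sequence=1, now_ms=1_541_980_800_000)
c = snowflake(machine_id=1, sequence=0, now_ms=1_541_980_800_001)
print(a < b < c)  # True
```

You get GUID-like distributed generation while keeping the short, sortable, cache-friendly keys the parent comment argues for.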


The vast majority of projects will never reach two-billion-row tables (ID > 0), but it should be considered when planning and regularly revisited later. BIGINT from the start uses twice as much space; if it's not needed, it's just a waste. You have to think about memory, indexing, and future scans. Hundreds of millions of rows times 4 extra bytes is a lot.
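The arithmetic is worth doing explicitly (row and index counts below are illustrative): the extra 4 bytes per row show up once in the table and again in every secondary index, since with a clustered engine like InnoDB each secondary index repeats the primary key.

```python
rows = 500_000_000            # hypothetical table size
extra_per_row = 8 - 4         # BIGINT (8 bytes) vs INT (4 bytes)
secondary_indexes = 3         # each secondary index repeats the PK

extra_bytes = rows * extra_per_row * (1 + secondary_indexes)
print(f"{extra_bytes / 1024**3:.1f} GiB of extra space")  # ~7.5 GiB
```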


Great outage ticker by DHH on the longest Basecamp “outage” in 10 years! It was only 5 hours. I know that feels long for users, and it is. However, it wasn't even a complete outage; it brought the app into a read-only mode. And 10 years is a really long time to go with 99.998% uptime.

Everything about this is amazing to me. Rails and good small dev teams still rock, probably more so now than ever!


99.998% sounds great, but the 99% here not so much: https://basecamp.com/3/uptime It would be better if it showed a more precise number.


What do you mean? Your link is showing "Basecamp 3 has been up 99.978% of the time since launch." as the main header. That's pretty precise(;


The text starts with 99%, and the .978 is added with JavaScript after page load. If you turn off JS you'll only see 99.


We also ran into this issue a few months ago. Our problem is our data model for the table in question is a mess, with literally hundreds of foreign keys pointing at that ID column (so all of those tables would need to be altered as well). Migrating all of those would have resulted in an unacceptably long downtime (i.e. multiple days). We started down that path, but quickly realized it wasn't a viable option.

So, plan B... It's a signed integer... The auto increment started at 1 and counted up from there... We literally had 50% of the ID space still available to us. So now we've started at -1 and are counting our way down from there. Obviously we're adding rows a lot faster now than we were at the beginning of time, but it has bought us more than enough time to fix the issue properly.

It took a bit of cleverness to get those negative IDs (MySQL won't let you set the auto increment start point to a value less than the current max ID), but the end result is that we got back up and running in a couple hours, as opposed to the multiple days we initially thought we were looking at.
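The countdown can be sketched application-side (Python with sqlite3 as a stand-in; the commenter's actual MySQL workaround isn't spelled out, so the explicit negative-ID insert here is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, name TEXT)")
# Pretend the positive half of the signed-int space is nearly exhausted.
conn.execute("INSERT INTO things VALUES (2147483646, 'old row')")

def insert_counting_down(name):
    # Find the lowest negative ID so far (0 if none exist yet) and
    # allocate one below it: -1, then -2, then -3, ...
    (lowest,) = conn.execute(
        "SELECT COALESCE(MIN(id), 0) FROM things WHERE id < 0").fetchone()
    new_id = lowest - 1
    conn.execute("INSERT INTO things VALUES (?, ?)", (new_id, name))
    return new_id

new_ids = [insert_counting_down(n) for n in ("a", "b", "c")]
print(new_ids)  # [-1, -2, -3]
```

In production this allocation would need to be atomic (a single transaction, or a dedicated sequence table), which the sketch skips.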


Stop-the-world DB migrations like this suck. I read the other day that Postgres can create indexes concurrently; can any DB engine do bigger changes like the one in TFA concurrently? That would be cool.


What does data verification in a large rails project like that entail?


> We should have known better. We should have done our due diligence when this improvement was made to the framework two years ago. I accept full responsibility for failing to heed that warning, and by extension for causing the multi-hour outage today. I’m really, really sorry

I don't know why, but seeing "I’m really, really sorry" in a Post-mortem doesn't feel right.


> I don't know why, but seeing "I’m really, really sorry" in a Post-mortem doesn't feel right.

Maybe your comment just reads wrong (context is tough in text format), but I wish you would reconsider this idea. Here is a business owner accepting full responsibility and doing his best to provide some insight into what went wrong and why it took so long to get back online. It reads to me like someone who cares deeply about his customers and wants to project the sense that they are his number one concern. It's honestly refreshing, and I wish it were a more common thing. I'm not a BC customer, but I'll be giving them a try next time I am in the market. They clearly care.


>I don't know why, but seeing "I’m really, really sorry" in a Post-mortem doesn't feel right.

Is there anything he could have said that would appease everyone? I think he is really stuck, and at least he took responsibility for what happened.

That seems rare these days, and even if his wording doesn't feel right, his words and the detailed explanation of their actions deserve a lot of respect.


To me, it projects a deep sense of ownership and responsibility, which is right on.


It’s not a post-mortem. The second paragraph mentions one will follow soon.



