> I disagree. People implemented those systems, so if you are correct that it is the system's fault, then it is also a person's fault.
How do you make sure that mistakes don't happen, then? Do you blame and fire people who make mistakes, and hope that the next person put in the same spot doesn't make a mistake? Or do you figure out what caused that person to make the mistake and ensure there are processes in place so that next time this is less likely to happen?
Extrinsic motivators like 'we will give you a bonus' or 'we will fire you' are surprisingly bad at getting people to not fuck things up.
Let's hope you don't ever go into management. You clearly have no idea how to motivate and retain people, or any insight into how hard it is to hire good people to begin with. And no, I'm pretty certain this is not how Netflix's culture is.
Riiiight... Anyway, you kept complaining about being downvoted. Here's a clue: you're being an ass and no one likes you or what you have to say because you're wrong. So go scurry back to reddit where you belong, troll...
> you're being an ass and no one likes you or what you have to say because you're wrong. So go scurry back to reddit where you belong, troll...
Okay? Some proof, please? This is not far off from a baseless character attack, which isn't really effective when you're trying to convince me that you know about Netflix's culture.
If you really want a proper answer, the truth is, unfortunately for you, that I am in management (previously an engineer) and have always known Netflix to have a stellar performance-oriented (and fear-driven) culture. Their playbook operates like a sports team's. It's not for everyone, but that's the point, and it works for them.
Maybe you should look inward if you're so vexed with me that you call me silly names and can't handle the truth about why some companies like Netflix adopt this culture.
You think downvotes and character attacks make a good argument? They don't count as proof IMO if there isn't a valid argument presented. You're going to have to do a lot better than that.
And back to the main point: I assume you agree that Netflix did go completely down the other day, then, right? According to you, it seems you know Netflix's management culture better.
> I'm pretty certain this is not how Netflix's culture is.
Would you be willing to share your expert insight on this, if you know better?
I'm not arguing about Netflix; it's mostly your attitude towards management and engineering culture. Basically your reply to the user "q3k": "Extrinsic motivators like 'we will give you a bonus' or 'we will fire you' are surprisingly bad at getting people to not fuck things up". You don't fire people just because they made a mistake. You find out what caused it, how to prevent it in the future, and you move on. That's what blameless post-mortems are about. No one is perfect, and if you really are a manager who expects perfection, you really just suck as a person.
But now getting back to Netflix: they have post-mortems and they don't fire people willy-nilly over mistakes. Sure, it's not hugops (a term I don't care for either), but they don't just up and fire people over a mistake. I never said anything about Netflix going up or down on that day, but they also have problems just like everyone else. Their SLA is not 100% uptime and neither is Fastly's.
In closing, you are being a pedantic little bitch who wants to argue minutiae and I'm done with your trolling. I'm done responding to you; feel free to have the last reply, as I really don't care anymore.
v2. "The issue was caused by a previously unidentified pathway that caused a feedback loop and overloaded our servers in a cascading fashion (or whatever). We have implemented a fix for this and updated our testing and deployment processes to stop similar cascades."
Which solves the problem long term?
As an architect making product choices, v2 wins every time.
(With the caveat that if the cause was something that reveals a fundamental problem with the larger processes/professionalism/culture of the company, especially to do with security concerns, then I'm not buying that product, and I'm migrating away if we already use it.)
If an employee does something actively malicious, you should absolutely remove them. This is very rare though: incompetence or a broken system is much more likely.
Otherwise you develop internal process that's entirely scar tissue, and only stops your teams doing their jobs.
I feel it is somewhat obvious and goes without saying that malicious action results in personal responsibility & repercussions. However, I don't have any evidence or past experience that malicious action by an internal employee is a likely scenario for most outages. It may well occur, but most examples I've heard of seem apocryphal.
On the scar tissue: this is where good choices come in, because it's certainly not a rule that a change made as a result of an incident review is an impediment to work. Such impediments definitely occur, and sometimes linger after the root cause is phased out. But best practices often reduce cognitive & process overheads.
A rough example is that there are still people out there FTPing code to servers, having to manually select which files from a directory to upload. Replacing this error-prone process with a deployment pipeline leads to a massive reduction in the likelihood of errors and will actually speed up the deployment process. It's all about making the right choices, not knee-jerk protections, and sometimes the choice is to leave things as they are.
> People must be held accountable to have good incentives to reduce such outages in the future.
Holding specific people "accountable" for outages doesn't incentivize reducing outages; it incentivizes not getting caught for having caused the outage.
As a result, post-mortems turn into finger-pointing games instead of finding and resolving the root cause of the issue, which costs the company more money in the long run when a political scapegoat is found but the actual bug in the code is not.
Loss of trust in a service provider and the subsequent loss of business is quite an incentive. Having someone drawn and quartered just provides an incentive to scapegoat.
YC never claimed to have a perfect selection process. I wouldn’t hold them accountable for the teams they pick based on a business plan and a 10 min discussion. It's just a good guess.
If they doubled down on the team after demo day, then I am with you.
Compare this with hiring. I have heard successful people say they do get 1 in 5 hires wrong, even after a process which is 10x more thorough.
50% is anti-big-tech and 20% is our fear of being made redundant.