CrowdStrike's outage should not have happened

beardedwizard · on Aug 8, 2024

This outage says more to me about the state of software engineering in 2024 than it does about crowdstrike. Starting with the fact that software like crowdstrike exists to compensate for even poorer software rife with exploitable vulnerabilities. It is certainly hard to defend crowdstrike, but is even harder to hear so many hot takes when the engineer emperor has no clothes.

Critical software engineering is a race to the bottom across many domains. Healthcare, banking, flight systems, etc.

raggles · on Aug 8, 2024

Yep, I'm an Electrical Engineer, and I also write code. I don't like the term Software Engineer, because there is none of the of the regulated safety and quality mechanisms required for software that are normally associated with professional engineering.

theideaofcoffee · on Aug 8, 2024

Same, which is why I bristle at my title containing "engineer" as I don't have a PE. If most software engineers want to legitimately call themselves engineers, the field should be formalized as an engineering discipline, including coursework, certification/licensure and, better yet, apprenticeship-like experiences required for "real" engineers working toward their Professional Engineer license.

Edit: I'd add this goes double when working on safety-critical code, or anything touching protected health data, or payment/financial data. It's just too toxic and valuable to leave to a chance change.

pxc · on Aug 8, 2024

> If most software engineers want to legitimately call themselves engineers, the field should be formalized as an engineering discipline, including coursework, certification/licensure and, better yet, apprenticeship-like experiences required for "real" engineers working toward their Professional Engineer license.

I agree, although in reality it's not chiefly developers themselves who are responsible for quick, lazy approaches, is it? Developers are typically the parties most pained by technical debt. If the discipline of software development is to become software engineering in earnest, there will have to be some pressure all the way up the management chain— pressure strong enough to outweigh software's low cost of iteration. I imagine this is really rare outside of highly regulated industries and very specific applications, and even with a formalized software engineering discipline, many companies will prefer sloppy software development and many competitive markets will 'select for' such companies.

ethbr1 · on Aug 8, 2024

> although in reality it's not chiefly developers themselves who are responsible for quick, lazy approaches, is it? Developers are typically the parties most pained by technical debt.

I'd agree with you, except... ooh, a new, shiny, untested language / framework / platform to rewrite the codebase in!

hello_moto · on Aug 8, 2024

Yup, like for example: Rust!

pxc · on Aug 8, 2024

Rust isn't really new or untested, though. I issue with RIIR isn't Rust so much as the act of rewriting, which carries inherent risks.

I think the temptation to rewrite also reflects how messy and unworkable we let codebasee get— sometimes that impulse is more about the pain of working with the existing codebase than anything else.

ethbr1 · on Aug 8, 2024

To me, the tragedy of rewriting is the underappreciation of why the gnarliest 5% of the codebase is there.

Occasionally, usually because initial requirements were sorely lacking or changed, you can simplify the system via rewrite.

More often, everyone ends up realizing they didn't actually understand that last 5% of edge cases.

And then you've either replaced the working system with a 95% complete solution (so common in modern software) or you produce a system equally ugly once you handle that last 5%.

paulddraper · on Aug 8, 2024

> I bristle at my title containing "engineer" as I don't have a PE

Lol HN.

Outside of civil/structural disciplines, PE is not required for engineering.

Mechanical, Chemical, Electrical, Nuclear don't require it.

I've literally never met an Aerospace engineer with a PE, and they build planes 'n sht.

---

It's a pure resume padder, like Cisco or AWS certifications.

randomcarbloke · on Aug 8, 2024

America's obsession with professional licensure is so confusing, it's an absurd level of protectionism for a country that claims to love freedom, do you really need a license to cut hair etc?

Some programmers (evident in the replies) even think it should exist for their profession, a very worrying idea.

theideaofcoffee · on Aug 8, 2024

"Nah, those uppity software engineers don't need any additional training. How dare they think they have any right to tell us how to manage this roll-out" says management, "We'll just cowboy this update and wipe out 8.5 million systems. Besides, it's not like they're working on nuclear reactors, yeah? How complex could it be?"

You're just proving my point in that it's a CTO that dismisses the argument in a rather childish way. They would be the one to be told 'no' by the now-professional software engineers when their license and livelihood is on the line while being pressured to do something that goes against their recommendation. Funny how that power dynamic changes when there's something real on the line and not just an inflated title, huh?

Perhaps if those aerospace and software engineers that attempted to blow the whistle at Boeing were successful and were empowered via their license to say enough and stop development on MCAS, there wouldn’t be 300+ dead people because of a software change rammed through by management. The licensure ain’t just window dressing. It has real, actual impact on real human lives. Don’t be so dismissive.

ffsm8 · on Aug 8, 2024

He isn't dismissive about the problem, he's dismissive about the proposed solution.

A certificate would not change the status quo at this point.

Software developers/engineers are - for the most part - seen as essentially blue collar workers. Replaceable gears that MBAs can just "scale up" or "down" to fit their currently desired velocity. Let's ignore the fact that this fundamentally isn't true, but it's what they believe.

The work they do is decided by MBAs, and the time they have to implement these changes is heavily influenced by other MBAs.

Adding a certificate to this mix will change literally nothing

paulddraper · on Aug 8, 2024

I'm dismissive of

> I bristle at my title containing "engineer" as I don't have a PE

because it is aggressively ignorant of facts.

I have no comment on the potential utility of a PE requirement, software or otherwise.

imiric · on Aug 8, 2024

A PE license isn't just for certifying a minimum level of technical competence. It also requires aligning with a specific code of ethics, to ensure safety and wellbeing of the public.

If there's anything that the software industry needs most is a code of ethics. Companies are built on software that exploits, tricks and deceives their users. They release borderline malware and get rich doing it, either by having complicit investors or fooling them with false valuations. They cover their asses with dishonest PR, and lobby governments to keep the party going. This happens in the largest tech giants and tiny startups alike. And don't get me started on the gaming industry and their predatory practices.

We often exculpate engineers as being cogs in the machine, but they're ultimately choosing to work in these places, and enable this behavior.

The world would be a much better place if software engineers were required to take and uphold the equivalent of the Hippocratic Oath. We don't expect less from health professionals. Why should we from IT ones when the world is run by software?

paulddraper · on Aug 8, 2024

Software engineering has as much a code of ethics as mechanical engineering.

Maybe there's improvement to be had, but this is not a difference between disciplines.

imiric · on Aug 8, 2024

Huh? Please show me a code of ethics taught in software engineering courses.

If one exists, I would like to see your argument that SWEs are adhering to it, and that the software industry is behaving ethically.

raggles · on Aug 8, 2024

I think this is changing, in my country the equivalent of PE will soon be required for a number of engineering disciplines, not just civil. But that's kind of missing the point, which is that professional engineering is guided by fairly comprehensive standards, such as those from the IEC (or ANSI, IEEE), and compliance with those standards is generally a legal requirement.

TacticalCoder · on Aug 8, 2024

> ... because there is none of the of the regulated safety and quality mechanisms required for software that are normally associated with professional engineering.

And then a lot of "real" engineering now require software anyway: self-driving cars (written in part by people who are hacking together webapps by pulling in thousands of NPM dependencies) comes to mind.

The future is honestly a bit scary looking.

On the bright side things are going to get "interesting". At some point in the past we had many "Uber but for ...". Soon we'll have "Clownstrike but for fridges", "Clownstrike but for cars", etc.

Should be fun.

threatofrain · on Aug 8, 2024

An engineer is just a distinction to separate practitioners from people who develop the actual science that engineers depend on. For example, do American engineers who work on food or cosmetics have safety or quality mechanisms? Food and cosmetics are things applied to the body, and thus have a more direct relationship to biological safety or harm. Using science is not the reason why people should face regulations. Some engineers work on legos. Other engineers work for the armed forces.

The Crowdstrike incident is worth billions and people may have died. If you look to the engineer you won't be able to recover billions. Hospitals must absolutely be on the hook as they are the direct interface to their customers; hospitals in turn can sue Crowdstrike.

An event worth billions must have billions in liability in order to prevent perverse incentives. Otherwise hospitals will just say "well McKinsey said it was a good bet, so what gives?"

_heimdall · on Aug 8, 2024

While I agree based on the quality of software development I've seen over 15 years in the industry, I don't think the hard requirement should be a regulatory structure.

That absolutely can work, and does for plenty of industries, but it also creates the potential for a false sense of security until planes start falling out of the sky.

My frustration, and disappointment, in the software industry has generally been the complete unwillingness at scale for us to take on the responsibility to ensure safety and reliability without regulations enforcing it. Plenty of this responsibility (blame?) falls on companies led by individuals who are solely focused on profit and self-interest, but we have to own some of the responsibility as we're the ones agreeing to write and ship bad code.

nostrademons · on Aug 8, 2024

There are plenty of software systems built for security (eg. OpenBSD, Haskell @ Galois, CapROS), but by-and-large customers don't use them. Shiny new features brought quickly to market seems to beat out security and reliability every time. This pattern seems to extend into other industries that have adopted software as well, eg. the auto industry is in the process of transitioning from shipping highly reliable cars that just drive to shipping computers on wheels that frequently can't go.

Understanding why this happens would be an interesting research project. Part of it might be information asymmetry with customers (shiny new features are very visible at sale time and reliability is totally unknown, so customers tend to weight known features over unknown reliability), and part might be principle agent issues (the decision maker who bought the software will have collected their bonus and retired long before the data breach can be attributed to them), and part might be that the market simply hasn't caught up to the negative consequences of all this change and careless companies will be purged by the market in the future.

I'm not terribly fond of regulation as a solution either. It tends to overconstrain industries, prevent innovation, and leave a hole at the lower end of the market that eventually makes products unaffordable. But there should be some quality mechanism that incentivizes decision makers to do the right thing and invest in quality even when there's a cost in features.

_heimdall · on Aug 8, 2024

Removing legal protections for corporations and those in charge would go a long way. For example, if those in charge are personally liable for wrongdoing they would think twice. CrowdStrike as a product may not even exist as it is in that world, a company leader may not want the personal risk of being able to take down a large chunk of the internet. There also may not be security holes to guard if the leaders of an OS company weren't willing to skip security in favor of fancy new AI features.

dopylitty · on Aug 8, 2024

This is a nice idea but the problem isn't the individuals in the system but the system itself. As long as shareholder value/profit is the only factor companies consider this is the end result you will get.

Management is just making decisions based on what the companies value and companies are just valuing what their shareholders value which is more money for the shareholders.

The best way to fix it would be to reform the stock market system so that companies aren't beholden to uninvolved third parties looking to make a quick buck. Only active employees should own stock in companies and sit on company boards.

This would also require reforming the retirement system so retirement money isn't just dumped into the stock market. It needs to instead go somewhere safe and just sit. Retirement funds being in the stock market creates a huge inflationary feedback loop by demanding constant increases in profits which cause companies to raise prices which causes retirement funds to need to be bigger which causes them to demand more profit increases.

_heimdall · on Aug 8, 2024

I'm not opposed to stock market reforms, I'm sure there's good that could be done there. Even with today's stock market setup and companies' fiduciary obligations, if a company could be meaningfully financially by legal actions they would think twice.

Take CrowdStrike for example. If the company and its leadership wasn't so well shielded from financial and legal liability they likely wouldn't have had a process that allowed rolling out an untested update to the entire world at once. Instead, they have a CEO that did effectively the same thing at McAfee before allowing it at CrowdStrike and the company will likely get little more than a financial slap of the wrist.

Would it solve everything? Absolutely not, and other actions like changes to the stock market could help. But it surely would make a difference if leadership and companies knew they could actually be ruined if they are provably negligent or culpable in damages caused.

ThrowawayR2 · on Aug 8, 2024

> "Removing legal protections for corporations and those in charge would go a long way. For example, if those in charge are personally liable for wrongdoing they would think twice."

Cute fantasy about pinning everything on management but people do remember the old adage that "shit rolls downhill" don't they? What that will result in is very onerous processes and certifications mandated by "those in charge" on the people at the bottom to generate ironclad proof of no wrongdoing, at least for themselves. Maybe that is ultimately what this industry needs but it is also going to result in a work environment which really sucks a lot.

_heimdall · on Aug 8, 2024

It isn't about pinning everything on management, that's just as unfair to them as today's setup is for everyone else.

When management actively makes decisions to prioritize profit over security, for example, they should be held personally liable when a security issue occurs. I'm not really sure what a reasonable argument for that not being the case would look like.

If such a setup did result in a shitty work environment, people would ultimately have the option to not work at certain companies or to work for themselves. We can't assume that people must work for big tech and limit ourselves to what works in that sandbox.

nostrademons · on Aug 8, 2024

A lot of things besides CrowdStrike would not exist in that world, notably Windows and your electric utility. Some people might consider that an improvement, but beware unintended consequences.

_heimdall · on Aug 8, 2024

Are you assuming that Windows and electric companies are all run by people knowingly making decisions to cut corners and put the company at risk?

Leaders of an electric company shouldn't be held liable for a lightning strike that starts a fire, for example. But they should be held liable if they purposely decide not to spend the money it takes to maintain power lines and a tree branch that should have been trimmed falls and starts a fire.

There would be consequences of such a system that change what we have today, but I wouldn't expect that to mean we couldn't possibly have things like electric companies.

jordanb · on Aug 8, 2024

My perfect example of the failure of our industry to maintain any professional standards is the widespread use of YAML.

This is used as a configuration and data exchange format despite having no formal definition, resulting in different results based on the parser used, and a weak typing system that has caused many bugs in many applications that use it. This despite the fact that many better, more reliable configuration and data interchange systems existed even before YAML got popular.

beardedwizard · on Aug 8, 2024

I have often blamed software engineers for being complicit, however we should avoid a system that forces a worker to bear the cost of this choice in the first place.

_heimdall · on Aug 8, 2024

That's a bit of a chicken and egg problem, isn't it? We can only avoid a system that forces a worker to bear the cost if they first decide to bear it.

beardedwizard · on Aug 8, 2024

If we force people to choose between paying the bills and cutting corners we know what happens - we have seen this movie many times in history.

I prefer the idealistic view that each individual can make a change through choice, but the reality is that choice is a privilege that isn't evenly distributed across the population. For example some can afford to not shop at Walmart, others can not - paradoxical as it may be from a local economics perspective.

Regulation is the typical blunt instrument to move the incentives to the business leaders rather than the individual. Other commenters don't think regulation is the answer, but I think most agree doing nothing won't change the status quo soon enough.

_heimdall · on Aug 8, 2024

> For example some can afford to not shop at Walmart, others can not - paradoxical as it may be from a local economics perspective.

While I personally agree with the sentiment of your comment in general, this piece really is part of the blind spot in my opinion.

The assumption here is that everyone has to get all of their for from a grocery store, and the only question is what quality of products you can afford. It doesn't have to be that way, and wasn't until very recently in human history.

We almost always have alternatives. They just often seem so extreme as to not be feasible. People can grow their own food though. And at least in the US, we could go without a huge portion of the crap we spend money on every year. We just choose not to. There's absolutely nothing wrong with that choice, but its important to realize it is a choice.

beardedwizard · on Aug 8, 2024

I can see your point. What I was hoping to highlight is slightly different which can be illustrated through your comment on growing your own food - you need the privilege of both time and space to even do that. Lacking both you may be forced to choose something that harms your long term interests like shopping at Walmart and putting local grocers you can't afford out of business.

A good example of this is an urban single parent of multiple kids, time and space are likely very scarce and choices are limited.

eddd-ddde · on Aug 8, 2024

Very interesting. I'm fairly young and I have never associated the term engineering with regulation. When I think engineering I just think problem solving.

petermcneeley · on Aug 8, 2024

"none of the of the" Genius level comment that demonstrates the need for a quality review of ones work.

InvaderFizz · on Aug 8, 2024

I've worked for security companies in different market segments to CrowdStrike. I was not at all surprised by how this went down. The "engineering" practices across the security vendors is practically Rube Goldberg of crap stacked on top of crap.

nostrademons · on Aug 8, 2024

It reminds me of when Tavis broke Sophos [1].

For those who weren't around, Sophos antivirus was the leading "enterprise-strength" security software around. When I started at Google in 2009, all corp-issued Windows laptops had it. Sometime in 2011, we got an internal email to immediately install a critical system update and reboot our machines. When they rebooted, they didn't have Sophos anymore. Then came another e-mail that said that Windows laptops would only be renewed for critical business exceptions, and everybody else was supposed to switch over to Mac or Linux laptops instead. Note that the first Chromebooks started shipping 3 months later.

A couple days later the full report came out, and a partial version was released to the public. The tl;dr was that Tavis Ormandy (then Google's lead security researcher) had done some cursory pentesting on Sophos, and found that it had so many security holes that it was architecturally impossible to make secure. Rather than attempt to bandage the problem, the company decided it was better to give up on Windows entirely.

[1] https://www.forbes.com/sites/andygreenberg/2011/08/04/google...

scotty79 · on Aug 8, 2024

You assume that using Crowdstrike software was a sound business decision before outage. It might have been just well marketed snake oil that was adding very little protection and just didn't hurt until the outage.

farhanhubble · on Aug 8, 2024

I get to hear "but software engineering is way more complex" a lot but the truth is that in the software engineering world, it is very easy to build a mound out of cow dung and hairpins and pass it off for a castle and never be questioned about it. Software is complex and further obscured by wrappers such as cloud services ensuring very low accountability.

ldjkfkdsjnv · on Aug 8, 2024

Software engineering at most companies is just people pounding if statements into codebases until the errors stop. Its not sophisticated, and a 1000 monkeys banging away does end up producing products that mostly work

hello_moto · on Aug 8, 2024

It's true, in 2024 we're still stuck with HTML and JavaScript.

That's how _bad_ things are.

katzinsky · on Aug 8, 2024

Handing a third party company what is essentially a kernel level shell on all your equipment is absolutely insane.

I bought Cloudflare stock after this because it's obvious to me their customers aren't really thinking.

ethbr1 · on Aug 8, 2024

>> No staged deployment {changing to} Add staged deployment

That's the thing that amazed me.

How do you regularly YOLO patches worldwide to something that runs with enough permissions to crash a system?

I don't care if this was a configuration update vs a new sensor capability -- universal rollout should never have been allowed by CrowdStrike's release team.

dataflow · on Aug 8, 2024

My speculation is they probably did stage the rollout to some extent, but didn't have a viable or fast enough feedback mechanism to let them know there was a kernel crash. That seems much more plausible than the engineers being incompetent enough to not have staged rollouts at all. Or they might have had it, but only for code rather than data.

aeyes · on Aug 8, 2024

Crowdstrike in their own RCA wrote that they had no staged rollout for these types of updates: https://www.crowdstrike.com/wp-content/uploads/2024/08/Chann...

And I believe it because for administrators there is no configuration to delay the rollout of these "content updates", you can only delay the sensor updates.

jeberle · on Aug 8, 2024

I think that is being too charitable. The problem is 100% reproducible. The machine blue-screens at boot up. What kind of staging environment allows this to go through?

ethbr1 · on Aug 8, 2024

Staged rollout in this case would be releasing it to a small portion of your customer base, then additional customers, etc.

enragedcacti · on Aug 8, 2024

My outsider's guess is that whoever was making decisions at that level was high on their own supply and decided that a staged rollout would result in time where they weren't protecting all those other machines from imminent and certain catastrophe.

I'm still awestruck that any engineer would be willing to ship code in that setup, though I guess its also possible that they were being misled about how much testing was going on at QA.

ethbr1 · on Aug 8, 2024

That was my initial reaction too.

Given it's a security product, you'd want everyone protected on Day 0 that you have a new release, no?

Except on Day -1, no one was protected.

So what's magical about the day CrowdStrike decides to ship an update?

I can imagine there are possibly scenarios where mass-release would make sense (aggressive vulnerability spreading rapidly), but that can't be every day, can it?

eugenekolo · on Aug 8, 2024

Sentiment might hold some merit, but this article is 80% copy pasting from an RCA report and 2 sentences saying nothing more than "This shouldn't happen" while offering no alternative or deep thought into improvement...

b-man · on Aug 8, 2024

I did provide an alternative. Formally verify their software, making implausible for such error to occur. I have linked to a peer reviewed article that goes in-depth about such.

dwheeler · on Aug 8, 2024

The Crowdstrike report explains why it crashed, but not how it passed final end-to-end testing. There appears to have been many tests of piece parts (unit testing), but that's not the same as testing the full system.

I would think all the end-to-end tests of the full system would have been instantly detected the problem and prevented it, because it would have failed all the end-to-end tests.

Did I miss something? Did they never test the complete system as deployed? Looks like it, but maybe I misunderstood something.

tadfisher · on Aug 8, 2024

From what I understand, this was a failure in parsing a configuration update file, and while they do end-to-end testing for releases of the Falcon code with the latest configuration, they don't (or didn't) do so against all configuration updates in the interim.

dwheeler · on Aug 8, 2024

In other words, they failed to do end-to-end testing in some cases. They tested pieces, but not the full system they released.

If you don't test each config update, by definition you didn't test it.

Rust has many nice properties, but it can panic too. You should still test the whole system before each release if it's critical.

greenthrow · on Aug 8, 2024

Extremely low quality post by the submitter. Yes these shouldn't happen, but software engineers -- so far -- are all human. It's more useful to talk about the ways this could be mitigated than to just post a few sentences repeating that it shouldn't happen.

megablast · on Aug 8, 2024

It should never be up to the dev team. There should be QA and testing to stop this stuff as well. And rolling updates to catch the system in time.

cangencer · on Aug 8, 2024

The article is extremely shallow beyond saying "Formal verification a la SPARK" should have been used, while not offering how this could actually work in the real world - I don't think the author has any experience working on any similar piece of software either.

While such techniques are available, would they be really applicable in a very dynamic environment such as with millions of PCs running various windows versions, needing continuous / real-time updates.

And yes, we of course know that QA and testing magically removes all possible failure modes/bugs.

b-man · on Aug 8, 2024

> While such techniques are available, would they be really applicable in a very dynamic environment such as with millions of PCs running various windows versions, needing continuous / real-time updates.

I don't see much difference in complexity between the affected software and the several existing formally verified software. At the very least the parser/interpreter could very much be formally verified.

But my point is, have they tried? They don't seem to be even aware of such.

ThrowawayR2 · on Aug 8, 2024

The industry doesn't want to pay what it costs to support the type of SDETs capable of doing QA on kernel drivers. Anyone that talented is going to do the math and switch to being a developer for a much bigger paycheck instead.

b-man · on Aug 8, 2024

I have proposed a solution, and linked to a peer reviewed paper that goes in depth about it. What else can I do?

siliconc0w · on Aug 8, 2024

What commonly happens in these organizations is they have a software delivery path that has a lot of these best practices but soon people figure out that it is too slow so they invent a new, faster, path. From what I can tell Crowdstrike had a lot of the usual best practices like canary rollouts on their binary but they didn't on this configuration file despite it having the same consequences of a bad binary push. This wasn't even an edge case, it reliably BSOD'd every windows machine that got this update.

One strategy Google SRE uses is that the team ensuring reliability has a different reporting path than the product team - so there is always a check and balance when things like rollout policies get worked around by clever product teams.

It's a shame because I hear it's actually a pretty good product.

zamadatix · on Aug 8, 2024

What is whooshing over my head about "Figure 1"?

hipadev23 · on Aug 8, 2024

All his blogposts have a church image for some reason. This particular one I found amusing and now wonder if it’s intentional:

* Saint Nedelya (church in the blog)

* Satya Nadella (microsoft ceo)

satisfice · on Aug 8, 2024

I’m not concerned about the technical solutions. Any technical solution has to be implemented by people.

The thing not mentioned in CrowdStrike’s report is anything about people— especially management. Bad management and understaffed teams will defeat any technical solution, any day.

halayli · on Aug 8, 2024

> Multiple engineers identified the issue via analysis of stack dumps as being triggered by a null pointer bug in the C++ the Crowdstrike update was written in; it appears to have tried to call an invalid region of memory that results in a process getting immediately killed by Windows, but that take looked increasingly controversial and Crowdstrike itself said that the incident was not due to "null bytes contained within Channel File 291 [the update that triggered the crashes] or any other Channel File."

randerson · on Aug 8, 2024

Nation-state hackers of the world must love the idea of a supply chain that pushes out immediate untested updates to half the US Fortune 500, to be processed by a C++ kernel driver. If CrowdStrike's goal is to secure companies at scale, they could easily be doing the opposite.

luxuryballs · on Aug 8, 2024

I still think it was intentional, someone activated the CrowdStrike feature that was purchased by the DoD.

Maybe people with inside knowledge of recent events were trying to make an exit so they had to smash the glass and hit the red button to stop air travel so they could snag them in time?

Making it a perfect update failure is clever enough, but the name of the product is the best part. Imagine a system that can stop breaches even after they occur ;)

Dwedit · on Aug 8, 2024

So you have an "Index Out Of Bounds" problem. It could either directly lead to reading out-of-bounds memory and generating an Access Violation exception, or you could see the out-of-bounds array access and throw an exception.

Either way, you've got a kernel-mode exception that isn't being caught, and that's a BSOD.

deathanatos · on Aug 8, 2024

You wrap the particular extension such that if it crashes, there's a sensible fallback option, such as restarting it, etc. that doesn't take out the entire system.

Many of the ideas of Tanenbaum would seem applicable towards preventing such a thing.

insane_dreamer · on Aug 8, 2024

That's quite a list of problems with that update; wasn't just a single bug that slipped through the cracks.

999900000999 · on Aug 8, 2024

CrowdStrike outsourced their SDET positions to save a buck.

This is what happens. Stop skimping on QA.

jokoon · on Aug 8, 2024

I'm not a fan of rust, but if microsoft required that those sort of critical software be written in rust, it would be a good thing.

Anything that is doing something sensitive or critical that can crash the system should be written in rust.

If not, insurance companies would be mandated by law to run static analysis on such C++ code.

echelon · on Aug 8, 2024

There is no reason to not use Rust for these systems anymore. It's why the US Government is pushing so much for Rust adoption.

We're going to keep seeing these horror stories until C/C++ go away.

MaulingMonkey · on Aug 8, 2024

>> According to the RCA, the essence of what happened was an index out of bounds

While Rust's behavior of panicing is saner than C++'s behavior of undefined behavior for a typical index out of bounds bug (`slice[index]`), I would imagine it'd still result in BSODs and outages in a kernel context.

Admittedly, safe bounds checking (`slice.get(index) → Option<T>`) might be easier in Rust than in C++, but it's not a given that a RIIR would've avoided this bug.

AdieuToLogic · on Aug 8, 2024

> There is no reason to not use Rust for these systems anymore.

While Rust is a fine language, there are others which may be better suited for systems which have a global impact such as CrowdStrike does.

Ada: https://en.wikipedia.org/wiki/Ada_(programming_language)

SPARK: https://en.wikipedia.org/wiki/SPARK_(programming_language)

I am an advocate for neither and only reference their applicability for this problem domain based on the languages' design intent.

jki275 · on Aug 8, 2024

Rust isn't a panacea, and the US Government really isn't pushing for Rust adoption.

victor106 · on Aug 8, 2024

> It's why the US Government is pushing so much for Rust adoption.

Can you provide any references for this?

AlotOfReading · on Aug 8, 2024

There's been a number of reports released by the government that recommend adopting memory safe languages as opposed to C/C++. The White House put out a memo [0] for example, and NIST has this page [1]. Since there are no other systems languages with widespread developer mindshare, some people see these as primarily recommending Rust.

[0] https://www.whitehouse.gov/oncd/briefing-room/2024/02/26/mem...

[1] https://www.nist.gov/itl/ssd/software-quality-group/safer-la...

echelon · on Aug 8, 2024

> some people see these as primarily recommending Rust.

That's not the reason. The DoD really is focused on Rust [1, 2].

[1] https://www.darpa.mil/program/translating-all-c-to-rust

[2] https://www.darpa.mil/news-events/2024-07-31a

AlotOfReading · on Aug 8, 2024

The federal government is a lot bigger than even the DoD, and the other agencies (e.g. NIST as linked above) have been very careful to make sure they aren't only recommending Rust, even if it's clearly a major alternative.

jimbob45 · on Aug 8, 2024

What about MISRA C and C++ with static analysis tools?

Also all GC languages would pretty easily solve the problem including Java and C#.

AlotOfReading · on Aug 8, 2024

It's perfectly possible to write the crowdstrike issue in C/C++ code that will pass any MISRA checker, because MISRA checkers are unable to verify conformance with the standard. It basically says "don't do this" and leaves it up to the developer to ensure.

Static analysis tools (broadly) are not powerful enough to ensure memory safety for nontrivial programs without a high developer burden in the form of annotations or restricted language subsets.

GC languages are generally not appropriate for kernel level systems programming like crowdstrike was doing. Java has some history in this space, but generally without niceties like garbage collection that typical developers are used to.

RockRobotRock · on Aug 8, 2024

https://www.whitehouse.gov/oncd/briefing-room/2024/02/26/pre...

https://media.defense.gov/2022/Nov/10/2003112742/-1/-1/0/CSI...

kevindamm · on Aug 8, 2024

They may be referring to DARPA's research project TRACTOR:

https://www.darpa.mil/program/translating-all-c-to-rust

timschmidt · on Aug 8, 2024

He's likely referring to this: https://www.whitehouse.gov/oncd/briefing-room/2024/02/26/pre...

notarealllama · on Aug 8, 2024

If you follow CISA vulns and guidelines, it's all industrial controllers getting hacked and pushing for memory safety for like a year straight.

(On a side note, pleasantly surprised to see a Linux kernel android bug today)

keithalewis · on Aug 8, 2024

We're from the Government and we're here to help.

You need a license to operate a backhoe, but not to write software.

MBCook · on Aug 8, 2024

Oh who cares about Rust. This could have been written in many different languages safely, including C/C++. Would it have been easier in Rust? Maybe. But it's also a legacy codebase. So we shouldn't be surprised it wasn't in Rust.

There are SO MANY things that went wrong. They don't even seem to care about half of them.

Ok, it wasn't tested. It wasn't staged. It wasn't a slow roll-out over hours. It wasn't verified from the deployment servers.

How about why is it parsed in kernel space at all? I didn't see that listed.

Why didn't you have a mechanism to catch faulty updates at runtime and flag them so that you'd be able to survive the next boot? Mark somewhere on the FS "I'm loading #19583". If you boot up and find that, it means you failed, so skip it next boot. Succeed? Remove the mark. Tada. Resiliency.

Why are you even doing this stuff yourself? There are Microsoft APIs for a bunch of this stuff that would have helped reduce risk. Oh, because your code is deployed on XP and up and you don't want to have 2 or 3 versions? That went well didn't it. I bet if you were using those you wouldn't have crashed the kernel space.

Do you fuzz? We know you don't. You broke Windows. You broke Macs. You broke Linux. You keep doing this. You're not good at being resilient. You had to face-plant in front of the entire world before you decided to do something. I didn't see fuzzing mentioned.

Those mitigations you propose are not enough. They'll help, but you're still vulnerable. You're supposed to be a _security company_, I'd expect a better list of how to secure software from a security company.

By all means, use Rust or Swift or (insert other memory safe language) if you want. That's smart. But if their existing kernel module had no UB and buffer overruns, it would still be doing things a kernel module shouldn't. The language involved is just a very tiny part.

jiggawatts · on Aug 8, 2024

> ...it means you failed, so skip it next boot. Succeed? Remove the mark. Tada. Resiliency.

Better yet, it's high time that PCs booted like (some) mobile phones do: there's two copies of the OS. The running copy is read-only. The other copy can only be replaced with a monolithic "image" that has is validated with a HS256-based digitally signed Merkle hash. Either the entire image is bit-for-bit valid, or it is rejected.

If booting to the new image fails twice, automatically roll back to the previous known-good image.

This isn't "hard", it isn't even that much work. Heck, Windows can already boot from WIM images and VHDX disk images! The code is already in there, it just needs to be turned on.

MBCook · on Aug 8, 2024

This wasn’t an OS problem though, so that wouldn’t have helped. It was the data their module loaded which they change constantly. That would make it a very bad fit for a sealed system partition.

Also I believe Apple Silicon Macs do what you suggest, at least in part. I don’t know if they can fall back after an upgrade failure but everything is validated and signed and tamper proof.

jiggawatts · on Aug 8, 2024

It was a kernel-mode code issue however. The same design could be applied: customisations get put into duplicated read-only overlay images. Any boot issue automatically triggers a rollback, and a manual option is provided to roll back to the vanilla base image as well in the troubleshooting boot menu.

aeyes · on Aug 8, 2024

> Oh, because your code is deployed on XP and up

Not even, it only runs on 7 / 2008 Server and up.