Really not! This is a huge faceplant for writing things in Rust. If they had been writing their code in Java/Kotlin instead of Rust, this outage either wouldn't have happened at all (a failure to load a new config would have been caught by a defensive exception handler), or would have been resolved in minutes instead of hours.
The most useful thing exceptions give you is not static compile time checking, it's the stack trace, error message, causal chain and ability to catch errors at the right level of abstraction. Rust's panics give you none of that.
Look at the error message Cloudflare's engineers were faced with:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
That's useless, barely better than "segmentation fault". No wonder it took so long to track down what was happening.
A proxy stack written in a managed language with exceptions would have given an error message like this:
com.cloudflare.proxy.botfeatures.TooManyFeaturesException: 200 > 60
at com.cloudflare.proxy.botfeatures.FeatureLoader(FeatureLoader.java:123)
at ...
and so on. It'd have been immediately apparent what went wrong. The bad configs could have been rolled back in minutes instead of hours.
In the past I've been able to diagnose production problems based on stack traces so many times that I've been expecting an outage like this ever since the trend away from providing exceptions in new languages in the 2010s. A decade ago I wrote a defense of the feature, and I hope we can now have a proper discussion about adding exceptions back to languages that need them (primarily Go and Rust):
That has nothing to do with exceptions, just the ability to unwind the stack. Rust can certainly give you a backtrace on panics; you don’t even have to write a handler to get it. I would find it hard to believe Cloudflare’s services aren’t configured to do it. I suspect they just didn’t put the entire message in the post.
tldr: Capturing a backtrace can be a quite expensive runtime operation, so the environment variables allow either forcibly disabling this runtime performance hit or allow selectively enabling it in some programs.
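For anyone who wants to see what that looks like, here is a minimal sketch of a panic hook that always prints a backtrace, whatever the environment variables say. The names `install_verbose_panic_hook`, `worker_panicked` and `demo_worker` are made up for illustration, and this is not a claim about how Cloudflare's services are configured.

```rust
use std::backtrace::Backtrace;
use std::panic;
use std::thread;

/// Installs a process-wide panic hook that prints the thread name,
/// the panic message with its source location, and a backtrace.
fn install_verbose_panic_hook() {
    panic::set_hook(Box::new(|info| {
        // force_capture ignores RUST_BACKTRACE / RUST_LIB_BACKTRACE,
        // so the trace is collected even when those are unset.
        let backtrace = Backtrace::force_capture();
        let current = thread::current();
        let name = current.name().unwrap_or("<unnamed>");
        eprintln!("thread '{name}' {info}\nstack backtrace:\n{backtrace}");
    }));
}

/// Spawns a named worker that unwraps an Err (hypothetical failure),
/// then reports whether the worker thread panicked.
fn worker_panicked() -> bool {
    let handle = thread::Builder::new()
        .name("demo_worker".to_string())
        .spawn(|| {
            let loaded: Result<u32, String> =
                Err("200 features > limit of 60".to_string());
            loaded.unwrap()
        })
        .expect("failed to spawn worker thread");
    // join() returns Err if the thread panicked.
    handle.join().is_err()
}

fn main() {
    install_verbose_panic_hook();
    println!("worker panicked: {}", worker_panicked());
}
```

Without a custom hook, setting `RUST_BACKTRACE=1` in the environment makes the default hook print a backtrace as well. How readable the frames are depends on whether the binary was built with debug info.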
It's one of the problems with using result types. You don't distinguish between genuinely exceptional events and things that are expected to happen often on hot paths, so the runtime doesn't know how much data to collect.
A panic is the exceptional event. It so happens that Rust doesn't print a stack trace in release unless configured to do so.
Similarly, capturing a stack trace in an error type (within a Result, for example) is perfectly possible. But this is a choice left to the programmer, because capturing a trace is not cheap.
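A rough sketch of that second option, assuming a hypothetical `TooManyFeatures` error (the name and the 200 > 60 numbers are borrowed from the Java example upthread, not from any real Cloudflare code):

```rust
use std::backtrace::Backtrace;
use std::fmt;

/// Hypothetical error type that carries a backtrace alongside the details.
#[derive(Debug)]
struct TooManyFeatures {
    count: usize,
    limit: usize,
    /// Captured where the error is created. Backtrace::capture() only
    /// collects frames when RUST_BACKTRACE or RUST_LIB_BACKTRACE enables it.
    backtrace: Backtrace,
}

impl fmt::Display for TooManyFeatures {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "too many features: {} > {}", self.count, self.limit)
    }
}

impl std::error::Error for TooManyFeatures {}

/// Returns the count if it fits, or a descriptive error if it does not.
fn check_feature_count(count: usize, limit: usize) -> Result<usize, TooManyFeatures> {
    if count > limit {
        return Err(TooManyFeatures {
            count,
            limit,
            backtrace: Backtrace::capture(),
        });
    }
    Ok(count)
}

fn main() {
    match check_feature_count(200, 60) {
        Ok(n) => println!("loaded {n} features"),
        // The caller decides what to do: log, keep the old config, etc.
        Err(e) => eprintln!("config rejected: {e}\n{}", e.backtrace),
    }
}
```

The cost is paid only on the error path, and only when backtraces are enabled, which is the trade-off being described: the programmer opts in where the diagnostics are worth it.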
There's clearly a big gap in how things are done in practice. You wouldn't see anyone call System.exit in a managed language if a data file was bigger than expected. You'd always get an exception.
I used to be an SRE at Google. Back then we also had big outages caused by bad data files pushed to prod. It's a common enough issue so I really sympathize with Cloudflare, it's not nice to be on call for issues like that. But Google's prod environments always generated stack traces for every kind of failure, including CHECK failures (panics) in C++. You could also reflect the stack traces of every thread via HTTP. I used to diagnose bugs in production under time pressure quite regularly using just these tools. You always need detailed diagnostics.
Languages shouldn't have panics, tbh, it's a primitive concept. It so rarely makes sense to handle errors that way. I know there's a whole body of Rust/Go lore claiming panics are fine, but it's not a good move and is one of the reasons I've stayed away from Go over the years and wouldn't use Rust for anything higher than low level embedded components or operating system code that has to export a C ABI. You always want diagnostics and recoverable errors; this kind of micro-optimization doesn't make sense outside of extremely constrained embedded environments that very few of us work in.
An uncaught exception in C++ or an uncaught panic in Rust terminates the program. The unwinding is the same mechanism. I think the implementation is what comes with LLVM, but I haven't checked.
I was also a Google SRE, and I liked the stack trace facilities so much that I got permission to open source a library inspired by them: https://github.com/bombela/backward-cpp (I know I am not doing a great job maintaining it)
At Uber I implemented similar stack trace introspection for RPC tasks via HTTP for Go services.
You can also catch a Go panic, which we did in our RPC library at Uber.
It would be great for all of that to somehow come ready made though. A sort of flag "this program is a service, turn on all the good diagnostics, here is my main loop".
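Nothing like that flag exists as far as I know, but the "catch it at the task boundary" part can be sketched by hand. Below is a rough Rust analogue of recovering from a panic in an RPC handler, using `std::panic::catch_unwind`. The helper name `run_guarded` and the failing task are hypothetical, and this only works when the binary is built with the default `panic = "unwind"` strategy.

```rust
use std::panic::{self, UnwindSafe};

/// Runs one unit of work. If it panics, the panic is caught and its
/// message is returned as an Err instead of taking the thread down.
fn run_guarded<T>(task: impl FnOnce() -> T + UnwindSafe) -> Result<T, String> {
    panic::catch_unwind(task).map_err(|payload| {
        // Panic payloads are usually a &'static str or a String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "non-string panic payload".to_string()
        }
    })
}

fn main() {
    let outcome = run_guarded(|| {
        // Hypothetical failure while loading a new config.
        let parsed: Result<u32, String> = Err("feature file too large".to_string());
        parsed.unwrap()
    });
    match outcome {
        Ok(n) => println!("new config applied: {n}"),
        Err(msg) => eprintln!("task panicked, keeping previous config: {msg}"),
    }
}
```

The panic hook still runs before the unwind is caught, so any backtrace printing configured there is not lost. Whether catching panics this way is a good idea is exactly what the thread is arguing about; the sketch only shows that the mechanism is available.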
https://blog.plan99.net/what-s-wrong-with-exceptions-nothing...