I think Bryan Cantrill put it well in one of his more recent talks. Jepsen reports used to be a lot more fun because projects made grand claims first, and then Jepsen came to actually test them. Now projects run Jepsen ahead of time to qualify their claims, so there are fewer big GOTCHA! moments.
So, less fun but better quality products. A few more years and maybe Jepsen won't find many flaws at all, as they'll be identified in development using the same tools.
I agree, but I should also note that many of the recent analyses I've published were on systems which wrote their own tests and missed significant errors. Like, "database loses updates frequently in normal operation" significant!
I think a bunch of factors are in play here. Sometimes vendors discover bugs with Jepsen, then adjust the test so that it passes, rather than fixing the underlying bug. Consul, IIRC, found errors with Jepsen, and adjusted a timeout parameter to prevent the test from observing concurrent primary nodes, which meant the test passed. The underlying problem was still there--but no longer visible. Sometimes you'll see people turn a workload that does a read-modify-write transaction into a single-statement update transaction, in such a way that the database can execute it atomically. That's still a legal and sensible test, but may not measure the same type of transactional isolation.
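Here's a minimal sketch of that last distinction, using Python's stdlib sqlite3 (the table and column names are made up for illustration). The first increment reads a value and writes it back in separate statements, which is exactly the kind of interleaving that exercises transactional isolation; the second is a single statement the database can execute atomically, so it can pass even on a system with weaker isolation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES (1, 0)")
conn.commit()

# Read-modify-write: two statements. Another transaction could interleave
# between the SELECT and the UPDATE, so this measures isolation.
v = conn.execute("SELECT value FROM counters WHERE id = 1").fetchone()[0]
conn.execute("UPDATE counters SET value = ? WHERE id = 1", (v + 1,))
conn.commit()

# Single-statement update: the database executes it atomically, so this
# workload can pass without measuring the same isolation property at all.
conn.execute("UPDATE counters SET value = value + 1 WHERE id = 1")
conn.commit()

print(conn.execute("SELECT value FROM counters WHERE id = 1").fetchone()[0])  # 2
```

Both workloads are "legal", but only the first one can observe a lost update caused by an interleaved writer.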
Other times, vendors write a custom checker, but haven't tested to make sure that it can catch the anomaly that it's supposed to. Maybe a checker is copy-pasted from another test I wrote for a different system which used different names or data structures to represent histories, so it executes successfully, but in a trivial way--maybe it fails to realize any elements were added to a set, so it assumes every read is legal, since all 0 elements are present!
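To make that failure mode concrete, here's a hypothetical sketch (the operation names and history format are invented for illustration). The checker was "ported" from a system whose histories tagged writes as "add-legacy", but the new system tags them "append", so the set of expected elements is always empty and every read is vacuously legal:

```python
# Hypothetical checker, copy-pasted from a test for a different system.
def check_set_reads(history):
    # Bug: this system tags writes "append", not "add-legacy", so `added`
    # is always empty and no read can ever be found missing anything.
    added = {op["value"] for op in history if op["f"] == "add-legacy"}
    for op in history:
        if op["f"] == "read":
            missing = added - set(op["value"])
            if missing:
                return {"valid": False, "missing": missing}
    return {"valid": True}

history = [
    {"f": "append", "value": 1},  # the new system's name for a set add
    {"f": "append", "value": 2},
    {"f": "read", "value": []},   # both writes lost!
]
print(check_set_reads(history))  # {'valid': True} -- all 0 elements present
```

A quick sanity check is to feed the checker a deliberately broken history like this one and confirm it actually fails.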
Other times tests are correctly written, but only run for a short time--maybe the nemesis schedule they designed doesn't actually cause leader elections to take place, so you never observe that phase change in the system. Or the test only performs a handful of operations and sits idle until the end of the test. Or every request fails, and the test passes trivially, since we never observed an illegal result. Some of these mistakes I can address with better analyzers and by prompting users with guided error messages. Others you have to qualitatively infer by looking at the graphs and histories. That's something I work on training people to do in my classes, but I haven't written good guides for it online yet.
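The "every request fails" trap can be sketched in a few lines (again, the history format here is hypothetical). A checker that only flags observed illegal reads has nothing to flag when no operation ever completes, so it passes vacuously; a better analyzer would also report that zero operations succeeded:

```python
# Hypothetical checker: only flags *observed* illegal reads.
def check(history):
    ok_reads = [op for op in history if op["f"] == "read" and op["type"] == "ok"]
    bad = [r for r in ok_reads if r["value"] is None]
    # Reporting the ok-read count alongside validity makes a vacuous
    # pass (zero successful operations) visible to a human reader.
    return {"valid": not bad, "ok-reads": len(ok_reads)}

# Every request failed -- e.g. the client could never connect.
history = [{"f": "read", "type": "fail", "value": None} for _ in range(100)]
print(check(history))  # {'valid': True, 'ok-reads': 0} -- a vacuous pass
```

The `valid` result is technically true, but `ok-reads: 0` is the tell that the test never observed anything at all.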
And of course, there's a familiarity problem: with Jepsen, I've been working on and with this tool for six years, so I'm intimately familiar with its behavior and testing philosophy. So much of that knowledge is implicit and intuitive for me, and users who are adopting the tool fresh don't have the benefit of that experience! It's something you can build with time, and I can help transfer it with writing and teaching. Working on it! :)