You can't "hold everything else constant" because the set of programs that satisfy whatever type system is a proper subset of the set of valid programs.
The claim that admitting a larger set of programs improves reliability goes through several speculative steps: that the extra programs are correct, that they are simpler, that simplicity leads to fewer mistakes, and that those mistakes would not have been caught elsewhere. None of that follows from the language semantics. It’s a human-factor argument layered on top of assumptions.
By contrast, static typing removing a class of runtime failures is immediate and unconditional. Programs that would fail at runtime with type errors simply cannot execute. No assumptions about developer skill, code style, review quality, or time pressure are needed.
Even in practice, this is why dynamic languages tend to reintroduce types via linters, contracts, or optional typing systems. The extra expressiveness doesn’t translate into higher reliability; it increases the error surface and then has to be constrained again.
So the expressiveness argument doesn’t invalidate the claim. It changes the topic. One side is a direct property of the language. The other is a speculative, multi-step causal story about human behavior. That’s why the original claim is neither unfalsifiable nor “just vibes.”
So regardless of speculative human factors, the claim stands: holding the language semantics constant, static typing strictly reduces the set of possible runtime failures, and therefore strictly increases reliability in the only direct, non-contingent sense available.