They can typically prove correctness only for specific input data, and even then the outcome often depends on runtime or environment factors that cause some fraction of invocations to fail. If a single invocation succeeds, does that make the code correct? How can you be sure?
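
To put a rough number on that doubt, here is a minimal sketch. It assumes each invocation fails independently with a fixed probability, which is an idealization: real environment-dependent failures are often correlated. The helper names `prob_all_pass` and `runs_needed` are hypothetical and exist only for this illustration.

```python
import math


def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Chance that `runs` independent invocations all succeed.

    Assumption: every invocation fails with the same probability
    `failure_rate`, independently of the others.
    """
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of runs that shows at least one failure
    with probability >= `confidence`, under the same assumption."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    rate = 0.01  # assumed failure rate, chosen only for illustration
    print(f"one run passes with probability {prob_all_pass(rate, 1):.2f}")
    print(f"runs needed to see a failure (95%): {runs_needed(rate)}")
```

Under an assumed 1% failure rate, a single passing run is the expected outcome 99% of the time, so it says almost nothing about correctness. Roughly 299 runs are needed before a failure would show up with 95% probability.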