I've done a lot of work in ML numerics, and I think TF32 is a completely safe drop-in for FP32 for ML workloads. NVIDIA seems to agree, which is why on the A100 it won't even be opt-in: it will be the default mode for any FP32 matrix multiply.
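To make the "safe drop-in" claim concrete: TF32 keeps FP32's 8-bit exponent (so the same dynamic range) but only 10 explicit mantissa bits instead of 23. Here's a rough sketch of that precision loss in pure Python; note that real TF32 hardware rounds to nearest, while this toy version just truncates the low mantissa bits for simplicity:

```python
import struct

def tf32_truncate(x: float) -> float:
    # TF32 keeps FP32's 8-bit exponent but only 10 explicit
    # mantissa bits (FP32 has 23). Zeroing the low 13 mantissa
    # bits approximates the precision you lose. (Real TF32
    # rounds to nearest; this sketch truncates.)
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # drop the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# pi survives to ~3-4 decimal digits, which is plenty for
# gradient noise in typical ML training
print(tf32_truncate(3.14159265))  # -> 3.140625
```

That ~3 decimal digits of mantissa is why it's fine for training, where gradient noise dwarfs rounding error, but not something you'd use for, say, a linear solver.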
But on the 3090, I don't think the speedup will be 5x; it should be closer to 2x. The 3090 does 35.6 TFLOPS at TF32, while the Titan RTX does 16.3 TFLOPS at FP32. Once again, I think there's handicapping going on with the 3090.
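The back-of-envelope arithmetic behind that 2x figure, using the peak throughput numbers above (dense, no sparsity):

```python
# Peak-throughput ratio: 3090 TF32 tensor-core rate vs.
# Titan RTX FP32 rate, per the spec-sheet numbers quoted above.
rtx_3090_tf32_tflops = 35.6
titan_rtx_fp32_tflops = 16.3

speedup = rtx_3090_tf32_tflops / titan_rtx_fp32_tflops
print(f"{speedup:.2f}x")  # -> 2.18x, nowhere near 5x
```

Real matmul kernels won't hit either peak exactly, but the ratio of peaks is a reasonable ceiling on the generational speedup.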