Could one take real recorded noise and add that, rather than noise generated by an algorithm? Wouldn't that force attackers to solve a real problem (removing background noise from a speech sample)?
It's not really solving the "real" problem... If I'm just mashing two audio files together, that's going to be different from someone talking in the middle of a train platform, and there will likely be algorithmically detectable differences between the artificially generated words and the naturally generated noise.
All of this aside, removing background noise is not a huge issue anymore. We have pretty decent noise-cancellation technology. Speech recognition - the other big component - has advanced a lot in recent times and is actually pretty good, although not for every company/product.
Even if it were helpful, you'd have to record an incredible amount of noise in the first place: you're getting millions of hits a day, and if you have a small sample set, the attackers will just figure out the solutions to that sample set and be done.
I'm not saying it's impossible, but I am saying it's probably not worth it at this point. Captchas (in their traditional forms) don't make sense as a long-term strategy anyways.
If the attacker manages to obtain all the random noise, they could index every window of the noise in a k-d tree, perform an efficient nearest-neighbour search for the exact background in the CAPTCHA audio, and then simply subtract that background. That gives perfect segmentation in O(log N) asymptotic average time for N windows (at 64 kHz and 2000 hours of audio, with one window starting at each sample, N = 460,800,000,000 and ln N ≈ 26.86).
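To make the lookup-and-subtract idea concrete, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the names (`build`, `nearest`, `find_and_subtract`), the tiny 4-sample window `DIM`, the synthetic "noise", and above all the premise that the CAPTCHA clip starts with a short stretch of silence so that the first window is pure background.

```python
import random

DIM = 4  # window length used as the k-d tree key (hypothetical, tiny on purpose)

def build(points, depth=0):
    """Build a k-d tree from (window, offset) pairs as nested tuples."""
    if not points:
        return None
    axis = depth % DIM
    points.sort(key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def sqdist(a, b):
    """Squared Euclidean distance between two windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, target, best=None):
    """Return (squared distance, offset) of the stored window closest to target."""
    if node is None:
        return best
    (window, offset), axis, left, right = node
    d = sqdist(window, target)
    if best is None or d < best[0]:
        best = (d, offset)
    diff = target[axis] - window[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    # Only descend into the far side if it could still hold a closer window.
    if diff ** 2 < best[0]:
        best = nearest(far, target, best)
    return best

def find_and_subtract(tree, noise, captcha):
    """Locate the background by its first window, then subtract it sample by sample."""
    _, offset = nearest(tree, captcha[:DIM])
    background = noise[offset:offset + len(captcha)]
    return offset, [c - n for c, n in zip(captcha, background)]

# Synthetic stand-in for the leaked noise corpus (assumption: attacker has all of it).
rng = random.Random(0)
noise = [rng.uniform(-1, 1) for _ in range(2000)]
tree = build([(noise[i:i + DIM], i) for i in range(len(noise) - DIM + 1)])

# A fake CAPTCHA: DIM samples of silence, then some "speech", mixed onto the noise.
speech = [0.0] * DIM + [0.5, -0.5, 0.25, 0.75]
true_offset = 1234
captcha = [n + s for n, s in zip(noise[true_offset:], speech)]

offset, recovered = find_and_subtract(tree, noise, captcha)
```

Here `recovered` should match `speech` up to floating-point rounding. Real windows would be far longer than four samples, and k-d trees lose their advantage in high dimensions, so an actual attacker would probably reduce each window to a handful of features before indexing.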