Hacker News

The prompting might be an issue, but I think the larger sign that it's not quite there yet in terms of symbolic reasoning is that, even by their own paper, GPT-4 scores only 30 on the AMC 10 (out of 150), whereas leaving the test entirely blank would earn you 37.5 (blank answers score 1.5 points each across 25 questions). And this is a closed-ended multiple-choice test, so the conditions should be favorable for it to dominate.
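For reference, a quick sketch of the AMC 10 scoring arithmetic behind those numbers (6 points per correct answer, 1.5 per blank, 0 per wrong, 25 questions):

```python
def amc10_score(correct: int, blank: int, questions: int = 25) -> float:
    """AMC 10 scoring: 6 points per correct answer, 1.5 per blank,
    0 per wrong answer, out of 25 questions (max score 150)."""
    wrong = questions - correct - blank
    assert wrong >= 0, "correct + blank cannot exceed total questions"
    return 6 * correct + 1.5 * blank

print(amc10_score(correct=0, blank=25))   # all blank -> 37.5
print(amc10_score(correct=25, blank=0))   # perfect paper -> 150.0
print(amc10_score(correct=5, blank=0))    # 5 right, 20 wrong -> 30.0
```

So a score of 30 is what you'd get from answering everything and landing only 5 correct, while pure abstention beats it.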

Edit: Although this might be unfair, considering that LLMs are known to be poor at calculation. (Maybe it would do better on proof-style contests, like the USAMO?) I wonder how well ChatGPT with Wolfram Alpha integration would do.


