The problem is not (only) the context size; it's, for lack of a better word, focus. GPT-4 can easily get lost in too much information and produce results that don't work well, duplicating code or solving a problem incoherently, and it needs a lot of handholding.
Imagine GPT-4 (or any other LLM) as a very eager but not very bright junior developer who just started working with you. It's capable, but it needs a lot of situational management to keep it from going wildly off track.
What improved models will bring us is making Pythagora and similar tools work better on large, more complex projects. The tipping point will come when, for example, you can load Pythagora into Pythagora and continue its development. While we do build some auxiliary/external tooling with Pythagora, the core is still mostly handcrafted by humans, and I'm pretty sure that's the case for other code-gen tools as well.