Ha! That was funny. I wonder, though: since it gets fed tons of code, couldn't Godbolt leverage its code -> compiler object -> assembly pipeline as a means to train an AI decompiler? Food for thought.
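To make the idea concrete, here is a rough sketch of how (source, assembly) training pairs could be collected. It does not describe how Godbolt actually works; it assumes a local `gcc` on the PATH, and the helper names `compile_command` and `make_pair` are made up for illustration. The only compiler flags used are `-S` (stop after generating assembly), an optimization level, and `-o`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def compile_command(source, output, compiler="gcc", opt="-O2"):
    """Build the command line that asks the compiler to emit assembly only."""
    return [compiler, "-S", opt, "-o", str(output), str(source)]


def make_pair(c_source, compiler="gcc", opt="-O2"):
    """Return one training record, or None if the compiler is not installed.

    The record keeps the toolchain details next to the text, so a model
    could be conditioned on compiler and optimization level.
    """
    if shutil.which(compiler) is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.c"
        asm = Path(tmp) / "output.s"
        src.write_text(c_source)
        subprocess.run(
            compile_command(src, asm, compiler, opt),
            check=True,
            capture_output=True,
        )
        return {
            "source": c_source,
            "assembly": asm.read_text(),
            "compiler": compiler,
            "opt": opt,
        }


if __name__ == "__main__":
    pair = make_pair("int f(int x) { return x * 14; }\n")
    if pair is None:
        print("no gcc found, nothing compiled")
    else:
        print(pair["assembly"])
```

Running the same snippet at several optimization levels would give multiple assembly targets for one source, which is roughly the kind of variety a decompiler model would have to learn to undo.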
I've always wondered about this. Compilers do a LOT of irreversible stuff. For example, symbol names usually aren't needed (unless you have a reflective language).
Where AI would really shine is reversing the (only seemingly reversible) optimizations. For example, GCC converts "x * 14" into "(x << 4) - x - x". Of course, you can never be 100% sure the programmer didn't actually want "shift left by four followed by two subtractions", but I'm convinced that 99% of the code I write is fairly predictable and statistically similar to whatever giant codebase you train it on.
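As a toy illustration of that example (in Python rather than C, so it ignores fixed-width overflow), the shifted form really is the same function as the multiplication, and the multiplier can be read back from the shift amount and the number of subtractions. The function names are hypothetical, and this is a hand-written pattern rule, not anything an actual decompiler is known to use.

```python
def times14_direct(x):
    return x * 14


def times14_shifted(x):
    # 16x - x - x = 14x
    return (x << 4) - x - x


def recover_multiplier(shift, subtractions):
    """Undo the rewrite: (x << shift) minus x `subtractions` times
    equals x * (2**shift - subtractions)."""
    return (1 << shift) - subtractions


if __name__ == "__main__":
    # Both forms agree on a range of inputs, including negatives.
    for x in range(-1000, 1001):
        assert times14_direct(x) == times14_shifted(x)
    print(recover_multiplier(4, 2))
```

The rule recovers the constant, but it cannot tell whether the programmer wrote the multiplication or the shifts; that is the part where a statistical model of "what people usually write" would come in.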
Throwing AI at the problem might not actually be the worst suggestion. I wonder how the likes of Copilot model the AST. Heh, you might even be able to build an approximation of a compiler using AI.
...which don't have binaries. It's easier for Godbolt, since the whole purpose of the website is to compile and show output. If you crawl GitHub, you need to compile the projects yourself, which is much more difficult.
Binaries are freely available from package management repos, with the benefit of having a known toolchain you can tag your ML inputs with. All the package managers I've worked with have a strongly structured "upstream" or "repo" field or similar that you can use to get to the source.