
GPT-4 cannot extrapolate real information from less. Reverse engineering large obfuscated binaries is like unpixelating an image. It’s just guessing.

Still immensely useful, but it does have its limits.



TBH, an LLM might be decent at code identification -- looking at some assembly and saying "that looks like a CRC32 hash", for example. That's a task that dovetails fairly well with its strong pattern-matching abilities. Making larger statements about the structure and function of an entire application is probably beyond it, though.

Moreover, it's likely to fail in any sort of adversarial scenario. If you show it a function with some loops that XORs an input against 0xEDB88320, for example, it would probably confidently identify that function as CRC32, even if it's actually something else which happens to use the same constant.


All the real information is already in the binary; no guessing is necessary. It takes data, processes it through a set of defined steps, and outputs it. The C code, the assembly code, and the obfuscated assembly code all express the same fundamental conceptual object.

If you have a good enough model with a large enough token window to grasp the entire binary, it will see all of those relations easily. GPT-4 already demonstrates ability in reverse engineering, and GPT-5 is underway; if it is as powerful a generational jump as 3 to 4 was, it will advance these abilities tremendously.



