One trick could be to run the program in a "lazy" way. E.g. don't run a statement like "a=b+c", but evaluate it only when a is needed. This would require a complete and automatic rewrite at the assembly level, but you wouldn't be doing anything that you don't need. Then from this you could determine the "hot" paths, and optimize those for speed (translate back into non-lazy form).