If you haven't already, LuaJIT's source code (and Mike Pall when asked) is a treasure trove of speedup ideas.
One idea that stood out to me (and which I first saw in LuaJIT, and as far as I know originated with Pall) is: when rewriting loop code, unroll at least 2 iterations of the loop. (The first executes and conditionally continues into the second; the second loops onto itself). So far, just extra work.
However, any kind of constant folding algorithm is now immediately elevated into a "code hoisting out of loop" algorithm at no extra cost - e.g., SSA form gets that kind of code motion.
I'm not sure Python can make much use of that, because it is nearly impossible to guarantee idempotence of operations - but in case you can somehow make that guarantee, that can be very significant for e.g. function name lookups.
A possible way to use that is to have the loop opcode have two branch targets: "namespaces modified" (which goes to the first iteration, which reloads values) and "namespaces unmodified" (which loops at the 2nd iteration, relying on the constant folding and not looking up in dicts again). This could make calls like "a.b.c.d.e.f" require 0 lookups in most iterations of most loops -- but would also require a global "namespace modified" flag.
We have this optimization in PyPy (so yes, Python can do it). There is even a paper [0]. LuaJIT is a great source of inspiration but Python is just a huge language, which makes it much harder.
One idea that stood out to me (and which I first saw in LuaJIT, and as far as I know originated with Pall) is: when rewriting loop code, unroll at least 2 iterations of the loop. (The first executes and conditionally continues into the second; the second loops onto itself). So far, just extra work.
However, any kind of constant folding algorithm is now immediately elevated into a "code hoisting out of loop" algorithm at no extra cost - e.g., SSA form gets that kind of code motion.
I'm not sure Python can make much use of that, because it is nearly impossible to guarantee idempotence of operations - but in case you can somehow make that guarantee, that can be very significant for e.g. function name lookups.
A possible way to use that is to have the loop opcode have two branch targets: "namespaces modified" (which goes to the first iteration, which reloads values) and "namespaces unmodified" (which loops at the 2nd iteration, relying on the constant folding and not looking up in dicts again). This could make calls like "a.b.c.d.e.f" require 0 lookups in most iterations of most loops -- but would also require a global "namespace modified" flag.