I think that's often true (one common example is the STL's std::sort versus the C stdlib's qsort(): std::sort is often a big win because the datatype-specific comparison operator gets inlined), but I think there are quite a few cases where the object code bloat you get from multiplying the code by the number of types it's instantiated for (vs. using a single polymorphic/generic function) hurts your instruction cache more than enough to cancel out any optimization win.
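To make the tradeoff concrete, here's a rough sketch of the two styles side by side. The helper names (compare_ints, sort_with_qsort, sort_with_std, both_sorts_agree) are made up for illustration, and whether the comparison actually gets inlined depends on your compiler and flags, so treat this as a picture of the idea rather than a benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Comparator for qsort. It is called through a function pointer, so the
// compiler usually cannot inline it into the sorting loop.
static int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow that x - y could cause
}

// Generic approach: one shared copy of the sort code serves every element
// type, at the cost of an indirect call per comparison.
void sort_with_qsort(std::vector<int>& v) {
    if (v.empty()) return;  // do not hand qsort a possibly-null pointer
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
}

// Template approach: each distinct T gets its own instantiation of the sort,
// which lets operator< be inlined but adds object code per type.
template <typename T>
void sort_with_std(std::vector<T>& v) {
    std::sort(v.begin(), v.end());
}

// Sanity check that both approaches produce the same ordering.
bool both_sorts_agree(std::vector<int> v) {
    std::vector<int> w = v;
    sort_with_qsort(v);
    sort_with_std(w);
    assert(std::is_sorted(w.begin(), w.end()));
    return v == w;
}
```

Calling sort_with_std on a std::vector<int>, a std::vector<double>, and a std::vector of some struct gives you three separate copies of the sort in the binary, while the qsort version stays at one copy no matter how many comparators you write. That per-type duplication is the bloat I mean.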