Instruction Cache and TLB trashing is an often overlooked consequence of code bloat and sometimes of overly aggressive micro-benchmark driven optimization.
Reorganizing the binary is an interesting approach to minimize the cost, but I think that any performance oriented developer should keep in mind that most projects are rarely dependent on a single hot loop but on many systems working together and competing for space in the cache(s).
I generally use -Os instead of -O2 and -O3 in my projects, while trying to reduce code bloat to a minimum for that reason.