The Metrowerks profiler and linker worked together to optimize locality in the binary, with a focus on PowerPC code. The linker could generate the static call tree, while the profiler could generate a dynamic call tree of what was actually called. The goal was to separate the cold portions of the call tree into parts of the executable that never got paged in.
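As a rough illustration of that idea (not the actual Metrowerks implementation), here is a minimal Python sketch. It assumes the static call tree is reduced to a plain list of function names and the dynamic profile to a dict of call counts; the names `split_hot_cold` and `layout` are hypothetical.

```python
def split_hot_cold(static_functions, call_counts):
    """Partition functions into hot (called at run time) and cold (never called).

    static_functions: every function the linker knows about (assumed input).
    call_counts: function name -> number of calls seen by the profiler.
    """
    hot = [f for f in static_functions if call_counts.get(f, 0) > 0]
    cold = [f for f in static_functions if call_counts.get(f, 0) == 0]
    # Most frequently called functions first, ties broken by name.
    hot.sort(key=lambda f: (-call_counts[f], f))
    cold.sort()
    return hot, cold


def layout(static_functions, call_counts):
    """Return a link order: hot functions packed together, cold ones at the end."""
    hot, cold = split_hot_cold(static_functions, call_counts)
    return hot + cold


if __name__ == "__main__":
    functions = ["main", "parse", "render", "print_help", "crash_report"]
    profile = {"main": 1, "parse": 500, "render": 1200}
    print(layout(functions, profile))
```

With the cold functions grouped at the end, the pages that hold them may never need to be loaded during a typical run.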

I worked on the profiler, and I seem to remember that Microsoft was one of the developers that put a lot of effort into using this to optimize the Office suite on the Mac. I remember that the release of Word that used it was snappier.

11. ndesaulniers ◴[06 Nov 24 20:39 UTC] No.42068991[source]▶

>>42061819 #

Note: that's autoFDO+propeller. This article is about BOLT.

replies(1): >>42074751 #

12. Cieric ◴[06 Nov 24 20:41 UTC] No.42069021[source]▶

>>42065443 #

Some Google searching brought this up: https://learn.microsoft.com/en-us/cpp/build/profile-guided-o... I'm only reading over it now, but I'm going to test it out a bit when I can.

replies(1): >>42071233 #

13. ◴[06 Nov 24 20:55 UTC] No.42069221[source]▶

>>42061819 #

14. stephc_int13 ◴[06 Nov 24 21:52 UTC] No.42070059[source]▶

>>42005429 (OP) #

Instruction cache and TLB thrashing is an often-overlooked consequence of code bloat and sometimes of overly aggressive micro-benchmark-driven optimization.

Reorganizing the binary is an interesting approach to minimize the cost, but I think that any performance-oriented developer should keep in mind that most projects depend not on a single hot loop but on many systems working together and competing for space in the cache(s).

I generally use -Os instead of -O2 and -O3 in my projects, while trying to reduce code bloat to a minimum for that reason.

15. dwattttt ◴[06 Nov 24 23:32 UTC] No.42071233{3}[source]▶

>>42069021 #

PGO describes using extra data to guide optimisations, but it doesn't define what those optimisations are.

Reading the link, there are several that sound like they match what BOLT is applying (Basic Block Optimization, Function Layout, Conditional Branch Optimization, and Dead Code Separation).

16. kardos ◴[06 Nov 24 23:33 UTC] No.42071248{3}[source]▶

>>42067765 #

Thank you!

17. neerajsi ◴[07 Nov 24 00:39 UTC] No.42071879[source]▶

>>42065443 #

Microsoft had internal tooling very similar to BOLT almost 20 years ago. Most of those optimizations were moved to the compiler in LTCG mode with PGO.

18. Iwan-Zotow ◴[07 Nov 24 01:03 UTC] No.42072091[source]▶

>>42066401 #

Same in MS-DOS: you have far and near pointer modifiers.

19. yxhuvud ◴[07 Nov 24 04:43 UTC] No.42073492[source]▶

>>42005429 (OP) #

So am I blind or does it not mention the results? Was the result a faster kernel? How big was the difference?

replies(1): >>42073552 #

20. jeffbee ◴[07 Nov 24 04:59 UTC] No.42073552[source]▶

>>42073492 #

In the actual conference presentation they mention ~2% efficiency gains in a few internal storage systems.

21. teo_zero ◴[07 Nov 24 06:57 UTC] No.42074191[source]▶

>>42066401 #

Not only jumps. The Motorola 68000 has a relative addressing mode where any sufficiently near address can be expressed as PC+offset. Offset is 16 bits, thus covering a local range of ±32kB, with the additional benefit of being position-independent, a valuable feature for systems without virtual memory.
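A small Python sketch of the range check and of why the encoding is position-independent. This is a simplification: it treats `pc` as whatever base address the hardware adds the displacement to and ignores the exact instruction encoding; the function names are hypothetical.

```python
def fits_pc_relative(pc, target):
    """True if target is reachable from pc with a signed 16-bit displacement."""
    disp = target - pc
    return -32768 <= disp <= 32767


def encode_disp16(pc, target):
    """Return the displacement as a 16-bit two's-complement value."""
    disp = target - pc
    if not -32768 <= disp <= 32767:
        raise ValueError("target out of range for a 16-bit PC-relative offset")
    return disp & 0xFFFF


if __name__ == "__main__":
    # The same code loaded at two different base addresses (assumed values).
    for base in (0x0000, 0x40000):
        pc = base + 0x1000
        target = base + 0x0FF0
        print(hex(encode_disp16(pc, target)))
```

Because only the difference between the two addresses is encoded, moving code and data together leaves the displacement unchanged, so no relocation is needed at load time.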

Having learned to program for the Amiga before Intel-based PCs, I was shocked when I realized that the latter are missing that basic feature and position-independent executables must go through run-time relocation!

22. BSDobelix ◴[07 Nov 24 08:28 UTC] No.42074751{3}[source]▶

>>42068991 #

>>BOLT has also recently added support for the kernel.
