
164 points mpweiher | 3 comments
procrast33 No.45114870
I am curious why the TPDE paper does not mention the Copy-and-Patch paper. That technique uses LLVM ahead of time to generate a library of patchable machine-code snippets; during actual compilation, those snippets are simply copied and pasted together. In fairness, it is just a proof of concept: it could compile WebAssembly to x86-64, but not C or C++.

I have no relation to the authors.

https://fredrikbk.com/publications/copy-and-patch.pdf
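As a toy sketch of the idea (my own illustration, not code from the paper; the stencil set and helper names are made up, though the byte sequences are real x86-64 encodings):

```python
# Toy simulation of copy-and-patch: each "stencil" is a pre-compiled byte
# template containing hole markers. "Compiling" an expression means copying
# templates and patching the holes with concrete values -- no instruction
# selection or encoding happens at compile time.
HOLE = b"\xde\xad\xbe\xef\xde\xad\xbe\xef"  # 8-byte placeholder baked into stencils

# A tiny hand-written stencil "library". In the actual technique these
# templates are generated ahead of time by LLVM from C++ snippets.
STENCILS = {
    "load_const": b"\x48\xb8" + HOLE,                    # movabs rax, <imm64>
    "add_const":  b"\x48\xb9" + HOLE + b"\x48\x01\xc8",  # movabs rcx, <imm64>; add rax, rcx
    "ret":        b"\xc3",                               # ret
}

def patch(stencil: bytes, value: int) -> bytes:
    """Copy a stencil and fill its hole with a concrete 64-bit constant."""
    return stencil.replace(HOLE, value.to_bytes(8, "little"))

def compile_sum(consts):
    """'Compile' sum(consts) by pasting patched snippets back to back."""
    code = patch(STENCILS["load_const"], consts[0])
    for c in consts[1:]:
        code += patch(STENCILS["add_const"], c)
    return code + STENCILS["ret"]

code = compile_sum([2, 40])
assert HOLE not in code  # every hole was patched
print(code.hex())
```

The appeal is that compile time is dominated by memcpy-like work; the cost, as discussed below, is that each snippet is compiled in isolation.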

replies(1): >>45114950 #
aengelke No.45114950
There's a longer paragraph on that topic in Section 8. We also previously built an LLVM back-end using that approach [1]. While that approach leads to even faster compilation, run-time performance is much worse (2.5x slower than LLVM -O0) due to more-or-less impossible register allocation for the snippets.

[1]: https://home.cit.tum.de/~engelke/pubs/2403-cc.pdf

replies(3): >>45115197 #>>45115594 #>>45124559 #
1. debugnik No.45115197
> run-time performance is much worse (2.5x slower than LLVM -O0)

How come? The Copy-and-Patch Compilation paper reports:

> The generated code runs [...] 14% faster than LLVM -O0.

I don't have time right now to compare your approach and benchmark to theirs, but I would have expected comparable performance from what I had read back then.

replies(2): >>45116338 #>>45116367 #
2. t0b1 No.45116338
This relates to their TPC-H benchmark, and the discrepancy can have a variety of causes. My guess would be that they can generate stencils for whole operators, which can be transformed into more efficient code at stencil-generation time, while LLVM -O0 gets the operator in LLVM-IR form and can do no such transformation. I can't verify this, though, because their benchmark setup seems a bit more involved.

When used in a C/C++ compiler, the stencils correspond to individual (or a few) LLVM-IR instructions, which leads to bad runtime performance. Also, as mentioned, register allocation becomes a problem for the copy-and-patch approach on larger functions.
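To illustrate the register-allocation point (my own sketch, not from the thread; the instruction sequences are hypothetical): each per-instruction stencil only knows a fixed location for its operands, so chaining stencils forces intermediate values through that location, while a back-end that allocates registers across the whole function avoids the round trips.

```python
# What a per-instruction stencil compiler might emit for t = a + b; u = t + c,
# handing every intermediate value over through rax and a stack slot
# (hypothetical sequence for illustration):
stencil_code = [
    "mov rax, [a]",   # stencil 1: load a
    "add rax, [b]",   # stencil 2: add b
    "mov [t], rax",   # stencil 2: spill result to t's stack slot
    "mov rax, [t]",   # stencil 3: reload t (redundant round trip)
    "add rax, [c]",   # stencil 3: add c
    "mov [u], rax",   # stencil 3: spill result
]

# What a back-end with even simple function-wide register allocation can emit:
regalloc_code = [
    "mov rax, [a]",
    "add rax, [b]",   # t stays live in rax
    "add rax, [c]",   # u stays live in rax
    "mov [u], rax",
]

def mem_operands(code):
    """Count instructions that touch memory (bracketed operands)."""
    return sum("[" in line for line in code)

print(mem_operands(stencil_code), "vs", mem_operands(regalloc_code))  # -> 6 vs 4
```

The gap grows with function size, which fits the observation that register allocation becomes the bottleneck on larger functions.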

3. aengelke No.45116367
The paper is rather selective about its benchmarks and baselines. They do two comparisons against LLVM -- three microbenchmarks and a re-implementation of a few rather simple database queries -- and have written all benchmarks themselves through their own framework. These benchmarks start from their custom AST data structures, and they have their own way of generating LLVM-IR. For the non-optimizing LLVM back-end, performance obviously depends strongly on how the IR is generated -- they might not have put a lot of effort into generating "good IR" (i.e., IR similar to what Clang generates).

The fact that they don't compare against LLVM on larger benchmarks/functions, or on any code they haven't written themselves, makes that single number rather questionable as support for a general claim of being faster than LLVM -O0.