Improving performance of rav1d video decoder

(ohadravid.github.io)

305 points todsacerdoti | 1 comments | 22 May 25 11:59 UTC


mmastrac ◴[22 May 25 13:05 UTC] No.44061671[source]▶

>>44061160 (OP) #

The associated issue for comparing two u16s is interesting.

https://github.com/rust-lang/rust/issues/140167

replies(3): >>44061906 #>>44065911 #>>44066028 #

ack_complete ◴[22 May 25 19:46 UTC] No.44066028[source]▶

>>44061671 #

I'm surprised there's no mention of store forwarding in that discussion. The -O3 codegen is bonkers, but the -O2 output is reasonable. In the case where one of the structs has just been computed, attempting to load it as a single 32-bit load can result in a store forwarding failure that would negate the benefit of merging the loads. In a non-inlined, non-PGO scenario the compiler doesn't have enough information to tell whether the optimization is suitable.
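To make the trade-off concrete, here's a rough sketch of the two comparison shapes being discussed. `Pair`, `eq_fieldwise`, `eq_merged` and `fill` are hypothetical names made up for illustration, not code from the issue or from rav1d, and the optimizer is free to turn either form into something else entirely.

```rust
/// Hypothetical stand-in for a struct of two u16 fields (an assumption,
/// not the exact type from the linked issue).
#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct Pair {
    pub a: u16,
    pub b: u16,
}

/// Field-by-field comparison: as written, each field is read and
/// compared separately.
pub fn eq_fieldwise(x: &Pair, y: &Pair) -> bool {
    x.a == y.a && x.b == y.b
}

/// "Merged" comparison: pack both fields into one u32 and compare once.
/// This is the source-level analogue of the single 32-bit load.
pub fn eq_merged(x: &Pair, y: &Pair) -> bool {
    let pack = |p: &Pair| u32::from(p.a) | (u32::from(p.b) << 16);
    pack(x) == pack(y)
}

/// Producer that, as written, stores the two halves separately.
/// If the stores really stay narrow and a wide load of the same memory
/// follows right away, that is the pattern that can fail to forward on
/// some microarchitectures.
#[inline(never)]
pub fn fill(out: &mut Pair, a: u16, b: u16) {
    out.a = a;
    out.b = b;
}
```

Both functions return the same answer for every input; the only question is which load pattern is cheaper right after something like `fill` has written the struct, and that depends on information the compiler usually doesn't have at the comparison site.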

replies(2): >>44069905 #>>44070022 #

1. mshockwave ◴[23 May 25 05:08 UTC] No.44070022[source]▶

>>44066028 #

> In the case where one of the structs has just been computed, attempting to load it as a single 32-bit load can result in a store forwarding failure

It actually depends on the uArch; Apple silicon doesn't seem to have this restriction: https://news.ycombinator.com/item?id=43888005

> In a non-inlined, non-PGO scenario the compiler doesn't have enough information to tell whether the optimization is suitable.

I guess you're talking about stores and loads across function boundaries?

Trivia: LLVM's X86 backend has a whole pass just to prevent this partial-store-to-load issue on Intel CPUs: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...
