
    305 points todsacerdoti | 16 comments
    1. mmastrac ◴[] No.44061671[source]
    The associated issue for comparing two u16s is interesting.

    https://github.com/rust-lang/rust/issues/140167

    replies(3): >>44061906 #>>44065911 #>>44066028 #
    2. heybales ◴[] No.44061906[source]
    The thing I like most about this is that the discussion isn't just 14 pages of "I'm having this issue as well" and "Any updates on when this will be fixed?" As a web dev, GitHub issues kinda suck.
    replies(2): >>44063190 #>>44073866 #
    3. eterm ◴[] No.44063190[source]
    It was worse before emoji reactions were added, when 90% of messages were literally just "+1".
    replies(1): >>44064094 #
    4. heybales ◴[] No.44064094{3}[source]
    +1
    5. rhdjsjebshjffn ◴[] No.44065911[source]
    This just seems to illustrate the complexity of compiler authorship. I'm not sure C compilers are able to address this issue any better in the general case.
    replies(2): >>44066162 #>>44066204 #
    6. ack_complete ◴[] No.44066028[source]
    I'm surprised there's no mention of store forwarding in that discussion. The -O3 codegen is bonkers, but the -O2 output is reasonable. In the case where one of the structs has just been computed, attempting to load it as a single 32-bit load can result in a store forwarding failure that would negate the benefit of merging the loads. In a non-inlined, non-PGO scenario the compiler doesn't have enough information to tell whether the optimization is suitable.
    replies(2): >>44069905 #>>44070022 #
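    The store-forwarding hazard described above can be sketched in C (the names and values here are illustrative, not from the thread): a struct is written with two 16-bit stores and then immediately read back with one 32-bit load, which is the access pattern that can miss the store-to-load forwarding fast path on some microarchitectures because the wide load does not match any single in-flight store.

    ```c
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    typedef struct { uint16_t a, b; } pair;

    /* Two narrow stores followed by one wide load of the same bytes:
       the pattern that can defeat store-to-load forwarding. */
    static uint32_t store_then_wide_load(uint16_t a, uint16_t b) {
        pair p;
        p.a = a;                         /* 16-bit store */
        p.b = b;                         /* 16-bit store */
        uint32_t word;
        memcpy(&word, &p, sizeof word);  /* 32-bit load spanning both stores */
        return word;
    }

    int main(void) {
        /* 0x22221111 on a little-endian machine */
        printf("0x%08x\n", (unsigned)store_then_wide_load(0x1111, 0x2222));
        return 0;
    }
    ```

    The result is correct either way; the cost, when the forwarding fast path is missed, is extra latency on the load, which is the trade-off being weighed against merging the two loads.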
    7. runevault ◴[] No.44066162[source]
    Keep in mind Rust is using the same backend as one of the main C compilers, LLVM. So if it is handling it any better that means the Clang developers handle it before it even reaches the shared LLVM backend. Well, or there is something about the way Clang structures the code that catches a pattern in the backend the Rust developers do not know about.
    replies(1): >>44068937 #
    8. vlovich123 ◴[] No.44066204[source]
    The Rust issue has people trying this with C code, and the compiler exhibits the same issue. This will get fixed, and it'll help both C and Rust code.
    replies(1): >>44068993 #
    9. rhdjsjebshjffn ◴[] No.44068937{3}[source]
    I mean yeah, I just view Rust as the quality-oriented spear of western development.

    Rust is absolutely an improvement over C in every way.

    10. runevault ◴[] No.44068993{3}[source]
    Out of curiosity, just Clang, or gcc as well?
    replies(1): >>44072736 #
    11. Dylan16807 ◴[] No.44069905[source]
    > In the case where one of the structs has just been computed, attempting to load it as a single 32-bit load can result in a store forwarding failure that would negate the benefit of merging the loads

    Would that failure be significantly worse than separate loading?

    Just negating the optimization wouldn't be much reason against doing it. A single load is simpler and in the general case faster.

    replies(2): >>44078234 #>>44084378 #
    12. mshockwave ◴[] No.44070022[source]
    > In the case where one of the structs has just been computed, attempting to load it as a single 32-bit load can result in a store forwarding failure

    It actually depends on the uArch; Apple silicon doesn't seem to have this restriction: https://news.ycombinator.com/item?id=43888005

    > In a non-inlined, non-PGO scenario the compiler doesn't have enough information to tell whether the optimization is suitable.

    I guess you're talking about stores and load across function boundaries?

    Trivia: X86 LLVM creates a whole Pass just to prevent this partial-store-to-load issue on Intel CPUs: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...

    13. josephg ◴[] No.44072736{4}[source]
    I just tried it, and the problem is even worse in gcc.

    Given this C code:

    #include <stdint.h>

        typedef struct { uint16_t a, b; } pair;
    
        int eq_copy(pair a, pair b) {
            return a.a == b.a && a.b == b.b;
        }
        int eq_ref(pair *a, pair *b) {
            return a->a == b->a && a->b == b->b;
        }
    
    Clang generates clean code for the eq_copy variant, but complex code for the eq_ref variant. Gcc emits pretty complex code in both variants.

    For example, here's eq_ref from gcc -O2:

        eq_ref:
            movzx   edx, WORD PTR [rsi]
            xor     eax, eax
            cmp     WORD PTR [rdi], dx
            je      .L9
            ret
        .L9:
            movzx   eax, WORD PTR [rsi+2]
            cmp     WORD PTR [rdi+2], ax
            sete    al
            movzx   eax, al
            ret
    
    Have a play around: https://c.godbolt.org/z/79Eaa3jYf
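    A standard idiom for getting the merged comparison today (not from the thread, just a common workaround) is `memcmp` over the whole struct, which gcc and clang typically lower to a single 32-bit load-and-compare for this 4-byte layout:

    ```c
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    typedef struct { uint16_t a, b; } pair;

    /* memcmp over the whole struct; compilers typically emit one
       32-bit compare here. Caveat: this also compares any padding
       bytes, so it is only safe when the struct has none (this one
       is 4 bytes with no padding). */
    int eq_memcmp(const pair *a, const pair *b) {
        return memcmp(a, b, sizeof *a) == 0;
    }

    int main(void) {
        pair x = {1, 2}, y = {1, 2}, z = {1, 3};
        printf("%d %d\n", eq_memcmp(&x, &y), eq_memcmp(&x, &z)); /* 1 0 */
        return 0;
    }
    ```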
    14. NoMoreNicksLeft ◴[] No.44073866[source]
    Wonder if it's a poor interface issue... if people could click a "me too" button that didn't add a full comment, just some minimal notation with their username at the bottom of the comment, 1) would people use it, and 2) would it be unobtrusive enough not to be annoying? It could even mute notifications for the me-toos.
    replies(1): >>44134415 #
    15. ack_complete ◴[] No.44084378{3}[source]
    Usually, yeah, it's noticeably worse than using individual loads and stores as it adds around a dozen cycles of latency. This is usually enough for the load to light up hot in a sampling profile. It's possible for that extra latency to be hidden, but then in that case the extra loads/stores wouldn't be an issue either.
    16. IshKebab ◴[] No.44134415{3}[source]
    This seems like an area where LLMs would actually be extremely useful. You can manually mark comments as irrelevant. Why can't GitHub use AI to do it automatically? Or to highlight the "resolution" comment automatically? On very big issues it can take a non-trivial amount of time just to find out what the outcome was.