Improving performance of rav1d video decoder

1. tialaramex ◴[22 May 25 13:07 UTC] No.44061686[source]▶

All else being equal, codecs ought to be in WUFFS† rather than Rust, but I can well imagine that it's a much bigger lift to take something as complicated as dav1d and write the analogous WUFFS than to clean up the c2rust translation. If you said it was a thousand times harder, I'd have no trouble believing that. I just think it's worth it for us as a civilisation.

† Or an equivalent special purpose language, but WUFFS is right there

replies(1): >>44061961 #

2. IgorPartola ◴[22 May 25 13:40 UTC] No.44061961[source]▶

>>44061686 (TP) #

WUFFS would be great for parsing container files (Matroska, webm, mp4) but it does not seem at all suitable for a video decoder. Without dynamic memory allocation it would be challenging to deal with dynamic data. Video codecs are not simply parsing a file to get the data, they require quite a bit of very dynamic state to be managed.

replies(1): >>44062041 #

3. lubesGordi ◴[22 May 25 13:49 UTC] No.44062041[source]▶

>>44061961 #

Requiring dynamic state seems not obvious to me. At the end of the day you have a fixed number of pixels on the screen. If every single pixel changes from frame to frame that should constitute the most work your codec has to do, no? I'm not a codec writer but that's my intuition based on the assumption that codecs are basically designed to minimize the amount of 'work' being done from frame to frame.

replies(5): >>44062055 #>>44062122 #>>44062124 #>>44062182 #>>44063139 #

4. throwawaymaths ◴[22 May 25 13:50 UTC] No.44062055{3}[source]▶

>>44062041 #

compression algorithms can get very clever in recursive ways

5. dylan604 ◴[22 May 25 13:58 UTC] No.44062122{3}[source]▶

>>44062041 #

Maybe you're not familiar with how long-GOP encoding works with IPB frames? If all frames were I-frames, maybe what you're thinking might work. Everything you need is in the one frame to be able to describe every single pixel in that frame. Once you start using P-frames, you have to hold on to data from the I-frame to decode the P-frame. With B-frames, you might need data from frames not yet decoded, as they are bidirectional references.

replies(1): >>44063338 #

6. zimpenfish ◴[22 May 25 13:58 UTC] No.44062124{3}[source]▶

>>44062041 #

> codecs are basically designed to minimize the amount of 'work' being done from frame to frame

But to do that they have to keep state and do computations on that state. If you've got frame 47 being a P frame, that means you need frame 46 to decode it correctly. Or frame 47 might be a B frame in which case you need frame 46 and possibly also frame 48 - which means you're having to unpack frames "ahead" of yourself and then keep them around for the next decode.

I think that all counts as "dynamic state"?
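A toy sketch of that reordering, assuming a simplified classic GOP model in which every B frame references the nearest I or P frame on each side (real codecs, AV1 included, are more flexible than this). `FrameType` and `decode_order` are hypothetical names chosen for illustration only:

```rust
/// Frame kinds in a simplified GOP model (assumption: classic I/P/B only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum FrameType {
    I,
    P,
    B,
}

/// Given frame types in display order, return the display indices in the
/// order a decoder has to process them. A B frame is held back until the
/// next I or P frame (its forward reference) has been decoded.
fn decode_order(display: &[FrameType]) -> Vec<usize> {
    let mut order = Vec::with_capacity(display.len());
    let mut pending_b = Vec::new();
    for (idx, ty) in display.iter().enumerate() {
        match ty {
            FrameType::B => pending_b.push(idx),
            FrameType::I | FrameType::P => {
                // The reference frame goes first, then the B frames
                // that were waiting for it.
                order.push(idx);
                order.append(&mut pending_b);
            }
        }
    }
    // Trailing B frames with no later reference are emitted as-is
    // (a simplification of this toy model).
    order.append(&mut pending_b);
    order
}
```

For the display sequence I B B P B P this yields the decode order 0, 3, 1, 2, 5, 4: frame 3 has to be unpacked and kept around before frames 1 and 2 can be decoded.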

replies(1): >>44063686 #

7. IgorPartola ◴[22 May 25 14:04 UTC] No.44062182{3}[source]▶

>>44062041 #

If you are doing something like a GIF or an MJPEG, sure. If you are doing forwards and backwards keyframes with a variable amount of deltas in between, with motion estimation, with grain generation, you start having a very dynamic amount of state. Granted, encoders are more complex than decoders in some of this. But still you might need to decode between 1 and N frames to get the frame you want, and you don't know how much memory it will consume once it is decoded unless you decode it into bitmaps (at 4k that would be over 8MB per frame which very quickly runs out of memory for you if you want any sort of frame buffer present).

I suspect the future of video compression will also include frame generation, like what is currently being done for video games. Essentially you have let's say 12 fps video but your video card can fill in the intermediate frames via what is basically generative AI so you get 120 fps output with smooth motion. I imagine that will never be something that WUFFS is best suited for.

replies(3): >>44062920 #>>44063296 #>>44063827 #

8. derf_ ◴[22 May 25 15:15 UTC] No.44062920{4}[source]▶

>>44062182 #

> But still you might need to decode between 1 and N frames to get the frame you want, and you don't know how much memory it will consume...

All of these things are bounded for actual codecs. AV1 allows storing at most 8 reference frames. The sequence header will specify a maximum allowable resolution for any frame. The number of motion vectors is fixed once you know the resolution. Film grain requires only a single additional buffer. There are "levels" specified which ensure interoperability at common operating points (e.g., 4k) without even relying on the sequence header (you just reject sequences that fall outside the limits). Those are mostly intended for hardware, but there is no reason a software decoder could not take advantage of them. As long as codecs are designed to be implemented in hardware, this will be possible.
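A back-of-the-envelope sketch of what such a bound could look like. It assumes 8 reference slots plus one frame being decoded plus one film grain buffer, as described above, and it ignores padding, alignment, motion vector storage, and any other per-frame metadata a real decoder keeps. `SequenceLimits`, `frame_bytes`, and `worst_case_pool_bytes` are hypothetical names, not part of dav1d or rav1d:

```rust
/// Maximum number of stored reference frames (figure taken from the comment above).
const MAX_REF_FRAMES: u64 = 8;

/// Hypothetical summary of the limits a sequence header would announce.
struct SequenceLimits {
    max_width: u64,
    max_height: u64,
    /// 1 for 8-bit samples, 2 for 10- or 12-bit samples stored in 16 bits.
    bytes_per_sample: u64,
    /// Chroma subsampling shifts: (1, 1) is 4:2:0, (0, 0) is 4:4:4.
    chroma_shift_x: u32,
    chroma_shift_y: u32,
}

/// Bytes for one decoded frame: one luma plane plus two chroma planes.
fn frame_bytes(l: &SequenceLimits) -> u64 {
    let luma = l.max_width * l.max_height;
    // Round chroma dimensions up when the luma size is not a multiple
    // of the subsampling factor.
    let cw = (l.max_width + (1u64 << l.chroma_shift_x) - 1) >> l.chroma_shift_x;
    let ch = (l.max_height + (1u64 << l.chroma_shift_y) - 1) >> l.chroma_shift_y;
    (luma + 2 * cw * ch) * l.bytes_per_sample
}

/// Worst-case pixel storage: reference slots + current frame + film grain buffer.
fn worst_case_pool_bytes(l: &SequenceLimits) -> u64 {
    frame_bytes(l) * (MAX_REF_FRAMES + 1 + 1)
}
```

Under these assumptions, an 8-bit 4:2:0 frame at 3840x2160 takes 12,441,600 bytes, and the ten buffers together take 124,416,000 bytes. The point is that this number is known as soon as the limits are known, before a single frame is decoded.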

9. lubesGordi ◴[22 May 25 15:39 UTC] No.44063139{3}[source]▶

>>44062041 #

Hey maybe we can discuss why I'm being downvoted? This is a technical discussion and I'm contributing. If you disagree then say why. I'm not stating anything as fact that isn't fact. I am getting downvoted for asking a question.

10. lubesGordi ◴[22 May 25 15:54 UTC] No.44063296{4}[source]▶

>>44062182 #

See, this is interesting to me. I understand the desire to dynamically allocate buffers at runtime to capture variable-size deltas. That's cool, but also still maybe technically unnecessary? Because, like you say, at 4k and over 8MB per frame you still can't allocate past some limit, so a codec would likely have some bound set on that anyway. Why not just pre-allocate at compile time? For sure this results in a complex data structure, but functionally it could be the same, and we would elide the cost of dynamic memory allocations. What I'm suggesting is probably complex, I'm sure.

In any case I get what you're saying and I understand why codecs are going to be dynamically allocating memory, so thanks for that.
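A minimal sketch of the pre-allocation idea, assuming the slot count and per-frame size are known at compile time. `FramePool` is a hypothetical type for illustration and is not how dav1d or rav1d manage their buffers:

```rust
/// A fixed-capacity pool of frame buffers. Both the number of slots and
/// the size of each slot are compile-time constants, so the pool never
/// allocates from the heap after it has been created.
struct FramePool<const SLOTS: usize, const FRAME_BYTES: usize> {
    frames: [[u8; FRAME_BYTES]; SLOTS],
    in_use: [bool; SLOTS],
}

impl<const SLOTS: usize, const FRAME_BYTES: usize> FramePool<SLOTS, FRAME_BYTES> {
    /// Create a pool with every slot zeroed and free.
    const fn new() -> Self {
        Self {
            frames: [[0u8; FRAME_BYTES]; SLOTS],
            in_use: [false; SLOTS],
        }
    }

    /// Claim the first free slot, or return None when the pool is exhausted.
    fn acquire(&mut self) -> Option<usize> {
        let slot = self.in_use.iter().position(|used| !*used)?;
        self.in_use[slot] = true;
        Some(slot)
    }

    /// Return a slot to the pool so a later frame can reuse it.
    fn release(&mut self, slot: usize) {
        self.in_use[slot] = false;
    }

    /// Borrow the bytes of a slot for writing decoded pixels.
    fn frame_mut(&mut self, slot: usize) -> &mut [u8] {
        &mut self.frames[slot]
    }
}
```

With realistic frame sizes a pool like this would have to live in static storage rather than on the stack. The trade-off is that the worst case is always reserved, even for a stream that needs far less.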

11. lubesGordi ◴[22 May 25 15:59 UTC] No.44063338{4}[source]▶

>>44062122 #

Still you don't necessarily need to have dynamic memory allocations if the number of deltas you have is bounded. In some codecs I could definitely see those having a varying size depending on the amount of change going on in the scene.

I'm not a codec developer, I'm only coming at this from an outside/intuitive perspective. Generally, performance-conscious parties want to minimize heap allocations, so I'm interested in how this applies to codec architecture. Codecs seem so complex to me, with so much inscrutable shit going on, but then heap allocations aren't optimized out? Seems like there has to be a very good reason for this.

replies(2): >>44067703 #>>44067947 #

12. wtallis ◴[22 May 25 16:33 UTC] No.44063686{4}[source]▶

>>44062124 #

Memory usage can vary, but video codecs are designed to make it practical to derive bounds on those memory requirements because hardware implementations don't have the freedom to dynamically allocate more silicon.

replies(1): >>44068635 #

13. GuB-42 ◴[22 May 25 16:44 UTC] No.44063827{4}[source]▶

>>44062182 #

> I suspect the future of video compression will also include frame generation

That's how most video codecs work already. They try to "guess" what the next frame will be, based on past frames (for P-frames) or on both past and future frames (for B-frames). The difference is that the codec encodes some metadata to help with the process and also the difference between the predicted frame and the real frame.
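A one-dimensional toy sketch of "prediction plus difference", assuming a single motion vector for a whole row of 8-bit samples and edge clamping for out-of-range reads. Real codecs work on 2D blocks with sub-pixel interpolation; `reconstruct` is a hypothetical helper for illustration:

```rust
/// Rebuild a row of samples from a reference row, a motion vector (the
/// "metadata"), and a residual (the difference between the predicted and
/// the real samples).
fn reconstruct(reference: &[u8], mv: isize, residual: &[i16]) -> Vec<u8> {
    assert!(!reference.is_empty(), "reference row must not be empty");
    let last = reference.len() as isize - 1;
    residual
        .iter()
        .enumerate()
        .map(|(i, &r)| {
            // Prediction: read the reference at the displaced position,
            // clamping to the edge when the motion vector points outside.
            let src = (i as isize + mv).clamp(0, last) as usize;
            let predicted = reference[src] as i32;
            // Correction: add the residual and clamp to the 8-bit range.
            (predicted + r as i32).clamp(0, 255) as u8
        })
        .collect()
}
```

When the prediction is good the residual is mostly zeros, which is what makes it cheap to encode.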

As for using AI techniques to improve prediction, it is not a new thing at all. Many algorithms optimized for compression ratio use neural nets, but these tend to be too computationally expensive for general use. In fact the Hutter prize considers text compression as an AI/AGI problem.

14. Sesse__ ◴[22 May 25 22:11 UTC] No.44067703{5}[source]▶

>>44063338 #

The very good reason is that there's simply not a lot of heap allocations going on. It's easy to check; run perf against e.g. ffmpeg decoding a big file to /dev/null, and observe the distinct lack of malloc high up in the profile.

There's a heck of a lot of distance from “not a lot” to “zero”, though.

15. izacus ◴[22 May 25 22:42 UTC] No.44067947{5}[source]▶

>>44063338 #

You're actually right about allocation - most video codecs are written with hardware decoders in mind which have fixed memory size. This is why their profiles hard limit the memory constraints needed for decode - resolution, number of reference frames, etc.

That's not quite the case for encoding - that's where things get murky, since you have way more freedom in what you can do to compress better.

16. ◴[23 May 25 00:36 UTC] No.44068635{5}[source]▶

>>44063686 #