Some of the most fun I've had programming assembly has been writing HDMI video scanout kernels for an RP2040 chip[1]. It was a delightful puzzle to make every single cycle count. There is a great sense of satisfaction in using every one of the 8 "low" registers (the other 8 "high" registers generally cost one more cycle, to move the value into a low register, but there are exceptions such as add and compare where they can be used for free; thus you almost always use a high register for the loop termination comparison). Most satisfying, you can cycle-count and predict the performance very accurately, which is not at all true on modern 64-bit processors. These video kernels could not be written in Rust or C with anywhere near the same performance. Also, in general, Rust compiles to pretty verbose code, which matters a lot when you have limited memory.
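To make the cycle-counting point concrete, here is a rough sketch of the arithmetic involved. It is not code from the project: the `Op` classes, the `EXAMPLE_BODY` loop, and the clock figures are hypothetical, chosen only for illustration. The per-instruction costs are the usual Cortex-M0+ ones (1 cycle for ALU ops, 2 for a load or store to SRAM, 2 for a taken branch), and the model assumes no bus contention.

```rust
/// Coarse instruction classes for a cycle estimate on a Cortex-M0+ core.
/// This is a simplified model: it ignores bus contention (e.g. with DMA).
#[derive(Clone, Copy)]
pub enum Op {
    Alu,            // add, shift, mov, cmp, ...: 1 cycle
    Load,           // ldr from SRAM: 2 cycles
    Store,          // str to SRAM: 2 cycles
    BranchTaken,    // taken branch: 2 cycles
    BranchNotTaken, // conditional branch that falls through: 1 cycle
}

/// Cycle cost of a single instruction class.
pub fn cycles(op: Op) -> u32 {
    match op {
        Op::Alu | Op::BranchNotTaken => 1,
        Op::Load | Op::Store | Op::BranchTaken => 2,
    }
}

/// Total cycles for one pass through a straight-line loop body.
pub fn loop_cycles(body: &[Op]) -> u32 {
    body.iter().map(|&op| cycles(op)).sum()
}

/// Whole CPU cycles available per pixel (integer division, rounds down).
pub fn cycles_per_pixel(sys_clock_hz: u32, pixel_clock_hz: u32) -> u32 {
    sys_clock_hz / pixel_clock_hz
}

/// Hypothetical inner loop producing one pixel per iteration:
/// load, two ALU ops, store, pointer increment,
/// compare against an end pointer held in a high register, branch back.
pub const EXAMPLE_BODY: [Op; 7] = [
    Op::Load,
    Op::Alu,
    Op::Alu,
    Op::Store,
    Op::Alu,
    Op::Alu, // cmp with a high register costs no extra cycle
    Op::BranchTaken,
];
```

With an assumed 252 MHz system clock and a 25.2 MHz pixel clock, `cycles_per_pixel` gives a budget of 10 cycles, and `loop_cycles(&EXAMPLE_BODY)` also comes to 10, so in this made-up case a single extra register move would already blow the budget.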
Ironically, the reason this project is on hold also points to the downside of assembler: since then, the RP2350 chip has come out, and huge parts of the project would need to be rewritten for it (though the result would be much, much more capable than the first version).
[1]: https://github.com/DusterTheFirst/pico-dvi-rs/blob/main/src/...