I got almost all of my wishes granted with RP2350

(dmitry.gr)

700 points elipsitz | 1 comments | 08 Aug 24 13:03 UTC

doe_eyes ◴[08 Aug 24 15:23 UTC] No.41192510[source]▶

>>41191069 (OP) #

I think it's a good way to introduce these chips, and it's a great project, but the author's (frankly weird) beef with STM32H7 is detracting from the point they're trying to make:

> So, in conclusion, go replan all your STM32H7 projects with RP2350, save money, headaches, and time.

STM32H7 chips can run much faster and have a wider selection of peripherals than RP2350. RP2350 excels in some other dimensions, including the number of (heterogeneous) cores. Either way, this is nowhere near apples-to-apples.

Further, they're not the only Cortex-M7 vendor, so if the conclusion is that STM32H7 sucks (it mostly doesn't), it doesn't follow that you should instead be using Cortex-M33 on RPi. You could be going with Microchip (hobbyist-friendly), NXP (preferred by many commercial buyers), or a number of lesser-known manufacturers.

replies(3): >>41192554 #>>41193627 #>>41193749 #

Archit3ch ◴[08 Aug 24 16:53 UTC] No.41193627[source]▶

>>41192510 #

> STM32H7 chips can run much faster

STM32H7 tops out at 600MHz. This has 2x 300MHz at 2-3 cycles/op FP64. So maybe your applications can fit into this?

replies(5): >>41194219 #>>41194333 #>>41194403 #>>41195297 #>>41195954 #

mordae ◴[08 Aug 24 19:31 UTC] No.41195297[source]▶

>>41193627 #

It's 6 cycles for dadd/dsub, 16 for dmul, 51 for ddiv.
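
To put those cycle counts in perspective, here is a rough sketch that converts them into a peak per-core operation rate. The 300 MHz clock is simply the figure quoted upthread (an assumption, not a datasheet value), the function name is made up for illustration, and the result ignores loads, stores and loop overhead, so treat it as an upper bound at best.

```python
# Cycle counts per fp64 operation, as quoted in the comment above.
CYCLES = {"dadd/dsub": 6, "dmul": 16, "ddiv": 51}

# Assumption: the 300 MHz per-core clock mentioned upthread.
CLOCK_HZ = 300_000_000


def peak_ops_per_second(clock_hz, cycles_per_op):
    """Upper bound on ops/s for one core; ignores loads, stores and loop overhead."""
    return clock_hz / cycles_per_op


for name, cycles in CYCLES.items():
    rate = peak_ops_per_second(CLOCK_HZ, cycles)
    print(f"{name}: {rate / 1e6:.2f} Mop/s")
```

Changing `CLOCK_HZ` to whatever clock you actually run at rescales all three figures linearly.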

replies(1): >>41200010 #

vardump ◴[09 Aug 24 08:46 UTC] No.41200010[source]▶

>>41195297 #

> 6 cycles for dadd/dsub

I guess it depends on whether you store to X (or Y), normalize & round (NRDD; is it really necessary after each addition?) and load X back every time.

Both X and Y have 64 bits of mantissa, 14 bits of exponent and 4 bits of flags, including sign. That's some headroom compared to IEEE 754 fp64's 53 bits of mantissa and 11 bits of exponent, so I'd assume normalization might not be necessary after every step.
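
The headroom is easy to spell out. This is only arithmetic on the field widths stated above; the register widths are taken from this comment rather than verified against the datasheet, and the dictionary names are made up for illustration.

```python
# Field widths of the coprocessor's X/Y registers, as stated in the comment above.
COPRO = {"mantissa": 64, "exponent": 14}

# IEEE 754 binary64: 53 bits of precision (including the implicit leading bit),
# 11 bits of exponent.
FP64 = {"mantissa": 53, "exponent": 11}

# Extra bits the internal format has over fp64 in each field.
headroom = {field: COPRO[field] - FP64[field] for field in FP64}
print(headroom)
```

That works out to 11 spare mantissa bits and 3 spare exponent bits. Whether that is enough to skip normalization between steps depends on how the hardware treats unnormalized operands, which this sketch says nothing about.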

The addition (X = X + Y) itself presumably takes 2 cycles, running coprocessor instructions ADD0 and ADD1. Add 1 cycle if normalization is always necessary and, for the simplest real-world case, 1 more for loading Y.
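
As a back-of-the-envelope model of that estimate (the per-step costs are the presumed figures from this comment, not measurements, and `add_cycles` is a hypothetical helper):

```python
def add_cycles(normalize=True, load_y=True):
    """Estimated cycles for X = X + Y under the assumptions in the comment above."""
    cycles = 2  # ADD0 + ADD1, presumed to take 2 cycles in total
    if normalize:
        cycles += 1  # assumed cost of a normalize & round step
    if load_y:
        cycles += 1  # assumed cost of loading a fresh Y operand
    return cycles


best_case = add_cycles(normalize=False, load_y=False)
simple_case = add_cycles(normalize=True, load_y=True)
print(best_case, simple_case)
```

Under these assumptions the estimate ranges from 2 to 4 cycles, against the 6 cycles quoted upthread for dadd/dsub. The gap would presumably be the store and load-back of X that a general-purpose routine has to do.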

Regardless, there might be some room for hand optimizing tight fp64 loops.

Edit: This is based on my current understanding of the available documentation. I might very well be wrong.
