700 points elipsitz | 9 comments
doe_eyes ◴[] No.41192510[source]
I think it's a good way to introduce these chips, and it's a great project, but the author's (frankly weird) beef with STM32H7 is detracting from the point they're trying to make:

> So, in conclusion, go replan all your STM32H7 projects with RP2350, save money, headaches, and time.

STM32H7 chips can run much faster and have a wider selection of peripherals than RP2350. RP2350 excels in some other dimensions, including the number of (heterogeneous) cores. Either way, this is nowhere near apples-to-apples.

Further, they're not the only Cortex-M7 vendor, so if the conclusion is that STM32H7 sucks (it mostly doesn't), it doesn't follow that you should instead be using Cortex-M33 on RPi. You could be going with Microchip (hobbyist-friendly), NXP (preferred by many commercial buyers), or a number of lesser-known manufacturers.

replies(3): >>41192554 #>>41193627 #>>41193749 #
1. Archit3ch ◴[] No.41193627[source]
> STM32H7 chips can run much faster

STM32H7 tops out at 600MHz. This has 2x 300MHz at 2-3 cycles/op FP64. So maybe your applications can fit into this?

replies(5): >>41194219 #>>41194333 #>>41194403 #>>41195297 #>>41195954 #
2. spacedcowboy ◴[] No.41194219[source]
I'm seeing several statements of 2x300MHz, but the page [1] says 2x150MHz M33s.

I know the RP2040s overclock a lot, but these are significantly more complex chips, so it seems less likely they'll overclock to 2x the base frequency.

[1] https://www.raspberrypi.com/news/raspberry-pi-pico-2-our-new...

replies(1): >>41196360 #
3. ◴[] No.41194333[source]
4. 15155 ◴[] No.41194403[source]
The STM32H7 and other M7 chips have caches - performance is night and day between 2x300MHz smaller, cacheless cores and chips with L1 caches (and things like TCM, etc.)

The SRAM in that H7 is running at commensurately-high speeds, as well.

Comparing an overclocked 2xM33 to a non-overclocked M7 is also probably a little inaccurate - that M7 will easily make more than the rated speed (not nearly as much as the RP2040 M0+, though.)

5. mordae ◴[] No.41195297[source]
It's 6 cycles for dadd/dsub, 16 for dmul, 51 for ddiv.
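
To put those cycle counts in perspective, here is a rough sketch (not from the thread) that converts them into peak per-core rates at the stock 150MHz and at the 300MHz overclock mentioned elsewhere in the discussion. It assumes each operation costs exactly the quoted number of cycles and ignores loads, stores, and loop overhead; the function and variable names are illustrative.

```python
# Cycle counts per FP64 operation, as quoted in the comment above.
CYCLES = {"dadd/dsub": 6, "dmul": 16, "ddiv": 51}


def ops_per_second(clock_hz: float, cycles_per_op: int) -> float:
    """Peak operations per second on one core (assumption: no other overhead)."""
    return clock_hz / cycles_per_op


# Print the estimate for the stock clock and the overclock discussed in the thread.
for mhz in (150, 300):
    for name, cycles in CYCLES.items():
        rate = ops_per_second(mhz * 1e6, cycles)
        print(f"{mhz} MHz, {name}: {rate / 1e6:.2f} M ops/s per core")
```

These are upper bounds for a single core; real code would land below them.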
replies(1): >>41200010 #
6. adrian_b ◴[] No.41195954[source]
As other posters have mentioned, this has 2 Cortex-M33 cores @ 150 MHz, not @ 300 MHz.

Cortex-M7 is in a different size class than Cortex-M33; it is about 50% faster at the same clock frequency, and it is also available at higher clock frequencies.

Cortex-M33 is the replacement for the older Cortex-M4 (while Cortex-M23 is the replacement for Cortex-M0+ and Cortex-M85 is the modern replacement for Cortex-M7).

While for a long time the Cortex-M MCUs had been available in 3 main sizes, Cortex-M0+, Cortex-M4 and Cortex-M7, for their modern replacements there is an additional size, Cortex-M55, which is intermediate between Cortex-M33 and Cortex-M85.
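
As a back-of-the-envelope illustration of the "about 50% faster at the same clock" point, the sketch below combines that factor with the clock figures quoted in this thread (600MHz for STM32H7, 150MHz stock and 300MHz overclocked for RP2350). It assumes the 1.5x per-clock factor holds across workloads and that work scales perfectly over both M33 cores, which real code rarely does; the function name is illustrative.

```python
def relative_throughput(m7_mhz: float, m33_mhz: float,
                        m33_cores: int = 1,
                        m7_per_clock_gain: float = 1.5) -> float:
    """Single-core M7 throughput divided by aggregate M33 throughput.

    Assumptions: the per-clock gain is a constant factor, and the M33
    workload parallelizes perfectly across m33_cores.
    """
    return (m7_mhz * m7_per_clock_gain) / (m33_mhz * m33_cores)


print(relative_throughput(600, 150))               # one stock M33 core
print(relative_throughput(600, 150, m33_cores=2))  # both cores, stock clock
print(relative_throughput(600, 300, m33_cores=2))  # both cores, overclocked
```

Under these assumptions the M7 stays ahead even against two overclocked M33 cores, and this still leaves out the cache and TCM effects raised earlier in the thread.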

7. mrandish ◴[] No.41196360[source]
TFA states extensive 300MHz OC with no special effort (and he's been evaluating pre-release versions for a year).

"It overclocks insanely well. I’ve been running the device at 300MHz in all of my projects with no issues at all."

Also

"Disclaimer: I was not paid or compensated for this article in any way. I was not asked to write it. I did not seek or obtain any approval from anyone to say anything I said. My early access to the RP2350 was not conditional on me saying something positive (or anything at all) about it publicly."

replies(1): >>41196786 #
8. spacedcowboy ◴[] No.41196786{3}[source]
Thanks, I missed that.
9. vardump ◴[] No.41200010[source]
> 6 cycles for dadd/dsub

I guess it depends whether you store to X (or Y), normalize & round (NRDD; is it really necessary after each addition?) and load X back every time.

Both X and Y have 64 bits of mantissa, 14 bits of exponent and 4 bits of flags, including sign. That leaves some headroom compared to IEEE 754 fp64's 53 bits of mantissa and 11 bits of exponent, so I'd assume normalization might not be necessary after every step.

The addition (X = X + Y) itself presumably takes 2 cycles, running the coprocessor instructions ADD0 and ADD1, plus 1 more cycle if normalization is always necessary. And for the simplest real-world case, add 1 more cycle for loading Y.

Regardless, there might be some room for hand optimizing tight fp64 loops.

Edit: This is based on my current understanding of the available documentation. I might very well be wrong.
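
The estimate above can be tallied with a small sketch. The per-step costs are the presumed figures from this comment (2 cycles for ADD0 plus ADD1, 1 for normalization, 1 for loading Y), not measured values, and the function names are illustrative.

```python
def dadd_cycles(normalize: bool, load_y: bool) -> int:
    """Cycle estimate for X = X + Y using the presumed costs from the comment."""
    cycles = 2           # ADD0 + ADD1
    if normalize:
        cycles += 1      # normalize and round after the addition
    if load_y:
        cycles += 1      # load the Y operand
    return cycles


def headroom_bits() -> tuple[int, int]:
    """Extra (mantissa, exponent) bits in X/Y relative to IEEE 754 fp64."""
    return 64 - 53, 14 - 11


print(dadd_cycles(normalize=False, load_y=False))  # best case
print(dadd_cycles(normalize=True, load_y=True))    # simplest real-world case
print(headroom_bits())
```

On these assumptions an addition in a tight loop costs 2 to 4 cycles, compared with the 6 cycles quoted for dadd/dsub, which is where the room for hand optimization would come from.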