Implementing neural networks on the "3 cent" 8-bit microcontroller

(cpldcpu.wordpress.com)

1. magicalhippo ◴[20 Oct 24 01:01 UTC] No.41892062[source]▶

>>41889467 (OP) #

Fun to see neural nets pushed to such extremes, really enjoyed the post.

> The smallest models had to be trained without data augmentation, as they would not converge otherwise.

Was this also the case for the 2-bit model you ended up with?

replies(1): >>41893238 #

2. malwrar ◴[20 Oct 24 02:11 UTC] No.41892377[source]▶

>>41889467 (OP) #

Super interesting!

I wish TFA had found some way to measure the PMS150C implementation the headline brags about, but even the PFS154 (2x mem, 3x price) version is super neat! Interesting to see how the net in particular is built at such a small scale. I also wish they had included performance numbers like they do in their linked CH32V003 post. I'm wondering how quick these MCUs are compared to each other and to e.g. OP's PC, and how hot they get under sustained load.

replies(1): >>41893406 #

3. cpldcpu ◴[20 Oct 24 06:05 UTC] No.41893238[source]▶

>>41892062 #

Yes, as far as I remember the limit was somewhere around 1 kbyte of total parameter size.

4. Lerc ◴[20 Oct 24 06:21 UTC] No.41893315[source]▶

>>41889467 (OP) #

I feel like to really get to the level of hypothetically useful it should be able to take the samples from an input source.

I wonder if you could do it on the full 28*28 by never holding the full image in memory at once and treating it as an input stream. Say, a 1D convolution on each line as it comes in turns a [1,28] into a [3,7]; buffering two lines of the [3,7] takes 42 values. Then, once three results of the third line's convolution have been produced ([3,3] = 9), start performing a 2D convolution using the first two lines [2,3,:3], replacing the data at the start (as it has already been processed).

replies(1): >>41893420 #

5. cpldcpu ◴[20 Oct 24 06:40 UTC] No.41893406[source]▶

>>41892377 #

There are no performance profiling mechanisms on these small devices, and the timers are rather coarse.

But it is easily possible to estimate the execution time:

- mulacc of one weight takes 11 clock cycles.

- There are 1696 weights in the model, each one is only touched once.

- We can assume ~25%-50% overhead for loops and housekeeping (1:4 unrolled)

=> ~23000-28000 clock cycles per inference, which is less than 2ms at 16MHz

Since this is an MLP, the inference time directly scales with the number of weights. (This would be different for a CNN)
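The estimate above can be reproduced with a few lines of integer arithmetic. This is only a sketch: the function names `estimate_cycles` and `cycles_to_us` are made up for illustration, and the overhead percentage is the rough guess quoted above, not a measured value.

```c
#include <stdint.h>

/* Hypothetical helper: cycles for one inference when every weight costs
 * cycles_per_mac cycles, plus a percentage of loop/housekeeping overhead. */
uint32_t estimate_cycles(uint32_t weights, uint32_t cycles_per_mac,
                         uint32_t overhead_pct)
{
    uint32_t base = weights * cycles_per_mac;
    return base + (base * overhead_pct) / 100u;
}

/* Hypothetical helper: convert a cycle count to whole microseconds
 * (rounded down) at the given clock frequency in Hz. */
uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / clock_hz);
}
```

With 1696 weights at 11 cycles each, this gives 23320 cycles at 25% overhead and 27984 cycles at 50%, i.e. roughly 1.5 to 1.75 ms at 16 MHz.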

As for verifying on the PMS150C: I considered using an LED for valid/invalid output. But iterating with OTP devices is quite tedious when you do not have an emulator. Since both devices are code compatible, we can assume that the code works on the smaller device, though.

replies(1): >>41895402 #

6. cpldcpu ◴[20 Oct 24 06:44 UTC] No.41893420[source]▶

>>41893315 #

Yes, you could implement it in a way where the first layer is streamed and the output activations are accumulated in parallel in memory. This would limit the memory requirements for the input activations, but would increase execution time, as more activations have to be shuffled around.

In this case I am streaming from ROM anyway, so it does not matter if the inputs are read only once or multiple times.
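A minimal sketch of what such a streamed first layer could look like in C. The sizes, the weight values and the names `stream_begin` and `stream_push` are all hypothetical, and this is not the code from the article. Each input sample is consumed once and added into every output accumulator, so only the accumulators have to stay in RAM.

```c
#include <stdint.h>

#define N_IN  4   /* hypothetical number of inputs       */
#define N_OUT 3   /* hypothetical number of hidden units */

/* Hypothetical weights, stored input-major so that all weights belonging
 * to one input sample are adjacent (on a real part these would sit in ROM). */
const int8_t weights[N_IN][N_OUT] = {
    { 1, -1,  2},
    { 0,  3, -2},
    {-1,  1,  1},
    { 2,  0, -3},
};

/* Only the output accumulators are kept in RAM. */
int16_t acc[N_OUT];

/* Clear the accumulators before the first sample arrives. */
void stream_begin(void)
{
    uint8_t o;
    for (o = 0; o < N_OUT; o++) {
        acc[o] = 0;
    }
}

/* Consume input sample number `index` with value `x`, then forget it. */
void stream_push(uint8_t index, int8_t x)
{
    uint8_t o;
    for (o = 0; o < N_OUT; o++) {
        acc[o] += (int16_t)weights[index][o] * x;
    }
}
```

The trade-off is the one mentioned above: every incoming sample touches all N_OUT accumulators, so there is more read-modify-write traffic than when a single accumulator is held while looping over the inputs.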

7. amelius ◴[20 Oct 24 11:14 UTC] No.41894585[source]▶

>>41889467 (OP) #

This challenges only the memory of the MCU, not the speed.

And it is a bit disappointing that they didn't finish the project by adding an 8x8 pixel camera and a 7-segment display.

replies(1): >>41895503 #

8. pjmlp ◴[20 Oct 24 11:59 UTC] No.41894774[source]▶

>>41889467 (OP) #

As proof of concept, it is quite cool.

However for going into production with something like this, maybe writing everything in Assembly, and not just some parts, would be much better.

But after a quick search it seems the macro assembler story for RISC-V isn't that great.

replies(1): >>41896587 #

9. wongarsu ◴[20 Oct 24 13:58 UTC] No.41895402{3}[source]▶

>>41893406 #

If flipping one of the output pins is fast enough you could use that in combination with an oscilloscope as a coarse but very accurate profiling method.

Though I believe for most people "roughly 2ms" is good enough

10. robertclaus ◴[20 Oct 24 14:13 UTC] No.41895503[source]▶

>>41894585 #

Is an 8bit camera an off-the-shelf part? Or do you just mean data streaming in and out in general?

replies(2): >>41895585 #>>41897170 #

11. amelius ◴[20 Oct 24 14:28 UTC] No.41895585{3}[source]▶

>>41895503 #

Here is an example of an 8x8 camera sensor for hobby use. You can filter out the IR if desired. There are many similar sensors, and they are often used for motion tracking.

https://learn.adafruit.com/adafruit-amg8833-8x8-thermal-came...

replies(1): >>41898200 #

12. Someone ◴[20 Oct 24 14:42 UTC] No.41895673[source]▶

>>41889467 (OP) #

FTA: “One major issue when programming these devices in C is that every function call consumes RAM for the return stack and function parameters. This is unavoidable”

It’s not completely unavoidable: don’t use function parameters (globals are your friends on these CPUs). You can’t avoid having a return stack, but you can make as few function calls as possible (ideally zero, but you may have to write functions to fit things into ROM)

> “To solve this, I flattened the inference code”

I think that’s “make as few function calls as possible”

> and implemented the inner loop in assembly to optimize variable usage.

That _should_ only make a difference for memory usage if your C compiler isn’t perfect (but of course, it never is, certainly on CPUs like this one, which is a poor fit for C)
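To illustrate the "globals instead of parameters" idea, here is a small sketch. The names (`g_weights`, `g_inputs`, `g_acc`, `dot_globals`) and the layer width are made up, and whether this actually saves RAM depends on how the specific compiler passes arguments, which is an assumption here rather than something taken from the article.

```c
#include <stdint.h>

#define N_IN 4   /* hypothetical layer width */

/* File-scope variables get fixed addresses, so nothing has to be
 * copied into parameter storage for the call. */
int8_t  g_weights[N_IN];
int8_t  g_inputs[N_IN];
int16_t g_acc;

/* No parameters and no return value: ideally the call itself only
 * costs the return address on the stack. */
void dot_globals(void)
{
    uint8_t i;
    g_acc = 0;
    for (i = 0; i < N_IN; i++) {
        g_acc += (int16_t)g_weights[i] * g_inputs[i];
    }
}
```

Flattening goes one step further: if the loop body is pasted directly at its only call site, even the return address is no longer needed.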

replies(4): >>41895876 #>>41896253 #>>41897172 #>>41898306 #

13. dpassens ◴[20 Oct 24 15:20 UTC] No.41895876[source]▶

>>41895673 #

You could kind of avoid the return stack, if you only ever do tail calls. Obviously, that's pretty unrealistic, but it's possible.

replies(1): >>41896707 #

14. cpldcpu ◴[20 Oct 24 16:03 UTC] No.41896253[source]▶

>>41895673 #

>That _should_ only make a difference for memory usage if your C compiler isn’t perfect

Considering that the PMS150C has an accumulator-based 8-bit architecture which is almost hostile to C, it is safe to assume that the compiler is not perfect :)

15. kragen ◴[20 Oct 24 16:40 UTC] No.41896587[source]▶

>>41894774 #

The Padauk chips being discussed here aren't RISC-V. Gas is an adequate macro assembler for RISC-V, but I think the Padauk chips don't have anything similar. Still, you can get pretty far with m4... or writing a shell script with an echo function.

16. ska ◴[20 Oct 24 16:52 UTC] No.41896707{3}[source]▶

>>41895876 #

Assuming your compiler implements TCO properly, also.

17. vardump ◴[20 Oct 24 17:54 UTC] No.41897170{3}[source]▶

>>41895503 #

Optical mice have similar cameras.

18. whobre ◴[20 Oct 24 17:54 UTC] No.41897172[source]▶

>>41895673 #

I bet something like Forth would work better on such a microcontroller. It is known to produce very high code density and embedding assembly is usually very straightforward.

19. numpad0 ◴[20 Oct 24 20:20 UTC] No.41898200{4}[source]▶

>>41895585 #

The AMG8833 is an actual thermal sensor array, not e.g. an 8x8 photodiode array with IR sensitivity. It is of the thermopile type, which is different from the microbolometer type used in FLIR cameras (those have to go blank periodically, unlike thermopiles, which do not).

Mouse sensors are 8x8ish cameras, but very few of them (basically only the genuine HP/Agilent/Avago ADNS-2610 at the bottom price tier) have a raw image export feature, for some reason.

There are many other tiny potato camera parts on various markets, but most of them lack a datasheet and require complicated interfacing.

Overall it's actually not so trivial to get a small image sensor for hobby experiments.

20. varispeed ◴[20 Oct 24 20:44 UTC] No.41898306[source]▶

>>41895673 #

There is probably a way to change the calling convention to use something other than the stack.
