Most active commenters
  • zootboy(3)

←back to thread

287 points shadaj | 11 comments | | HN request time: 1.118s | source | bottom
Show context
bsnnkv ◴[] No.43196091[source]
Last month I switched from a role working on a distributed system (FAANG) to a role working on embedded software which runs on cards in data center racks.

I was in my last role for a year, and 90%+ of my time was spent investigating things that went "missing" at one of many failure points between one of the many distributed components.

I wrote less than 200 lines of code that year and I experienced the highest level of burnout in my professional career.

The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it. Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

So far the culture in my new embedded (Rust, fwiw) position is the complete opposite. If you're burnt out working on distributed systems and you care about some of the same things that I do, it's worth giving embedded software dev a shot.

replies(24): >>43196122 #>>43196159 #>>43196163 #>>43196180 #>>43196239 #>>43196674 #>>43196899 #>>43196910 #>>43196931 #>>43197177 #>>43197902 #>>43198895 #>>43199169 #>>43199589 #>>43199688 #>>43199980 #>>43200186 #>>43200596 #>>43200725 #>>43200890 #>>43202090 #>>43202165 #>>43205115 #>>43208643 #
EtCepeyd ◴[] No.43196239[source]
This resonates a lot with me.

Distributed systems require insanely hard math at the bottom (paxos, raft, gossip, vector clocks, ...) It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard. Embedded systems sometimes require the parallelization of some hot spots, but those are more like the exception AIUI, and you have a lot more control over things; everything is more local and sequential. Even data race free multi-threaded programming in modern C and C++ is incredibly annoying; I dislike dealing with both an explicit mesh of peers, and with a leaky abstraction that lies that threads are "symmetric" (as in SMP) while in reality there's a complicated messaging network underneath. Embedded is simpler, and it seems to require less that practitioners become advanced mathematicians for day to day work.

replies(5): >>43196342 #>>43196567 #>>43196906 #>>43197331 #>>43199711 #
1. AlotOfReading ◴[] No.43196906[source]
Most embedded systems are distributed systems these days, there's simply a cultural barrier that prevents most practitioners from fully grappling with that fact. A lot of systems I've worked on have benefited from copying ideas invented by distributed systems folks working on networking stuff 20 years ago.
replies(3): >>43197438 #>>43199391 #>>43199780 #
2. DanielHB ◴[] No.43197438[source]
I worked in an IoT platform that consisted of 3 embedded CPUs and one linux board. The kicker was that the linux board could only talk directly to one of the chips, but had to be capable of updating the software running on all of them.

That platform was parallelizable of up to 6 of its kind in a master-slave configuration (so the platform in the physical position 1 would assume the "master role" for a total of 18 embedded chips and 6 linux boards) on top of having optionally one more box with one more CPU in it for managing some other stuff and integrating with each of our clients hardware. Each client had a different integration, but at least they mostly integrated with us, not the other way around.

Yeah it was MUCH more complex than your average cloud. Of course the original designers didn't even bother to make a common network protocol for the messages, so each point of communication not only used a different binary format, they also used different wire formats (CAN bus, Modbus and ethernet).

But at least you didn't need to know kubernetes, just a bunch of custom stuff that wasn't well documented. Oh yeah and don't forget the boot loaders for each embedded CPU, we had to update the bootloaders so many times...

The only saving grace is that a lot of the system could rely on the literal physical security because you need to have physical access (and a crane) to reach most of the system. Pretty much only the linux boards had to have high security standards and that was not that complicated to lock down (besides maintaining a custom yocto distribution that is).

replies(1): >>43197538 #
3. AlotOfReading ◴[] No.43197538[source]
Many automotive systems have >100 processors scattered around the vehicle, maybe a dozen of which are "important". I'm amazed they ever work given the quality of the code running on them.
replies(1): >>43197571 #
4. DanielHB ◴[] No.43197571{3}[source]
A LOT of QA
5. zootboy ◴[] No.43199391[source]
Indeed. I've been building systems that orchestrate batteries and power sources. Turns out, it's a difficult problem to temporally align data points produced by separate components that don't share any sort of common clock source. Just take the latest power supply current reading and subtract the latest battery current reading to get load current? Oops, they don't line up, and now you get bizarre values (like negative load power) when there's a fast load transient.

Even more fun when multiple devices share a single communication bus, so you're basically guaranteed to not get temporally-aligned readings from all of the devices.

replies(1): >>43200688 #
6. anitil ◴[] No.43199780[source]
Yes even 'simple' devices these days will have devices (ADC/SPI etc) running in parallel often using DMA, multiple semi-independent clocks, possibly nested interrupts etc. Oh and the UART for some reason always, always has bugs, so hopefully you're using multiple levels of error checking.
replies(1): >>43200654 #
7. zootboy ◴[] No.43200654[source]
Yeah, it was a "fun" surprise to discover the errata sheet for the microcontroller I was working with after beating my head against the wall trying to figure out why it doesn't do what the reference manual says it should do. It's especially "fun" when the errata is "The hardware flow control doesn't work. Like, at all. Just don't even try."
replies(1): >>43201489 #
8. szvsw ◴[] No.43200688[source]
I run a small SaaS side hustle where the core value proposition of the product - at least what got us our first customers, even if they did not realize what was happening under the hood - is, essentially, an implementation of NTP running over HTTPS that can be run on some odd devices and sync those devices to mobile phones via a front end app and backend server. There’s some other CMS stuff that makes it easy for the various customers to serve their content to their customers’ devices, but at the end of the day our core trade secret is just using a roll-your-own NTP implementation… I love how NTP is just the tip of the iceberg when it comes to the wicked problem of aligning clocks. This is all just to say - I feel your pain, but also not really since it sounds like you are dealing with higher precision and greater challenges than I ever had to!

Here’s a great podcast on the topic which you will surely like!

https://signalsandthreads.com/clock-synchronization/

And a related HN thread in case you missed it:

https://news.ycombinator.com/item?id=39298652

replies(1): >>43200976 #
9. zootboy ◴[] No.43200976{3}[source]
The ultimate frustration is when you have no real ability to fix the core problem. NTP (and its 'roided-up cousin PTP) are great, but they require a degree of control and influence over the end devices that I just don't have. No amount of pleading will get a battery vendor to implement NTP in their BMS firmware, and I don't have nearly enough stacks of cash to wave around to commission a custom firmware. So I'm pretty much stuck with the "black box cat herding" technique of interoperation.
replies(1): >>43201262 #
10. szvsw ◴[] No.43201262{4}[source]
Yeah, that makes sense. We are lucky in that we get to deploy our code to the devices. It’s not really “embedded” in the sense most people use as these are essentially sandboxed Linux devices that only run applications written in a programming language specific to these devices which is similar to Lua/python but the scripts get turned into byte code at boot IIRC, but none the less very powerful/fast.

You work on BMS stuff? That’s cool- a little bit outside my domain (I do energy modeling research for buildings) but have been to some fun talks semi-recently about BMs/BAS/telemetry in buildings etc. The whole landscape seems like a real mess there.

FYI that podcast I linked has some interesting discussion about some issues with PTP over NTP- worth listening to for sure.

11. anitil ◴[] No.43201489{3}[source]
The thing that would break my brain is that the errata is a pdf that you get from .... some link, somewhere