287 points shadaj | 22 comments
bsnnkv ◴[] No.43196091[source]
Last month I switched from a role working on a distributed system (FAANG) to a role working on embedded software which runs on cards in data center racks.

I was in my last role for a year, and 90%+ of my time was spent investigating things that went "missing" at one of the many failure points between the many distributed components.

I wrote less than 200 lines of code that year and I experienced the highest level of burnout in my professional career.

The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it. Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

So far the culture in my new embedded (Rust, fwiw) position is the complete opposite. If you're burnt out working on distributed systems and you care about some of the same things that I do, it's worth giving embedded software dev a shot.

replies(24): >>43196122 #>>43196159 #>>43196163 #>>43196180 #>>43196239 #>>43196674 #>>43196899 #>>43196910 #>>43196931 #>>43197177 #>>43197902 #>>43198895 #>>43199169 #>>43199589 #>>43199688 #>>43199980 #>>43200186 #>>43200596 #>>43200725 #>>43200890 #>>43202090 #>>43202165 #>>43205115 #>>43208643 #
1. EtCepeyd ◴[] No.43196239[source]
This resonates a lot with me.

Distributed systems require insanely hard math at the bottom (paxos, raft, gossip, vector clocks, ...). It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard. Embedded systems sometimes require the parallelization of some hot spots, but those are more like the exception AIUI, and you have a lot more control over things; everything is more local and sequential. Even data race free multi-threaded programming in modern C and C++ is incredibly annoying; I dislike dealing with both an explicit mesh of peers, and with a leaky abstraction that lies that threads are "symmetric" (as in SMP) while in reality there's a complicated messaging network underneath. Embedded is simpler, and it seems less likely to require that practitioners become advanced mathematicians for day-to-day work.

replies(5): >>43196342 #>>43196567 #>>43196906 #>>43197331 #>>43199711 #
2. Thaxll ◴[] No.43196342[source]
It does not require any math, because 99.9% of the time the issue is not in the low-level implementation but in the business logic that the dev wrote.

No one goes to review the transaction engine of Postgres.

replies(1): >>43196485 #
3. EtCepeyd ◴[] No.43196485[source]
I tend to disagree.

- You work on postgres: you have to deal with the transaction engine's internals.

- You work in enterprise application integration (EAI): you have ten legacy systems that inevitably don't all interoperate with any one specific transaction manager product. Thus, you have to build adapters, message routing and propagation, gateways, at-least-once-but-idempotent delivery, and similar stuff yourself (a sketch of the idempotent-consumer part follows below). SQL business logic will be part of it, but it will not solve the hard problems, and you still have to dig through multiple log files on multiple servers, hoping that you can rely on unique request IDs end-to-end (and that the timestamps across those multiple servers won't be overly contradictory).

In other words: same challenges at either end of the spectrum.
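
To make the "at-least-once-but-idempotent" part concrete, here is a minimal sketch of an idempotent consumer keyed by a unique end-to-end request ID. The names and the in-memory dedupe set are illustrative assumptions, not a description of any particular system:

    use std::collections::HashSet;

    // Hypothetical message shape; the unique request ID is assumed to be
    // propagated end-to-end by every hop.
    struct Message {
        request_id: String,
        payload: String,
    }

    // At-least-once delivery means the same message may arrive more than once;
    // the consumer stays correct by remembering which request IDs it has
    // already applied and skipping duplicates.
    struct Consumer {
        seen: HashSet<String>, // in production this would be durable, not in-memory
    }

    impl Consumer {
        fn handle(&mut self, msg: &Message) {
            if !self.seen.insert(msg.request_id.clone()) {
                return; // duplicate redelivery: already processed, do nothing
            }
            // Apply the business logic exactly once per request ID.
            println!("processing {}: {}", msg.request_id, msg.payload);
        }
    }

    fn main() {
        let mut c = Consumer { seen: HashSet::new() };
        let m = Message { request_id: "req-42".into(), payload: "debit account".into() };
        c.handle(&m);
        c.handle(&m); // redelivery is a no-op
    }

In a real integration the seen-set lives in durable storage and gets expired, but the shape is the same: dedupe on the request ID before applying the business logic.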

replies(1): >>43196946 #
4. motorest ◴[] No.43196567[source]
> Distributed systems require insanely hard math at the bottom (paxos, raft, gossip, vector clocks, ...). It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard.

I think this take is misguided. Most systems nowadays, especially those involving any sort of network calls, are already distributed systems. Yet the number of systems that come anywhere close to touching fancy consensus algorithms is very, very limited. If you are in a position to design a system and you hear "Paxos" coming out of your mouth, that's the moment you need to step back and think about what you are doing. Odds are you are creating your own problems, and then blaming the tools.

replies(2): >>43197705 #>>43199817 #
5. AlotOfReading ◴[] No.43196906[source]
Most embedded systems are distributed systems these days; there's simply a cultural barrier that prevents most practitioners from fully grappling with that fact. A lot of systems I've worked on have benefited from copying ideas invented by distributed systems folks working on networking stuff 20 years ago.
replies(3): >>43197438 #>>43199391 #>>43199780 #
6. pfannkuchen ◴[] No.43196946{3}[source]
Yeah this is kind of an abstraction failure of the infrastructure. Ideally the surface visible to the user should be simple across the entire spectrum of use cases. In some very, very rare cases one necessarily has to spelunk under the facade and know something about the internals, but for some reason it seems to happen much more often in the real world. I think people often don't put enough effort into making their system model fit with the native model of the infrastructure, and instead torture the infrastructure interface (often including the "break glass" parts) to fit into their a priori system model.
7. toast0 ◴[] No.43197331[source]
That's true, but you can do a lot of that once, and then get on with your life, if you build the right structures. I've gotten a huge amount of mileage from consensus to decide where to send reads/writes to, then everyone sends their reads/writes for the same piece of data to the same place; that place does the application logic where it's simple, and sends the result back. If you don't get the result back in time, bubble it up to the end-user application and it may retry or not, depending.

This is built on the assumption that the network is either working or the server team / ops team has been paged and is actively trying to figure it out. It doesn't work nearly as well if you work in an environment where the network is consistently slightly broken.
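
A minimal sketch of that routing idea, assuming the ownership table has already been agreed on by whatever consensus machinery sits underneath (the names, the hash, and the RPC placeholder are all illustrative):

    use std::collections::HashMap;
    use std::time::Duration;

    type NodeId = u32;

    // Ownership table produced by some consensus round: key-hash bucket -> owning node.
    // Every client applies the same deterministic mapping, so all reads/writes
    // for one piece of data land on the same owner.
    struct RoutingTable {
        buckets: HashMap<u64, NodeId>,
        num_buckets: u64,
    }

    impl RoutingTable {
        fn owner_of(&self, key: &str) -> Option<NodeId> {
            let bucket = fnv1a(key) % self.num_buckets;
            self.buckets.get(&bucket).copied()
        }
    }

    // Tiny stand-in hash (FNV-1a); any stable hash works.
    fn fnv1a(key: &str) -> u64 {
        key.bytes()
            .fold(0xcbf29ce484222325u64, |h, b| (h ^ b as u64).wrapping_mul(0x100000001b3))
    }

    // Placeholder for the RPC to the owner: either a result comes back within
    // the deadline or the error is surfaced to the caller instead of being
    // retried internally.
    fn forward_write(owner: NodeId, key: &str, value: &str, timeout: Duration) -> Result<(), String> {
        let _ = (key, value, timeout);
        if owner == 0 { Err("timed out".to_string()) } else { Ok(()) }
    }

    fn main() {
        let table = RoutingTable {
            buckets: HashMap::from([(0, 1), (1, 2), (2, 3), (3, 1)]),
            num_buckets: 4,
        };
        let key = "user:1234";
        match table.owner_of(key) {
            Some(owner) => match forward_write(owner, key, "hello", Duration::from_millis(50)) {
                Ok(()) => println!("write for {key} handled by node {owner}"),
                Err(e) => println!("bubbling up to the caller: {e}"),
            },
            None => println!("no owner assigned yet"),
        }
    }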

8. DanielHB ◴[] No.43197438[source]
I worked in an IoT platform that consisted of 3 embedded CPUs and one linux board. The kicker was that the linux board could only talk directly to one of the chips, but had to be capable of updating the software running on all of them.

Up to 6 of those platforms could be chained in a master-slave configuration (the platform in physical position 1 would assume the "master" role, for a total of 18 embedded chips and 6 linux boards), on top of optionally having one more box with one more CPU in it for managing some other stuff and integrating with each of our clients' hardware. Each client had a different integration, but at least they mostly integrated with us, not the other way around.

Yeah, it was MUCH more complex than your average cloud. Of course, the original designers didn't even bother to define a common network protocol for the messages, so each point of communication not only used a different binary format but also a different wire format (CAN bus, Modbus, and Ethernet).

But at least you didn't need to know Kubernetes, just a bunch of custom stuff that wasn't well documented. Oh yeah, and don't forget the bootloaders for each embedded CPU - we had to update those bootloaders so many times...

The only saving grace is that a lot of the system could rely on the literal physical security because you need to have physical access (and a crane) to reach most of the system. Pretty much only the linux boards had to have high security standards and that was not that complicated to lock down (besides maintaining a custom yocto distribution that is).

replies(1): >>43197538 #
9. AlotOfReading ◴[] No.43197538{3}[source]
Many automotive systems have >100 processors scattered around the vehicle, maybe a dozen of which are "important". I'm amazed they ever work given the quality of the code running on them.
replies(1): >>43197571 #
10. DanielHB ◴[] No.43197571{4}[source]
A LOT of QA
11. convolvatron ◴[] No.43197705[source]
This is completely backwards. The tools may have some internal consistency guarantees, handle some classes of failures, etc. They are leaky abstractions that are partially correct. They were not collectively designed to handle all failures and give consistent views no matter how they are composed.

From the other direction, Paxos, two generals, serializability, etc. are not hard concepts at all. Implementing custom solutions in this space _is_ hard and prone to error, but the foundations are simple and sound.

You seem to be claiming that you shouldn't need to understand the latter, that the former gives you everything you need. I would say that if you build systems using existing tools without even thinking about the latter, you're just signing up to handle preventable errors manually, and treating this box that you own as black and inscrutable.

12. zootboy ◴[] No.43199391[source]
Indeed. I've been building systems that orchestrate batteries and power sources. Turns out, it's a difficult problem to temporally align data points produced by separate components that don't share any sort of common clock source. Just take the latest power supply current reading and subtract the latest battery current reading to get load current? Oops, they don't line up, and now you get bizarre values (like negative load power) when there's a fast load transient.

Even more fun when multiple devices share a single communication bus, so you're basically guaranteed to not get temporally-aligned readings from all of the devices.
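
One common mitigation is to timestamp every sample on arrival with the host's clock and interpolate both series to a common instant before differencing, instead of subtracting the two "latest" readings. A minimal sketch, with made-up names and numbers:

    // Hypothetical sample: a current reading plus a receive timestamp in
    // milliseconds on the host's clock (the devices share no clock of their own).
    #[derive(Clone, Copy)]
    struct Sample {
        t_ms: u64,
        amps: f64,
    }

    // Linearly interpolate a series at time `t`; returns None if `t` falls
    // outside the series (no extrapolation, so transients don't invent data).
    fn interp_at(series: &[Sample], t: u64) -> Option<f64> {
        let i = series.windows(2).position(|w| w[0].t_ms <= t && t <= w[1].t_ms)?;
        let (a, b) = (series[i], series[i + 1]);
        if b.t_ms == a.t_ms {
            return Some(a.amps);
        }
        let frac = (t - a.t_ms) as f64 / (b.t_ms - a.t_ms) as f64;
        Some(a.amps + frac * (b.amps - a.amps))
    }

    fn main() {
        // Power-supply and battery currents sampled at unrelated instants.
        let psu = [Sample { t_ms: 0, amps: 10.0 }, Sample { t_ms: 100, amps: 12.0 }];
        let batt = [Sample { t_ms: 40, amps: 2.0 }, Sample { t_ms: 140, amps: 3.0 }];

        // Estimate load current at one common instant rather than subtracting
        // two "latest" readings taken at different times.
        let t = 90;
        if let (Some(p), Some(b)) = (interp_at(&psu, t), interp_at(&batt, t)) {
            println!("load at t={t} ms: {:.2} A", p - b);
        }
    }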

replies(1): >>43200688 #
13. PaulDavisThe1st ◴[] No.43199711[source]
> Even data race free multi-threaded programming in modern C and C++ is incredibly annoying; I dislike dealing with both an explicit mesh of peers, and with a leaky abstraction that lies that threads are "symmetric" (as in SMP) while in reality there's a complicated messaging network underneath.

If you're using traditional (p)threads-derived APIs to get work done on a message passing system, I'd say you're using the wrong API.

More likely, I don't understand what you might mean here.

replies(1): >>43199965 #
14. anitil ◴[] No.43199780[source]
Yes, even 'simple' devices these days will have peripherals (ADC/SPI etc.) running in parallel, often using DMA, multiple semi-independent clocks, possibly nested interrupts, etc. Oh, and the UART for some reason always, always has bugs, so hopefully you're using multiple levels of error checking.
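
As an example of what one of those layers of error checking can look like, here is a hedged sketch of a tiny CRC-checked frame format on top of a raw UART byte stream. The frame layout and the CRC-16/CCITT polynomial are illustrative choices, not something from the comment above; real links usually add sequence numbers and ack/retry on top:

    // Frames are [0x7E][len][payload...][crc lo][crc hi]; anything that fails
    // the length or CRC check is dropped rather than trusted.
    fn crc16_ccitt(data: &[u8]) -> u16 {
        let mut crc: u16 = 0xFFFF;
        for &b in data {
            crc ^= (b as u16) << 8;
            for _ in 0..8 {
                crc = if crc & 0x8000 != 0 { (crc << 1) ^ 0x1021 } else { crc << 1 };
            }
        }
        crc
    }

    fn encode_frame(payload: &[u8]) -> Vec<u8> {
        let mut f = vec![0x7E, payload.len() as u8];
        f.extend_from_slice(payload);
        f.extend_from_slice(&crc16_ccitt(payload).to_le_bytes());
        f
    }

    // Returns the payload if the frame is well-formed, None otherwise.
    fn decode_frame(frame: &[u8]) -> Option<&[u8]> {
        if frame.len() < 4 || frame[0] != 0x7E {
            return None;
        }
        let len = frame[1] as usize;
        if frame.len() != len + 4 {
            return None;
        }
        let payload = &frame[2..2 + len];
        let crc = u16::from_le_bytes([frame[2 + len], frame[3 + len]]);
        (crc == crc16_ccitt(payload)).then_some(payload)
    }

    fn main() {
        let frame = encode_frame(b"hello");
        assert_eq!(decode_frame(&frame), Some(&b"hello"[..]));
    }
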
replies(1): >>43200654 #
15. yodsanklai ◴[] No.43199817[source]
I remember when I prepared for system design interviews at a FAANG, I was anxious I would get asked about Paxos (which I learned at school). Now that I'm working there, I've never heard about Paxos or fancy distributed algorithms. We rely on various high-level services for deployment, partitioning, monitoring, logging, service discovery, storage...

And Paxos doesn't require much maths. It's pretty tricky to consider all possible interleavings, but in terms of maths, it's really basic discrete maths.
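
To give a sense of how little maths there is in the core, here is a sketch of just the single-decree Paxos acceptor rules - a couple of integer comparisons. Proposers, learners, quorums, and persistence are all omitted, and that's exactly where the tricky interleavings live:

    // Illustrative names only; a real acceptor must persist this state.
    #[derive(Default)]
    struct Acceptor {
        promised: u64,                   // highest proposal number promised so far
        accepted: Option<(u64, String)>, // last accepted (proposal number, value)
    }

    impl Acceptor {
        // Phase 1: promise to ignore proposals numbered below `n`, and report
        // anything already accepted. Some(..) = promise granted, None = rejected.
        fn prepare(&mut self, n: u64) -> Option<Option<(u64, String)>> {
            if n > self.promised {
                self.promised = n;
                Some(self.accepted.clone())
            } else {
                None
            }
        }

        // Phase 2: accept (n, value) unless a higher-numbered promise was made.
        fn accept(&mut self, n: u64, value: String) -> bool {
            if n >= self.promised {
                self.promised = n;
                self.accepted = Some((n, value));
                true
            } else {
                false
            }
        }
    }

    fn main() {
        let mut a = Acceptor::default();
        assert!(a.prepare(1).is_some());
        assert!(a.accept(1, "x".into()));
        assert!(a.prepare(1).is_none()); // a competing proposer reusing 1 is refused
    }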

replies(1): >>43202413 #
16. EtCepeyd ◴[] No.43199965[source]
Sorry, I figure I ended up spewing a bit of gibberish.

- By "explicit mesh of peers", I referred to atomics and the modern (C11 and later) memory model. The memory model, as written up for example in the C11 and later standards, is impenetrable. While the atomics interfaces do resemble a message passing system between threads (a small release/acquire example follows below), and therefore seem to match the underlying hardware closely, they are discomforting because their foundation, the memory model, is in fact laid out in the PhD dissertation of Mark John Batty, "The C11 and C++11 Concurrency Model" -- 400+ pages! <https://www.cl.cam.ac.uk/~pes20/papers/topic.c11.group_abstr...>

- By "leaky abstraction", I mean the stronger posix threads / standard C threads interfaces. They are more intuitive and safer, but are more distant from the hardware, so people sometimes frown at them for being expensive.
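
For what it's worth, the message-passing idiom the whole model is built around is small. A minimal release/acquire sketch, written in Rust only because it shares the same C11-derived memory model (the flag/data names are illustrative):

    use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
    use std::thread;

    // Writer publishes the data, then sets the flag with Release; the reader
    // observes the flag with Acquire and is then guaranteed to see the data.
    static DATA: AtomicU32 = AtomicU32::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        let producer = thread::spawn(|| {
            DATA.store(42, Ordering::Relaxed);    // write the payload
            READY.store(true, Ordering::Release); // publish it
        });

        let consumer = thread::spawn(|| {
            while !READY.load(Ordering::Acquire) {} // spin until published
            // The Acquire load synchronizes-with the Release store, so this read sees 42.
            assert_eq!(DATA.load(Ordering::Relaxed), 42);
        });

        producer.join().unwrap();
        consumer.join().unwrap();
    }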

17. zootboy ◴[] No.43200654{3}[source]
Yeah, it was a "fun" surprise to discover the errata sheet for the microcontroller I was working with after beating my head against the wall trying to figure out why it doesn't do what the reference manual says it should do. It's especially "fun" when the errata is "The hardware flow control doesn't work. Like, at all. Just don't even try."
replies(1): >>43201489 #
18. szvsw ◴[] No.43200688{3}[source]
I run a small SaaS side hustle where the core value proposition of the product - at least what got us our first customers, even if they did not realize what was happening under the hood - is, essentially, an implementation of NTP running over HTTPS that can be run on some odd devices and sync those devices to mobile phones via a front-end app and a backend server. There’s some other CMS stuff that makes it easy for the various customers to serve their content to their customers’ devices, but at the end of the day our core trade secret is just a roll-your-own NTP implementation… I love how NTP is just the tip of the iceberg when it comes to the wicked problem of aligning clocks. This is all just to say: I feel your pain, but also not really, since it sounds like you are dealing with higher precision and greater challenges than I ever had to!
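
For anyone curious, the core of an NTP-style exchange is just four timestamps and two subtractions. The sketch below shows that calculation only; the timestamps, transport, and numbers are made up for illustration, not details of the product above:

    // The client records t0 when it sends the request and t3 when the reply
    // arrives; the server stamps t1 on receipt and t2 just before replying.
    fn ntp_offset_and_delay(t0: f64, t1: f64, t2: f64, t3: f64) -> (f64, f64) {
        let offset = ((t1 - t0) + (t2 - t3)) / 2.0; // estimated clock offset, server minus client
        let delay = (t3 - t0) - (t2 - t1);          // round-trip delay minus server processing time
        (offset, delay)
    }

    fn main() {
        // Client clock runs 0.5 s behind the server; one-way latency is 0.05 s.
        let (t0, t1, t2, t3) = (100.00, 100.55, 100.56, 100.11);
        let (offset, delay) = ntp_offset_and_delay(t0, t1, t2, t3);
        println!("offset = {offset:.3} s, round-trip delay = {delay:.3} s");
    }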

Here’s a great podcast on the topic which you will surely like!

https://signalsandthreads.com/clock-synchronization/

And a related HN thread in case you missed it:

https://news.ycombinator.com/item?id=39298652

replies(1): >>43200976 #
19. zootboy ◴[] No.43200976{4}[source]
The ultimate frustration is when you have no real ability to fix the core problem. NTP (and its 'roided-up cousin PTP) are great, but they require a degree of control and influence over the end devices that I just don't have. No amount of pleading will get a battery vendor to implement NTP in their BMS firmware, and I don't have nearly enough stacks of cash to wave around to commission a custom firmware. So I'm pretty much stuck with the "black box cat herding" technique of interoperation.
replies(1): >>43201262 #
20. szvsw ◴[] No.43201262{5}[source]
Yeah, that makes sense. We are lucky in that we get to deploy our code to the devices. It’s not really "embedded" in the sense most people use, as these are essentially sandboxed Linux devices that only run applications written in a programming language specific to these devices - similar to Lua/Python, but the scripts get turned into bytecode at boot IIRC - and they are nonetheless very powerful/fast.

You work on BMS stuff? That’s cool - a little bit outside my domain (I do energy modeling research for buildings), but I have been to some fun talks semi-recently about BMS/BAS/telemetry in buildings etc. The whole landscape seems like a real mess there.

FYI that podcast I linked has some interesting discussion about some issues with PTP over NTP - worth listening to for sure.

21. anitil ◴[] No.43201489{4}[source]
The thing that would break my brain is that the errata is a pdf that you get from .... some link, somewhere
22. motorest ◴[] No.43202413{3}[source]
I'm starting to believe this talk of fancy high-complexity solutions comes from people who desperately invent convoluted problems for themselves, just so they can say they built a fancy high-complexity solution. Instead of going with obvious, simple, reliable solutions, they opt for convoluted, high-complexity, unreliable hacks. Then, when they are confronted with the mess they created for themselves, they hide behind the complexity of their solution, as if the problem were the solution itself and not the misjudged call to adopt it.

It's so funny how all of a sudden every single company absolutely must implement Paxos. No exception. Your average senior engineer at a FAANG working with global deployments doesn't come close to even hearing about it, but these guys somehow absolutely must have Paxos. Funny.