287 points shadaj | 100 comments
1. bsnnkv ◴[] No.43196091[source]
Last month I switched from a role working on a distributed system (FAANG) to a role working on embedded software which runs on cards in data center racks.

I was in my last role for a year, and 90%+ of my time was spent investigating things that went "missing" at one of the many failure points between the many distributed components.

I wrote less than 200 lines of code that year and I experienced the highest level of burnout in my professional career.

The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it. Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

So far the culture in my new embedded (Rust, fwiw) position is the complete opposite. If you're burnt out working on distributed systems and you care about some of the same things that I do, it's worth giving embedded software dev a shot.

replies(24): >>43196122 #>>43196159 #>>43196163 #>>43196180 #>>43196239 #>>43196674 #>>43196899 #>>43196910 #>>43196931 #>>43197177 #>>43197902 #>>43198895 #>>43199169 #>>43199589 #>>43199688 #>>43199980 #>>43200186 #>>43200596 #>>43200725 #>>43200890 #>>43202090 #>>43202165 #>>43205115 #>>43208643 #
2. jasonjayr ◴[] No.43196122[source]
> Whenever I would bring up this gap I would be told that we can't spend time and wait for people to create "magic tools".

That sounds like an awful organizational ethos. 30hrs to make a "magic tool" to save 300hrs across the organization sounds like a no-brainer to anyone paying attention. It sounds like they didn't even want to invest in out-sourced "magic tools" to help either.

replies(2): >>43196181 #>>43196562 #
3. DaiPlusPlus ◴[] No.43196159[source]
> I switched from a role working on a distributed system [...] to embedded software which runs on cards in data center racks

Would you agree that, technically (or philosophically?), both roles involved distributed systems (e.g. the world-wide-web of web-servers and web-browsers exists as a single distributed system) - unless your embedded boxes weren't doing any network IO at all?

...which makes me genuinely curious exactly what your aforementioned distributed-system role was about and what aspects of distributed-computing theory were involved.

4. anonzzzies ◴[] No.43196163[source]
I really love embedded work; at least it gives you the feeling that you have control over things. Not everything is confusing and black-boxed to the point where you sometimes have to burn a goat to make it work.
replies(1): >>43196513 #
5. beoberha ◴[] No.43196180[source]
Yep - I’ve very much been living the former for almost a decade now. It is especially difficult when the components stretch across organizations. It doesn’t quite address what the author here is getting at, but it does make me believe that this new programming model will come from academia and not industry.
6. bsnnkv ◴[] No.43196181[source]
The real kicker is that it wasn't even management saying this, it was "senior" developers on the team.

I wonder if these roles tend to attract people who get the most job enjoyment and satisfaction out of the (manual) investigation aspect; it might explain some of the reluctance to adopting or creating more sophisticated observability tooling.

replies(7): >>43196541 #>>43196620 #>>43196834 #>>43197757 #>>43200114 #>>43200855 #>>43201038 #
7. EtCepeyd ◴[] No.43196239[source]
This resonates a lot with me.

Distributed systems require insanely hard math at the bottom (Paxos, Raft, gossip, vector clocks, ...). It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard.

Embedded systems sometimes require parallelizing a few hot spots, but those are more the exception AIUI, and you have a lot more control over things; everything is more local and sequential. Even data-race-free multi-threaded programming in modern C and C++ is incredibly annoying; I dislike dealing both with an explicit mesh of peers and with a leaky abstraction that pretends threads are "symmetric" (as in SMP) while in reality there's a complicated messaging network underneath.

Embedded is simpler, and day-to-day work there demands less that practitioners become advanced mathematicians.
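To be fair, the individual primitives are small even when reasoning about them at scale is not; here is a minimal vector clock as a purely illustrative Rust sketch (textbook formulation, not code from any system discussed in this thread):

    use std::collections::HashMap;

    /// Minimal vector clock: one logical counter per node ID.
    #[derive(Clone, Debug, Default, PartialEq)]
    struct VectorClock {
        counters: HashMap<String, u64>,
    }

    impl VectorClock {
        /// Local event: bump this node's own counter.
        fn tick(&mut self, node: &str) {
            *self.counters.entry(node.to_string()).or_insert(0) += 1;
        }

        /// On receive: element-wise max with the sender's clock, then tick.
        fn merge(&mut self, node: &str, other: &VectorClock) {
            for (id, &count) in &other.counters {
                let entry = self.counters.entry(id.clone()).or_insert(0);
                *entry = (*entry).max(count);
            }
            self.tick(node);
        }

        /// `self` happened-before `other`: every counter <= and the clocks differ.
        fn happened_before(&self, other: &VectorClock) -> bool {
            self.counters
                .iter()
                .all(|(id, &c)| c <= other.counters.get(id).copied().unwrap_or(0))
                && self != other
        }
    }

    fn main() {
        let mut a = VectorClock::default();
        let mut b = VectorClock::default();
        a.tick("a");      // event on node a
        b.merge("b", &a); // b receives a's message
        assert!(a.happened_before(&b) && !b.happened_before(&a));
        println!("a happened-before b: {}", a.happened_before(&b));
    }

The hard part isn't this code; it's reasoning about what the partial order does and doesn't tell you once real failures are in play.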

replies(5): >>43196342 #>>43196567 #>>43196906 #>>43197331 #>>43199711 #
8. Thaxll ◴[] No.43196342[source]
It does not require any math, because 99.9% of the time the issue is not in the low-level implementation but in the business logic the dev wrote.

No one goes to review the transaction engine of Postgres.

replies(1): >>43196485 #
9. EtCepeyd ◴[] No.43196485{3}[source]
I tend to disagree.

- You work on postgres: you have to deal with the transaction engine's internals.

- You work in enterprise application integration (EAI): you have ten legacy systems that inevitably don't all interoperate with any one specific transaction manager product. Thus, you have to build adapters, message routing and propagation, gateways, at-least-once-but-idempotent delivery, and similar stuff, yourself. SQL business logic will be part of it, but it will not solve the hard problems, and you still have to dig through multiple log files on multiple servers, hoping that you can rely on unique request IDs end-to-end (and that the timestamps across those multiple servers won't be overly contradictory).

In other words: same challenges at either end of the spectrum.
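The at-least-once-but-idempotent part at least collapses to very little code once you really do have unique request IDs end to end; a rough Rust sketch (hypothetical types, in-memory only -- a real consumer would persist the seen set in the same transaction as the side effect):

    use std::collections::HashSet;

    /// A message as it arrives from some at-least-once transport;
    /// `request_id` is assumed to be unique end-to-end.
    struct Message {
        request_id: String,
        payload: String,
    }

    /// Idempotent consumer: remembers which request IDs it has already applied,
    /// so redelivered duplicates become no-ops.
    struct Consumer {
        seen: HashSet<String>,
    }

    impl Consumer {
        fn new() -> Self {
            Consumer { seen: HashSet::new() }
        }

        fn handle(&mut self, msg: &Message) {
            // In a real system the seen set lives in durable storage and is
            // updated atomically with the side effect.
            if !self.seen.insert(msg.request_id.clone()) {
                return; // duplicate delivery, already processed
            }
            println!("applying {}: {}", msg.request_id, msg.payload);
        }
    }

    fn main() {
        let mut consumer = Consumer::new();
        let msg = Message { request_id: "req-42".into(), payload: "debit $10".into() };
        consumer.handle(&msg);
        consumer.handle(&msg); // redelivery: ignored
    }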

replies(1): >>43196946 #
10. porridgeraisin ◴[] No.43196513[source]
> where you have to burn a goat to make it work, sometimes.

Or talk to a goat, sometimes

https://modernfarmer.com/2014/05/successful-video-game-devel...

11. zelphirkalt ◴[] No.43196541{3}[source]
Senior doesn't always mean smarter or more experienced or anything really. It just all depends on the company and its culture. It can also mean "worked for longer" (which is not equal to more experienced, as you can famously have 10 times 1y experience, instead of 10y experience) and "more aligned with how management at the company acts".
replies(2): >>43196707 #>>43199260 #
12. cmrdporcupine ◴[] No.43196562[source]
Consider that there is a class of human motivation / work culture that considers "figuring it out" to be the point of the job and just accepts or embraces complexity as "that's what I'm paid to do" and gets an ego-satisfaction from it. Why admit weakness? I can read the logs by timestamp and resolve the confusions from the CAP theorem from there!

Excessive drawing of boxes and lines, and the production of systems around them becomes a kind of Glass Bead Game. "I'm paid to build abstractions and then figure out how to keep them glued together!" Likewise, recomposing events in your head from logs, or from side effects -- that's somehow the marker of being good at your job.

The same kind of motivation underlies people who eschew or disparage GUI debuggers (log statements should be good enough or you're not a real programmer), too.

Investing in observability tools means admitting that the complexity might overwhelm you.

As an older software engineer the complexity overwhelmed me a long time ago and I strongly believe in making the machines do analysis work so I don't have to. Observability is a huge part of that.

Also many people need to be shown what observability tools / frameworks can do for them, as they may not have had prior exposure.

And back to the topic of the whole thread, too: can we back up and admit that distributed systems are questionable as an end in themselves? It's a means to an end, and distributing something should be considered only as an approach when a simpler, monolithic system (that is easier to reason about) no longer suffices.

Finally I find that the original authors of systems are generally not the ones interested in building out observability hooks and tools because for them the way the system works (or doesn't work) is naturally intuitive because of their experience writing it.

13. motorest ◴[] No.43196567[source]
> Distributed systems require insanely hard math at the bottom (paxos, raft, gossip, vector clocks, ...) It's not how the human brain works natively -- we can learn abstract thinking, but it's very hard.

I think this take is misguided. Most systems nowadays, especially those involving any sort of network calls, are already distributed systems. Yet the number of systems that come even close to touching fancy consensus algorithms is very, very limited. If you are in a position to design a system and you hear "Paxos" coming out of your mouth, that's the moment you need to step back and think about what you are doing. Odds are you are creating your own problems, and then blaming the tools.

replies(2): >>43197705 #>>43199817 #
14. jbreckmckye ◴[] No.43196620{3}[source]
Why would people who are good at [scarce, valuable skill] and get paid [many bananas] to practice it want to even imagine a world where that skill is now redundant? ;-)
replies(1): >>43198474 #
15. Scramblejams ◴[] No.43196674[source]
I've often heard embedded is a nightmare of slapdashery. Any tips for finding shops that do it right?
replies(3): >>43196774 #>>43197739 #>>43198155 #
16. bongodongobob ◴[] No.43196707{4}[source]
I'd probably take 10x 1y experience. Where I'm at now, everyone has been with the company 10-40 years. They think the way they do things is the only way because they've never seen anything else. I have many stories similar to the parent. They are a decade behind in their monitoring tooling, if it even exists at all. It's so frustrating when you know there are better ways.
replies(1): >>43196888 #
17. api ◴[] No.43196774[source]
A lot of times it is, but it's not your fault. It's the fault of vendors and/or third party code you have to use.
18. Henchman21 ◴[] No.43196834{3}[source]
IME, “senior” often means “who is left after the brain-drain & layoffs are done” when you’re at a medium sized company that isn’t prominent.
replies(1): >>43199982 #
19. HPsquared ◴[] No.43196888{5}[source]
"10x1y" means someone did the same thing for 10 years with no change or personal development. The learning stopped after the first year which then repeated Groundhog Day style.
replies(1): >>43196902 #
20. alabastervlog ◴[] No.43196899[source]
I've found the rush to distributed computing when it's not strictly necessary kinda baffling. The costs in complexity are extreme. I can't imagine the median company doing this stuff is actually getting either better uptime or performance out of it—sure, it maybe recovers better if something breaks, maybe if you did everything right and regularly test that stuff (approximately nobody does though), but there's also so very much more crap that can break in the first place.

Plus: far worse performance ("but it scales smoothly" OK but your max probable scale, which I'll admit does seem high on paper if you've not done much of this stuff before, can fit on one mid-size server, you've just forgotten how powerful computers are because you've been in cloud-land too long...) and crazy-high costs for related hardware(-equivalents), resources, and services.

All because we're afraid to shell into an actual server and tail a log, I guess? I don't know what else it could be aside from some allergy to doing things the "old way"? I dunno man, seems way simpler and less likely to waste my whole day trying to figure out why, in fact, the logs I need weren't fucking collected in the first place, or got buried some damn corner of our Cloud I'll never find without writing a 20-line "log query" in some awful language I never use for anything else, in some shitty web dashboard.

Fewer, or cheaper, personnel? I've never seen cloud transitions do anything but the opposite.

It's like the whole industry went collectively insane at the same time.

[EDIT] Oh, and I forgot, for everything you gain in cloud capabilities it seems like you lose two or three things that are feasible when you're running your own servers. Simple shit that's just "add two lines to the nginx config and do an apt-install" becomes three sprints of custom work or whatever, or just doesn't happen because it'd be too expensive. I don't get why someone would give that stuff up unless they really, really had to.

[EDIT EDIT] I get that this rant is more about "the cloud" than distributed systems per se, but trying to build "cloud native" is the way that most orgs accidentally end up dealing with distributed systems in a much bigger way than they have to.

replies(10): >>43197578 #>>43197608 #>>43197740 #>>43199134 #>>43199560 #>>43201628 #>>43201737 #>>43202751 #>>43204072 #>>43225726 #
21. bongodongobob ◴[] No.43196902{6}[source]
Ah, I misunderstood.
replies(2): >>43199036 #>>43199046 #
22. AlotOfReading ◴[] No.43196906[source]
Most embedded systems are distributed systems these days, there's simply a cultural barrier that prevents most practitioners from fully grappling with that fact. A lot of systems I've worked on have benefited from copying ideas invented by distributed systems folks working on networking stuff 20 years ago.
replies(3): >>43197438 #>>43199391 #>>43199780 #
23. bob1029 ◴[] No.43196910[source]
> Whenever I would bring up this gap I would be told that we can't spend time/money and wait for people to create "magic tools".

I've never once been granted explicit permission to try a different path without being burdened by a mountain of constraints that ultimately render the effort pointless.

If you want to try a new thing, just build it. No one is going to encourage you to shoot holes through things that they hang their own egos from.

replies(1): >>43200044 #
24. im_down_w_otp ◴[] No.43196931[source]
We built a bunch of tools & technology for leveraging observability (docs.auxon.io) to do V&V, stress testing, auto root-cause analysis, etc. in clusters of embedded development (all of it built in Rust too :waves: ), since the same challenges exist for folks building vehicle platforms, lunar rovers, drones, etc. Both within a single system as well as across fleets of systems. Many embedded developers are actually distributed systems developers... they just don't think of it that way.

It's often quite a challenge to get that class of engineer to adopt things that give them visibility and data to track things down as well. Sometimes it's just a capability/experience gap, and sometimes it's over-indexing on the perceived time to get to a solution vs. the time wasted on repeated problems and yak shaving.

25. pfannkuchen ◴[] No.43196946{4}[source]
Yeah this is kind of an abstraction failure of the infrastructure. Ideally the surface visible to the user should be simple across the entire spectrum of use cases. In some very, very rare cases one necessarily has to spelunk under the facade and know something about the internals, but for some reason it seems to happen much more often in the real world. I think people often don't put enough effort into making their system model fit with the native model of the infrastructure, and instead torture the infrastructure interface (often including the "break glass" parts) to fit into their a priori system model.
26. lumost ◴[] No.43197177[source]
Anecdotally, I see a major under appreciation for just how fast and efficient modern hardware is in the distributed systems community.

I’ve seen a great many engineers become so used to provisioning compute that they forget that the same “service” can be deployed in multiple places. Or jump to building an orchestration component when a simple single process job would do the trick.

27. toast0 ◴[] No.43197331[source]
That's true, but you can do a lot of that once, and then get on with your life, if you build the right structures. I've gotten a huge amount of mileage from consensus to decide where to send reads/writes to, then everyone sends their reads/writes for the same piece of data to the same place; that place does the application logic where it's simple, and sends the result back. If you don't get the result back in time, bubble it up to the end-user application and it may retry or not, depending.

This is built upon a framework of the network is either working or the server team / ops team is paged and will be actively trying to figure it out. It doesn't work nearly as well if you work in an environment where the network is consistently slightly broken.
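A hedged sketch of that shape in Rust, with the consensus layer stubbed out as an already-agreed ownership table (everything below is made up and only illustrates the routing idea, not the actual system described above):

    use std::collections::HashMap;

    /// Pretend output of the consensus layer: an agreed-upon assignment from
    /// hash bucket to the node that owns it.
    struct OwnershipTable {
        owners: Vec<String>, // owners[bucket] = node id
    }

    impl OwnershipTable {
        fn owner_of(&self, key: &str) -> &str {
            // Toy partitioning: sum of bytes modulo bucket count.
            let bucket = key.bytes().map(|b| b as usize).sum::<usize>() % self.owners.len();
            &self.owners[bucket]
        }
    }

    /// Route every read/write for a key to that key's single owner; the owner
    /// applies the (simple) application logic locally and replies.
    fn route_write(
        table: &OwnershipTable,
        stores: &mut HashMap<String, HashMap<String, String>>,
        key: &str,
        value: &str,
    ) {
        let owner = table.owner_of(key).to_string();
        stores.entry(owner).or_default().insert(key.to_string(), value.to_string());
    }

    fn main() {
        let table = OwnershipTable { owners: vec!["node-a".into(), "node-b".into(), "node-c".into()] };
        let mut stores: HashMap<String, HashMap<String, String>> = HashMap::new();
        route_write(&table, &mut stores, "user:123", "hello");
        println!("{:?}", stores);
    }

The point is just that once ownership has been agreed on, the per-request path is ordinary single-owner logic; the consensus machinery stays out of the hot path.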

28. DanielHB ◴[] No.43197438{3}[source]
I worked in an IoT platform that consisted of 3 embedded CPUs and one linux board. The kicker was that the linux board could only talk directly to one of the chips, but had to be capable of updating the software running on all of them.

The platform could be scaled out to up to 6 units of its kind in a master-slave configuration (so the platform in physical position 1 would assume the "master role" for a total of 18 embedded chips and 6 linux boards), on top of having optionally one more box with one more CPU in it for managing some other stuff and integrating with each of our clients' hardware. Each client had a different integration, but at least they mostly integrated with us, not the other way around.

Yeah it was MUCH more complex than your average cloud. Of course the original designers didn't even bother to make a common network protocol for the messages, so each point of communication not only used a different binary format, they also used different wire formats (CAN bus, Modbus and ethernet).

But at least you didn't need to know kubernetes, just a bunch of custom stuff that wasn't well documented. Oh yeah and don't forget the boot loaders for each embedded CPU, we had to update the bootloaders so many times...

The only saving grace is that a lot of the system could rely on the literal physical security because you need to have physical access (and a crane) to reach most of the system. Pretty much only the linux boards had to have high security standards and that was not that complicated to lock down (besides maintaining a custom yocto distribution that is).

replies(1): >>43197538 #
29. AlotOfReading ◴[] No.43197538{4}[source]
Many automotive systems have >100 processors scattered around the vehicle, maybe a dozen of which are "important". I'm amazed they ever work given the quality of the code running on them.
replies(1): >>43197571 #
30. DanielHB ◴[] No.43197571{5}[source]
A LOT of QA
31. throwawaymaths ◴[] No.43197578[source]
the minute you have a client (a browser, e.g.) and a server you're doing a distributed system and you should be thinking a little bit about edge cases like loss of connection, incomplete tx. a lot of the go-to protocols (tcp, http, even stuff like s3) are built with the complexities of distributed systems in mind so for most basic cases, a little thought goes a long way. but you get weird shit happening all the time (that may be tolerable) if you don't put any effort into it.
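a minimal example of the kind of "little thought" that goes a long way -- retrying a flaky call with backoff (Rust sketch; it assumes the operation is idempotent, otherwise blind retries make things worse):

    use std::thread;
    use std::time::Duration;

    /// Retry a fallible operation a few times with simple exponential backoff.
    fn retry<T, E>(mut attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
        let mut delay = Duration::from_millis(100);
        loop {
            match op() {
                Ok(v) => return Ok(v),
                Err(e) if attempts <= 1 => return Err(e),
                Err(_) => {
                    attempts -= 1;
                    thread::sleep(delay);
                    delay *= 2; // back off before the next try
                }
            }
        }
    }

    fn main() {
        let mut calls = 0;
        // Simulated flaky call: fails twice, then succeeds.
        let result = retry(5, || {
            calls += 1;
            if calls < 3 { Err("connection reset") } else { Ok("response body") }
        });
        println!("{:?} after {} calls", result, calls);
    }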
32. jimbokun ◴[] No.43197608[source]
Distributed or not is a binary property. If you can run on one large server, great, just write everything in non-distributed fashion.

But once you need that second server, everything about your application needs to work in distributed fashion.

replies(2): >>43198610 #>>43228522 #
33. convolvatron ◴[] No.43197705{3}[source]
this is completely backwards. the tools may have some internal consistency guarantees, handle some classes of failures, etc. They are leaky abstractions that are partially correct. They were not collectively designed to handle all failures and provide consistent views no matter how they are composed.

From the other direction, Paxos, two generals, serializability, etc. are not hard concepts at all. Implementing custom solutions in this space _is_ hard and prone to error, but the foundations are simple and sound.

You seem to be claiming that you shouldn't need to understand the latter, that the former gives you everything you need. I would say that if you build systems using existing tools without even thinking about the latter, you're just signing up to handle preventable errors manually and to treat this box that you own as black and inscrutable.

34. DanielHB ◴[] No.43197739[source]
There is inherent complexity and self-inflicted complexity; they tend to go hand in hand, but self-inflicted complexity can be exacerbated in bad projects. A lot of embedded software is just inherently complex, cars for example.
35. dekhn ◴[] No.43197740[source]
I am always happy when I can take a system that is based on distributed computing, and convert it to a stateless single machine job that runs just as quickly but does not have the complexity associated with distributed computing.

Reccently I was going to do a fairly big download of a dataset (45T) and when I first looked at it, figured I could shard the file list and run a bunch of parallel loaders on our cluster.

Instead, I made a VM with 120TB storage (using AWS with FSX) and ran a single instance of git clone for several days (unattended; just periodically checking in to make sure that git was still running). The storage was more than 2X the dataset size because git LFS requires 2X disk space. A single multithreaded git process was able to download at 350MB/sec and it finished at the predicted time (about 3 days). Then I used 'aws sync' to copy the data back to s3, writing at over 1GB/sec. When I copied the data between two buckets, the rate was 3GB/sec.

That said, there are things we simply can't do without distributed computing because there are strong limits on how many CPUs and local storage can be connected to a single memory address space.

replies(1): >>43198773 #
36. the_sleaze_ ◴[] No.43197757{3}[source]
_To play devil's advocate_: It could've sounded like the "new guy" came in and decided he needed to rewrite everything; bring in new xyz; steer the ship. New guy could even have been stepping directly on the toes of those senior developers who had fought and won wars to get where they are now.

In my -very- humble opinion, you should wait at least a year before making big swinging changes or recommendations, most importantly in any big company.

replies(1): >>43198942 #
37. ithkuil ◴[] No.43197902[source]
10 years ago I went on a similar journey. I left faang to work on a startup working on embedded firmware for esp8266. The lack of tooling was very frustrating. I ended up writing a gdb stub (before espressif released one) and a malloc debugger (via serial port) just to manage to get shit done.
38. AlotOfReading ◴[] No.43198155[source]
It's not foolproof, but I've found there's a strong correlation between product margin and the sanity of the dev experience.
39. filoleg ◴[] No.43198474{4}[source]
The real skill is “problem-solving”, not “doing lots of specific manual steps that could be automated and made easier.”

Unfortunately, some people confuse the two and believe they are paid to do the latter, not the former, simply because others look at those steps and go “wtf, we could make that hell more pleasant and easier to deal with”.

In the same vein, "creating perceived job security for yourself by being willing to continuously deal with stupid bs that others rightfully aren't interested in wasting time on."

Sadly, you are ultimately right though, as misguided self-interest often tends to win over well-meant proposals.

replies(1): >>43201171 #
40. th0ma5 ◴[] No.43198610{3}[source]
I wish I could upvote you again. The complexity balloons when you try to adapt something that wasn't distributed, and often things can be way simpler and more robust if you start with a distributed concept.
replies(1): >>43207352 #
41. achierius ◴[] No.43198773{3}[source]
My wheelhouse is lower on the stack, so I'm curious as to what you mean by "stateless single machine job" -- do you just mean that it runs from start to end, without options for suspension/migration/resumption/etc.?
replies(1): >>43199143 #
42. bryanlarsen ◴[] No.43198895[source]
I think you were unlucky in your distributed system job and lucky in your embedded job. Embedded is filled with crappy 3rd party and in-house tooling, far more so than distributed, in my experience. That crappiness perhaps leads to a higher likelihood to spend time on them, but it doesn't have to.

Embedded does give you a greater feeling of control. When things aren't working, it's much more likely to be your own fault.

43. jiggawatts ◴[] No.43198942{4}[source]
In my less humble opinion: the only honest and objective review you’ll get about a system is from a new hire for about a month. Measure the “what the fucks per hour” as a barometer of how bad your org is and how deep a hole it has dug itself into.

After that honeymoon period, all but the most autistic people will learn the organisational politics, keep their head down, and “play the game” to be assigned trivial menial tasks in some unimportant corner of the system. At that point, only after two beers will they give their closest colleagues their true opinion.

I’ve seen this play out over and over, organisation after organisation.

The corollary is that you yourself are not immune to this effect and will grow accustomed to almost any amount of insanity. You too will find yourself saying sentences like “oh, it always has been like this” and “don’t try to change that” or “that’s the responsibility of another team” even though you know full well they’re barely even aware of what that thing is, let alone maintaining it in a responsible fashion.

PS: This is my purpose in a nutshell as a consultant. I turn up and provide my unvarnished opinion, without being even aware of what I’m “not supposed to say” because “it upsets that psychotic manager”. I’ll be gone before I have any personal political consequences, but the report document will remain, pointing the finger at people that would normally try to bite it off.

replies(2): >>43199994 #>>43207753 #
44. lazystar ◴[] No.43199036{7}[source]
another term for the phenomenon is the "expert beginner" trap.
45. Nevermark ◴[] No.43199046{7}[source]
I see a dual. Between 10x1 workers and 1x10 workers working at 10x1 companies.

Either way, doing the same kinds of things, the same kind of ways, more than a few times, is an automation/tool/practice improvement opportunity lost.

I have yet to complete a single project I couldn't do much better, differently, if I were to do something similar again. Not everything is highly creative, but software is such a complex balancing act/value terrain. Every project should deliver some new wisdom, however modest.

46. whstl ◴[] No.43199134[source]
I share your opinions, and really enjoyed your rant.

But it's funny. The transition to distributed/cloud feels like the rush to OOP early in my career. All of a sudden there were certain developers who would claim it was impossible to ship features in procedural codebases, and then proceed to make a fucking mess out of everything using classes, completely misunderstanding what they were selling.

It is also not unlike what Web-MVC felt like in the mid-2000s. Suddenly everything that came before was considered complete trash by some people that started appearing around me. Then the same people disparaging the old ways started building super rigid CRUD apps with mountains of boilerplate.

(Probably the only thing I was immediately on board with was the transition from desktop to web, because it actually solved more problems than it created. IMO, IME and YMMV)

Later we also had React and Docker.

I'm not salty or anything: I also tried and became proficient in all of those things. Including microservices and the cloud. But it was more out of market pressure than out of personal preference. And like you said, it has a place when it's strictly necessary.

But now I finally do mostly procedural programming, in Go, in single servers.

replies(1): >>43199926 #
47. dekhn ◴[] No.43199143{4}[source]
it's a pretty generic term but in my mind I was thinking of a job that ran on a machine with remote attached storage (EBS, S3, etc); the state I meant was local storage.
48. bagels ◴[] No.43199169[source]
Which company? Doesn't sound like the infra org I was in at a FAANG
49. fuzztester ◴[] No.43199260{4}[source]
I have heard it as 20 versus 1, but it is the same thing.

also called by some other names, including NIH syndrome, protecting your turf, we do it this way around here, our culture, etc.

50. zootboy ◴[] No.43199391{3}[source]
Indeed. I've been building systems that orchestrate batteries and power sources. Turns out, it's a difficult problem to temporally align data points produced by separate components that don't share any sort of common clock source. Just take the latest power supply current reading and subtract the latest battery current reading to get load current? Oops, they don't line up, and now you get bizarre values (like negative load power) when there's a fast load transient.

Even more fun when multiple devices share a single communication bus, so you're basically guaranteed to not get temporally-aligned readings from all of the devices.
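The usual band-aid is to interpolate one signal onto the other's timestamps before doing the arithmetic; a Rust sketch (made-up types, and it assumes readings are timestamped on arrival and that linear interpolation over one sampling interval is acceptable -- which a fast transient can still defeat):

    /// A reading as timestamped when it arrived, in milliseconds on the local clock.
    #[derive(Clone, Copy, Debug)]
    struct Sample {
        t_ms: u64,
        amps: f64,
    }

    /// Linearly interpolate a signal to an arbitrary timestamp between two samples.
    fn interpolate(before: Sample, after: Sample, t_ms: u64) -> f64 {
        if after.t_ms == before.t_ms {
            return after.amps;
        }
        let frac = (t_ms - before.t_ms) as f64 / (after.t_ms - before.t_ms) as f64;
        before.amps + frac * (after.amps - before.amps)
    }

    fn main() {
        // Power-supply current sampled at t=100 and t=200, battery current at t=150.
        let psu = [Sample { t_ms: 100, amps: 10.0 }, Sample { t_ms: 200, amps: 14.0 }];
        let battery_at_150 = Sample { t_ms: 150, amps: 9.0 };

        // Estimate PSU current at the battery sample's timestamp before subtracting,
        // instead of pairing the two most recent (misaligned) readings.
        let psu_at_150 = interpolate(psu[0], psu[1], battery_at_150.t_ms);
        let load_amps = psu_at_150 - battery_at_150.amps;
        println!("estimated load at t=150ms: {:.1} A", load_amps);
    }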

replies(1): >>43200688 #
51. FpUser ◴[] No.43199560[source]
This is part of what I do for a living: C++ backend software running on real hardware, which is currently insanely powerful. There is of course a spare standby in case things go south. Works like a charm, and I have yet to have a client that came anywhere close to overloading the server.

I understand that it can not deal with FAANG scale problems, but those are relevant only to a small subset of businesses.

replies(1): >>43200244 #
52. yolovoe ◴[] No.43199589[source]
Is the “card” work EC2 Nitro by any chance? Sounds similar to what I used to do
53. englishspot ◴[] No.43199688[source]
curious as to how you made that transition. seems like that'd be tough in today's job market.
54. PaulDavisThe1st ◴[] No.43199711[source]
> Even data race free multi-threaded programming in modern C and C++ is incredibly annoying; I dislike dealing with both an explicit mesh of peers, and with a leaky abstraction that lies that threads are "symmetric" (as in SMP) while in reality there's a complicated messaging network underneath.

If you're using traditional (p)threads-derived APIs to get work done on a message passing system, I'd say you're using the wrong API.

More likely, I don't understand what you might mean here.

replies(1): >>43199965 #
55. anitil ◴[] No.43199780{3}[source]
Yes, even 'simple' devices these days will have peripherals (ADC/SPI etc.) running in parallel, often using DMA, multiple semi-independent clocks, possibly nested interrupts etc. Oh, and the UART for some reason always, always has bugs, so hopefully you're using multiple levels of error checking.
replies(1): >>43200654 #
56. yodsanklai ◴[] No.43199817{3}[source]
I remember when I prepared for system design interviews in FAANG, I was anxious I would get asked about Paxos (which I learned at school). Now that I'm working there, never heard about Paxos or fancy distributed algorithms. We rely on various high-level services for deployment, partitioning, monitoring, logging, service discovery, storage...

And Paxos doesn't require much maths. It's pretty tricky to consider all possible interleavings, but in terms of maths, it's really basic discrete maths.

replies(1): >>43202413 #
57. sakesun ◴[] No.43199926{3}[source]
Your comment inspire me to brush up my Delphi skill.
58. EtCepeyd ◴[] No.43199965{3}[source]
Sorry, I figure I ended up spewing a bit of gibberish.

- By "explicit mesh of peers", I referred to atomics, and the modern (C11 and later) memory model. The memory model, for example as written up in the C11 and later standards, is impenetrable. While the atomics interfaces do resemble a messaging passing system between threads, and therefore seem to match the underlying hardware closely, they are discomforting because their foundation, the memory model, is in fact laid out in the PhD dissertation of Mark John Batty, "The C11 and C++11 Concurrency Model" -- 400+ pages! <https://www.cl.cam.ac.uk/~pes20/papers/topic.c11.group_abstr...>

- By "leaky abstraction", I mean the stronger posix threads / standard C threads interfaces. They are more intuitive and safer, but are more distant from the hardware, so people sometimes frown at them for being expensive.

59. alfiedotwtf ◴[] No.43199980[source]
I have talked to many people in the Embedded space doing Rust, and every single one of them had the biggest grin while talking about work. Sounds like you’ll have fun :)
60. getnormality ◴[] No.43199982{4}[source]
I feel so seen.
61. disqard ◴[] No.43199994{5}[source]
This rings true in my experience across different orgs, teams, in the tech industry.

FWIW, academia has off-the-charts levels of "wtf" that newcomers will point out, though it's even more ossified than corporate culture, and they don't hire consultants to come in and fix things :)

replies(1): >>43201140 #
62. DrFalkyn ◴[] No.43200044[source]
Hope you can justify that during sprint planning / standup
replies(1): >>43200280 #
63. Jach ◴[] No.43200114{3}[source]
There's also immense resistance to figuring out how to code something if an approach isn't at once obvious. Hence "magic". Sometimes a "spike doc" can convince people. My favorite second-hand instance of this was a MS employee insisting that a fast rendering terminal emulator was so hard as to require "an entire doctoral research project in performant terminal emulation".
64. intelVISA ◴[] No.43200186[source]
Distributed systems always ends up a dumping ground of failed tech solutions to deep org dysfunction.

Weak tech leadership? Let's "fix" that with some microservices.

Now it's FUBAR? Conceal it with some cloud native horrors, sacrifice a revolving door of 'smart' disempowered engineers to keep the theater going til you can jump to the next target.

Funny, because dis sys has been pretty much solved since Lamport, 40+ years ago.

replies(2): >>43200288 #>>43200321 #
65. intelVISA ◴[] No.43200244{3}[source]
The highly profitable, self-inflicted problem of using 200 QPS Python frameworks everywhere.
66. bob1029 ◴[] No.43200280{3}[source]
If you are going to just build it in the absence of explicit buy-in, you certainly shouldn't spend time on the standup talking about it. Wait until your idea is completely formed and then drop a 5 minute demo on the team.

It can be challenging to push through to a completed demo without someone cheering you on every morning. I find this to be helpful more than hurtful if we are interested in the greater good. If you want to go against the grain (everyone else on the team), then you need to be really sure before you start wasting everyone else's time. Prove it to yourself first.

67. rbjorklin ◴[] No.43200288[source]
Would you mind sharing some more specific information/references to Lamport’s work?
replies(2): >>43200329 #>>43200392 #
68. whstl ◴[] No.43200321[source]
I suffered through this in two companies and man, it isn't easy.

First one was a multi-billion-Unicorn had everything converted to microservices, with everything customized in Kubernetes. One day I even had to fix a few bugs in the service mesh because the guy who wrote it left and I was the only person not fighting fires able to write the language it was in. I left right after the backend-of-the-frontend failed to sustain traffic during a month where they literally had zero customers (Corona).

At the second one there was a mandate to rewrite everything to microservices and it took another team 5 months to migrate a single 100-line class I wrote into a microservice. It just wasn't meant to be. Then the only guy who knows how the infrastructure works got burnout after being yelled at too many times and then got demoted, and last I heard is at home with depression.

Weak leadership doesn't even begin to describe it, especially the second.

But remembering it is a nice reminder that a job is just a means of getting a payment.

69. vitus ◴[] No.43200329{3}[source]
The three big papers: clocks [0], Paxos [1], Byzantine generals [2].

[0] https://lamport.azurewebsites.net/pubs/time-clocks.pdf

[1] https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf

[2] https://lamport.azurewebsites.net/pubs/byz.pdf

Or, if you prefer wiki articles:

https://en.wikipedia.org/wiki/Lamport_timestamp

https://en.wikipedia.org/wiki/Paxos_(computer_science)

https://en.wikipedia.org/wiki/Byzantine_fault

I don't know that I would call it "solved", but he certainly contributed a huge amount to the field.

70. madhadron ◴[] No.43200392{3}[source]
Lamport's website has his collected works. The paper to start with is "Time, clocks, and the ordering of events in a distributed system." Read it closely all the way to the end. Everyone seems to miss the last couple sections for some reason.
71. sly010 ◴[] No.43200596[source]
I don't disagree, but funny that I recently made a point to someone that modern consumer embedded systems (with multiple MCUs connected with buses and sometimes shared memory) are basically small distributed systems, because partial restarts are common and the start/restart order of the MCUs is not very well defined. At least in the space I am working in. (Needless to say we use C, not rust)
72. zootboy ◴[] No.43200654{4}[source]
Yeah, it was a "fun" surprise to discover the errata sheet for the microcontroller I was working with after beating my head against the wall trying to figure out why it doesn't do what the reference manual says it should do. It's especially "fun" when the errata is "The hardware flow control doesn't work. Like, at all. Just don't even try."
replies(1): >>43201489 #
73. szvsw ◴[] No.43200688{4}[source]
I run a small SaaS side hustle where the core value proposition of the product - at least what got us our first customers, even if they did not realize what was happening under the hood - is, essentially, an implementation of NTP running over HTTPS that can be run on some odd devices and sync those devices to mobile phones via a front end app and backend server. There’s some other CMS stuff that makes it easy for the various customers to serve their content to their customers’ devices, but at the end of the day our core trade secret is just using a roll-your-own NTP implementation… I love how NTP is just the tip of the iceberg when it comes to the wicked problem of aligning clocks. This is all just to say - I feel your pain, but also not really since it sounds like you are dealing with higher precision and greater challenges than I ever had to!
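For the curious, the core of an NTP-style exchange is just four timestamps and two subtractions; a Rust sketch of the arithmetic (illustrative only, not our actual implementation):

    /// Classic NTP offset/delay estimate from one request/response exchange.
    /// t1: client send, t2: server receive, t3: server send, t4: client receive,
    /// all in seconds on each side's local clock.
    fn ntp_estimate(t1: f64, t2: f64, t3: f64, t4: f64) -> (f64, f64) {
        let offset = ((t2 - t1) + (t3 - t4)) / 2.0; // how far the client clock is behind the server
        let delay = (t4 - t1) - (t3 - t2);          // round trip excluding server processing
        (offset, delay)
    }

    fn main() {
        // Client clock ~0.5s behind the server, ~40ms round trip, 10ms server processing.
        let (offset, delay) = ntp_estimate(100.000, 100.520, 100.530, 100.050);
        println!("offset = {:.3}s, delay = {:.3}s", offset, delay);
    }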

Here’s a great podcast on the topic which you will surely like!

https://signalsandthreads.com/clock-synchronization/

And a related HN thread in case you missed it:

https://news.ycombinator.com/item?id=39298652

replies(1): >>43200976 #
74. fons ◴[] No.43200725[source]
Would you mind disclosing your current employer? I am also interested in moving to an embedded systems role.
75. scottlamb ◴[] No.43200855{3}[source]
> I wonder if these roles tend to attract people who get the most job enjoyment and satisfaction out of the (manual) investigation aspect; it might explain some of the reluctance to adopting or creating more sophisticated observability tooling.

That's weird. I love debugging, and so I'm always trying to learn new ways to do it better. I mean, how can it be any other way? How can someone love something and be that committed to sucking at it?

76. fra ◴[] No.43200890[source]
As someone who builds observability tools for embedded software, I am flabbergasted that you're finding a more tools-friendly culture in embedded than in distributed systems!

Most hardware companies have zero observability, and haven't yet seen the light ("our code doesn't really have bugs" is a quote I hear multiple times a week!).

replies(1): >>43201003 #
77. zootboy ◴[] No.43200976{5}[source]
The ultimate frustration is when you have no real ability to fix the core problem. NTP (and its 'roided-up cousin PTP) are great, but they require a degree of control and influence over the end devices that I just don't have. No amount of pleading will get a battery vendor to implement NTP in their BMS firmware, and I don't have nearly enough stacks of cash to wave around to commission a custom firmware. So I'm pretty much stuck with the "black box cat herding" technique of interoperation.
replies(1): >>43201262 #
78. whstl ◴[] No.43201003[source]
It's probably a "grass is greener" situation.

My experience with mid-size to enterprise is having lots of observability and observability-adjacent tools purchased but not properly configured. Or the completely wrong tools for the job being used.

A few I've seen recently: Grafana running on local Docker of developers because of lack of permissions in the production version (the cherry on top: the CTO himself installed this on the PMs computers), Prometheus integration implemented by dev team but env variables still missing after a couple years, several thousand a month being paid to Datadog but nothing being done with the data nor with the dog.

On startups it's surprisingly different, IME. But as soon as you "elect" a group to be administrator of a certain tool or some resource needed by those tools, you're doomed.

79. whstl ◴[] No.43201038{3}[source]
I saw a case like this recently, and the fact is that the team responsible was completely burned out and was doing anything to avoid being given more work, but they also didn't trust anyone else to do it.

One of the engineers just quit on the spot for a better paid position, the other was demoted and is currently under heavy depression last I heard from him.

80. fc417fc802 ◴[] No.43201140{6}[source]
Not sure which specific field you have in mind there but many parts of academia also have off the charts levels of, as GP put it, "the most autistic people". Outside of the university bureaucracy (which is its own separate thing) nearly all of the "wtf" that I encountered there had good reasons behind it. Often simply "we don't have the cash" but also frequently things that seemed weird or wrong at first glance but were actually better given the goals in that specific case.

Interfacing with IT, who thought they knew the "right" way to do everything but in reality had little to no understanding of our constraints, was always interesting.

81. fc417fc802 ◴[] No.43201171{5}[source]
If the goal is ensuring a future stream of bananas then can you really say the behavior is misguided?
82. szvsw ◴[] No.43201262{6}[source]
Yeah, that makes sense. We are lucky in that we get to deploy our code to the devices. It's not really "embedded" in the sense most people use, as these are essentially sandboxed Linux devices that only run applications written in a device-specific programming language similar to Lua/Python; the scripts get turned into bytecode at boot IIRC, but it's nonetheless very powerful/fast.

You work on BMS stuff? That’s cool- a little bit outside my domain (I do energy modeling research for buildings) but have been to some fun talks semi-recently about BMs/BAS/telemetry in buildings etc. The whole landscape seems like a real mess there.

FYI that podcast I linked has some interesting discussion about some issues with PTP over NTP- worth listening to for sure.

83. anitil ◴[] No.43201489{5}[source]
The thing that would break my brain is that the errata is a pdf that you get from .... some link, somewhere
84. tayo42 ◴[] No.43201628[source]
This rant misses two things that people always miss

On distributed: QPS scaling isn't the only reason, and I suspect it's rarely the reason. It's mostly driven by availability needs.

It's also driven by organizational structure and teams. Two teams don't need to be fighting over the same server to deploy their code. So it gets broken out into services with clear API boundaries.

And ssh to servers might be fine for you. But systems and access are designed to protect the bottom tier of employees that will mess things up when they tweak things manually. And tweaking things by hand isn't reproducible when they break.

replies(2): >>43201784 #>>43202839 #
85. Karrot_Kream ◴[] No.43201784{3}[source]
Horizontal scaling is also a huge cost savings. If you can run your application with a tiny VM most of the time and scale it up when things get hot, then you save money. If you know your service is used during business hours you can provision extra capacity during business hours and release that capacity during off hours.
86. lelanthran ◴[] No.43202090[source]
I spent the majority of my career as an embedded dev. There are ... different ... challenges, and I'm not so sure that I would want to go back to it.

It pays poorly, the tooling more often than not sucks (more than once I've had to do some sort of stub for an out-of-date gcc), observability is non-existent unless you're looking at a device on your desk, in which case your observability tool is an oscilloscope (or bus pirate type of device, if you're lucky in having the lower layers completely free of bugs).

The datasheets/application notes are almost always incomplete, with errata (in a different document) telling you "Yeah, that application note is wrong, don't do that".

The required math background can be strict as well: RF, analog ... basically anything interesting you want to do requires a solid grounding in undergrad maths.

I went independent about 2 years ago. You know what pays better and has less work? Line of business applications. I've delivered maybe two handfuls of LoB applications but only one embedded system, and my experience with doing that as a contractor is that I won't take an embedded contract anymore unless it's a client I've already done work for, or if the client is willing to pay 75% upfront, and they agree to a special hourly rate that takes into account my need for maintaining all my own equipment.

87. junon ◴[] No.43202165[source]
Can concur, I also switched mostly to firmware and have enjoyed it much more. Though Rust firmware jobs are hard to come by.
88. motorest ◴[] No.43202413{4}[source]
I'm starting to believe these talks of fancy high-complexity solutions come from people who desperately try to come up with convoluted problems they create for themselves only to be able to say they did a fancy high-complexity solution. Instead of going with obvious simple reliable solutions, they opt for convoluted high-complexity unreliable hacks. Then, when they are confronted by the mess they created for themselves, they hide behind the high-complexity of their solution, as if the problem was the solution itself and not making the misjudged call to adopt it.

It's so funny how all of a sudden every single company absolutely must implement Paxos. No exception. Your average senior engineer at a FANG working with global deployments doesn't come close to even hearing about it, but these guys somehow absolutely must have Paxos. Funny.

89. motorest ◴[] No.43202751[source]
> I've found the rush to distributed computing when it's not strictly necessary kinda baffling.

I'm not entirely sure you understand the problem domain, or even the high-level problem. There is not, nor ever was, a "rush" to distributed computing.

What you actually have is this global epiphany that having multiple computers communicating over a network to do something actually has a name, and it's called distributed computing.

This means that we had (and still have) guys like you who look at distributed systems and somehow do not understand they are looking at distributed systems. They don't understand that mundane things like a mobile app supporting authentication, or someone opening a webpage or email, are distributed systems. They don't understand that the discussion on monolith vs microservices is orthogonal to the topic of distributed systems.

So the people railing against distributed systems are essentially complaining about their own ignorance and failure to actually understand the high-level problem.

You have two options: acknowledge that, unless you're writing a desktop app that does nothing over a network, odds are every single application you touch is a node in a distributed system, or keep fooling yourself into believing it isn't. I mean, if a webpage fails to load then you just hit F5, right? And if your app just fails to fetch something from a service you just restart it, right? That can't possibly be a distributed system, and those scenarios can't possibly be mitigated by basic distributed computing strategies, right?

Everything is simple to those who do not understand the problem, and those who do are just making things up.

replies(1): >>43206886 #
90. ◴[] No.43202839{3}[source]
91. ahartmetz ◴[] No.43204072[source]
>It's like the whole industry went collectively insane at the same time.

Welcome to computing.

- OOP will solve all of our problems

- P2P will solve all of our problems

- XML will solve all of our problems

- SOAP will solve all of our problems

- VMs will solve all of our problems

- Ruby on Rails and by extension dynamically typed languages will solve all of our problems

- Docker [etc...]

- Functional programming

- node.js

- Cloud

- Kubernetes

- Statically typed languages

- "Serverless"

- Rust?

- AI

Some have more merit (IMO notably FP, static typing and Rust), some less (notably XML and SOAP)...

92. fatnoah ◴[] No.43205115[source]
> The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it.

One of the most significant "triumphs" of my technical career came at a startup where I started as a Principal Engineer and left as the VP Engineering. When I started, we had nightly outages requiring Engineering on-call, and by the time I left, no one could remember a recent issue that required Engineers to wake up.

It was a ton of work and required a strong investment in quality & resilience, but even bigger impact was from observability. We couldn't afford APM, so we took a very deliberate approach to what we logged and how, and stuffed it into an ELK stack for reporting. The immediate benefit was a drastic reduction in time to diagnose issues, and effectively let our small operations team triage issues and easily identify app vs. infra issues almost immediately. Additionally, it was much easier to identify and mitigate fragility in our code and infra.

The net result was an increase in availability from 98.5% to 99.995%, and I think observability contributed to at least half of that.
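For anyone wondering what "deliberate" meant in practice: nothing exotic, just one structured line per request with the handful of fields you actually query on. A Rust sketch with made-up field names (not our actual schema, and a real service would use a logging library rather than hand-rolled JSON):

    use std::time::{SystemTime, UNIX_EPOCH};

    /// Emit one structured, machine-parseable log line per request so that an
    /// ELK-style stack can slice by request id, route, status and latency.
    fn log_request(request_id: &str, route: &str, status: u16, duration_ms: u128) {
        let ts = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis();
        println!(
            "{{\"ts\":{ts},\"request_id\":\"{request_id}\",\"route\":\"{route}\",\"status\":{status},\"duration_ms\":{duration_ms}}}"
        );
    }

    fn main() {
        log_request("req-42", "/api/orders", 200, 37);
    }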

93. lucyjojo ◴[] No.43206886{3}[source]
you and the guy you are answering are not talking the same language (technically yes, but you are putting different meanings to the same words).

this would lead to a pointless conversation, if it were to ever happen.

replies(1): >>43217147 #
94. CogitoCogito ◴[] No.43207352{4}[source]
I couldn't disagree more. My principle is to write systems extremely simply and then distribute portions of them as it becomes necessary. Almost always it never becomes necessary, and in the rare cases it does, it is entirely straightforward to do so unless you have an over-complicated design. I don't think I've ever seen it done well when done in the opposite direction. It's always cost more in time and effort and resulted in something worse.
replies(1): >>43216746 #
95. the_sleaze_ ◴[] No.43207753{5}[source]
This is quite a defensive posture. In my current role I've been able to see an incredible raft of insanity, not be obtuse or arrogant enough to dismiss solutions or the intelligence of those who made them, but literally make a communal list of refactor candidates. Then slowly but surely wrangle people and political capital to my side to eventually change them. Years later we still have cruft leftover but there are many many projects, some multi-year, which are now complete.

I also see a single-minded attachment to specific technical implementations, where a more mature view would be to see tech as a business and us less as artisans than as blue-collar workers.

> steve jobs on "You're right, but it doesn't matter" https://www.youtube.com/watch?v=oeqPrUmVz-o

replies(1): >>43212208 #
96. literallyroy ◴[] No.43208643[source]
How did you make that transition/find a position? Were you already using Rust in a previous role?
97. jiggawatts ◴[] No.43212208{6}[source]
Your attitude is commendable! It’s what a true leader should do, and you deserve to be promoted for it.

My comment was a statistical observation of what typically happens in ordinary organisations without a strong-willed, technically capable leader at the helm.

Disclaimer: Also, I have a biased view, because as a consultant I will generally only turn up if there is something already wrong with an organisation that insiders are unable to fix.

98. th0ma5 ◴[] No.43216746{5}[source]
Tons of vendors offer cloud first, distributed deployments. Erlang is distributed by default. Spark is distributed by default. Most databases are distributed by default.
replies(1): >>43228529 #
99. motorest ◴[] No.43217147{4}[source]
> you and the guy you are answering too are not talking the same language (technically yes but you are putting different meanings to the same words).

That's the point, isn't it? It's simply wrong to assert that there's a rush to distributed systems when they are already ubiquitous in the real world, even if this comes as a surprise to people like OP. Get acquainted with the definition of distributed computing, and look at reality.

The only epiphany taking place is people looking at distributed systems and thinking that, yes, perhaps they should be treated as distributed systems. Perhaps the interfaces between multiple microservices are points of failure, but replacing them with a monolith does not make it less of a distributed system. Worse, taking down your monolith is also a failure mode, one with higher severity. How do you mitigate that failure mode? Well, educate yourself about distributed computing.

If you look at a distributed system and call it something other than distributed system, are you really speaking a different language, or are you simply misguided?

100. icedchai ◴[] No.43225726[source]
I've seen this as well. A relatively simple application becomes a mess of terraform configuration for CloudFront, Lambda, API Gateway, S3, RDS and a half dozen other lesser services because someone had an obsession with "serverless." And performance is worse. And there's as much Terraform as there is actually application code.