Distributed systems programming has stalled

Last month I switched from a role working on a distributed system at a FAANG company to a role working on embedded software that runs on cards in data center racks.

I was in my last role for a year, and 90%+ of my time was spent investigating things that went "missing" at one of the many failure points between the distributed components.

I wrote fewer than 200 lines of code that year, and I experienced the highest level of burnout of my professional career.

The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it. Whenever I brought up this gap, I was told that we couldn't spend the time or money, or wait for people to create "magic tools".

So far the culture in my new embedded (Rust, fwiw) position is the complete opposite. If you're burnt out working on distributed systems and you care about some of the same things that I do, it's worth giving embedded software dev a shot.

> The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it.

One of the most significant "triumphs" of my technical career came at a startup where I started as a Principal Engineer and left as the VP Engineering. When I started, we had nightly outages requiring Engineering on-call, and by the time I left, no one could remember a recent issue that required Engineers to wake up.

It was a ton of work and required a strong investment in quality & resilience, but an even bigger impact came from observability. We couldn't afford APM, so we took a very deliberate approach to what we logged and how, and stuffed it into an ELK stack for reporting. The immediate benefit was a drastic reduction in time to diagnose issues: our small operations team could triage problems and tell app issues from infra issues almost immediately. It also became much easier to identify and mitigate fragility in our code and infra.
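For readers wondering what "deliberate" logging might look like in practice, here is a minimal sketch using Python's standard `logging` module. It emits one JSON object per line, which is a shape that log shippers feeding an ELK stack can typically ingest. The field names (`service`, `request_id`, `layer`) and the `make_logger` helper are hypothetical, chosen for illustration; the comment above does not say which language or fields that team actually used.

```python
import io
import json
import logging
import sys

# Hypothetical context fields; a real team would pick its own.
CONTEXT_FIELDS = ("service", "request_id", "layer")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def make_logger(name, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("checkout", sys.stdout)
    # Tagging the layer up front is one way to make app vs. infra triage quick.
    log.error(
        "upstream timeout",
        extra={"service": "checkout", "request_id": "req-123", "layer": "infra"},
    )
```

With a consistent `layer` field on every event, an operations team could filter a dashboard on `layer: infra` versus `layer: app` instead of reading stack traces, which is one plausible route to the faster triage described above.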

The net result was an increase in availability from 98.5% to 99.995%, and I think observability contributed to at least half of that.
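To put those two percentages in perspective, here is a quick back-of-the-envelope calculation. It assumes availability is measured over a 365-day year, which the comment does not state.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def downtime_hours_per_year(availability_pct):
    """Convert an availability percentage into expected downtime hours per year."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


before = downtime_hours_per_year(98.5)    # roughly 131.4 hours per year
after = downtime_hours_per_year(99.995)   # roughly 0.44 hours, about 26 minutes

if __name__ == "__main__":
    print(f"98.5%   -> {before:.1f} hours of downtime per year")
    print(f"99.995% -> {after * 60:.0f} minutes of downtime per year")
    print(f"reduction factor: about {before / after:.0f}x")
```

Under that assumption, the change is from roughly five and a half days of downtime a year to under half an hour, a reduction of about 300x.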