    287 points shadaj | 21 comments
    bsnnkv ◴[] No.43196091[source]
    Last month I switched from a role working on a distributed system (FAANG) to a role working on embedded software which runs on cards in data center racks.

    I was in my last role for a year, and 90%+ of my time was spent investigating things that went "missing" at one of the many failure points between the many distributed components.

    I wrote less than 200 lines of code that year and I experienced the highest level of burnout in my professional career.

    The technical aspects that contributed most to this burnout were the lack of observability tooling and the lack of organizational desire to invest in it. Whenever I brought up this gap, I was told that we couldn't spend time/money waiting for people to create "magic tools".

    So far the culture in my new embedded (Rust, fwiw) position is the complete opposite. If you're burnt out working on distributed systems and you care about some of the same things that I do, it's worth giving embedded software dev a shot.

    replies(24): >>43196122 #>>43196159 #>>43196163 #>>43196180 #>>43196239 #>>43196674 #>>43196899 #>>43196910 #>>43196931 #>>43197177 #>>43197902 #>>43198895 #>>43199169 #>>43199589 #>>43199688 #>>43199980 #>>43200186 #>>43200596 #>>43200725 #>>43200890 #>>43202090 #>>43202165 #>>43205115 #>>43208643 #
    1. alabastervlog ◴[] No.43196899[source]
    I've found the rush to distributed computing when it's not strictly necessary kinda baffling. The costs in complexity are extreme. I can't imagine the median company doing this stuff is actually getting either better uptime or performance out of it—sure, it maybe recovers better if something breaks, maybe if you did everything right and regularly tested that stuff (approximately nobody does though), but there's also so very much more crap that can break in the first place.

    Plus: far worse performance ("but it scales smoothly" OK but your max probable scale, which I'll admit does seem high on paper if you've not done much of this stuff before, can fit on one mid-size server, you've just forgotten how powerful computers are because you've been in cloud-land too long...) and crazy-high costs for related hardware(-equivalents), resources, and services.

    All because we're afraid to shell into an actual server and tail a log, I guess? I don't know what else it could be aside from some allergy to doing things the "old way"? I dunno man, seems way simpler and less likely to waste my whole day trying to figure out why, in fact, the logs I need weren't fucking collected in the first place, or got buried in some damn corner of our Cloud I'll never find without writing a 20-line "log query" in some awful language I never use for anything else, in some shitty web dashboard.

    Fewer, or cheaper, personnel? I've never seen cloud transitions do anything but the opposite.

    It's like the whole industry went collectively insane at the same time.

    [EDIT] Oh, and I forgot, for everything you gain in cloud capabilities it seems like you lose two or three things that are feasible when you're running your own servers. Simple shit that's just "add two lines to the nginx config and do an apt-install" becomes three sprints of custom work or whatever, or just doesn't happen because it'd be too expensive. I don't get why someone would give that stuff up unless they really, really had to.

    [EDIT EDIT] I get that this rant is more about "the cloud" than distributed systems per se, but trying to build "cloud native" is the way that most orgs accidentally end up dealing with distributed systems in a much bigger way than they have to.

    replies(10): >>43197578 #>>43197608 #>>43197740 #>>43199134 #>>43199560 #>>43201628 #>>43201737 #>>43202751 #>>43204072 #>>43225726 #
    2. throwawaymaths ◴[] No.43197578[source]
    the minute you have a client (browser, e.g.) and a server you're doing a distributed system, and you should be thinking a little bit about edge cases like loss of connection or an incomplete tx. a lot of the go-to protocols (tcp, http, even stuff like s3) are built with the complexities of distributed systems in mind, so for most basic cases a little thought goes a long way. but you get weird shit happening all the time (that may be tolerable) if you don't put any effort into it.
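    For the basic cases, a minimal sketch of that "little thought" might look something like this (Go, with a hypothetical endpoint, and assuming the request is idempotent so it is safe to retry):

        package main

        import (
            "context"
            "fmt"
            "io"
            "log"
            "net/http"
            "time"
        )

        // fetchWithRetry treats the network as unreliable: each attempt is
        // bounded by a timeout, and failures are retried a few times with a
        // crude backoff. This is only reasonable because GET is idempotent;
        // a non-idempotent request would need its own idempotency key.
        func fetchWithRetry(url string) ([]byte, error) {
            var lastErr error
            for attempt := 0; attempt < 3; attempt++ {
                body, err := func() ([]byte, error) {
                    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
                    defer cancel()
                    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
                    if err != nil {
                        return nil, err
                    }
                    resp, err := http.DefaultClient.Do(req)
                    if err != nil {
                        return nil, err // connection loss, timeout, DNS failure, ...
                    }
                    defer resp.Body.Close()
                    return io.ReadAll(resp.Body)
                }()
                if err == nil {
                    return body, nil
                }
                lastErr = err
                time.Sleep(time.Duration(attempt+1) * 500 * time.Millisecond)
            }
            return nil, fmt.Errorf("giving up after 3 attempts: %w", lastErr)
        }

        func main() {
            data, err := fetchWithRetry("https://example.com/health") // hypothetical URL
            if err != nil {
                log.Fatal(err)
            }
            fmt.Printf("got %d bytes\n", len(data))
        }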
    3. jimbokun ◴[] No.43197608[source]
    Distributed or not is a very binary function. If you can run on one large server, great, just write everything in non-distributed fashion.

    But once you need that second server, everything about your application needs to work in distributed fashion.
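    A tiny, hypothetical Go sketch of the kind of thing that silently stops being correct once a load balancer puts a second instance behind it: any in-process state (counters, sessions, rate limits) now diverges per instance, which is why it has to move into something shared, with all the failure modes that brings.

        package main

        import (
            "fmt"
            "log"
            "net/http"
            "sync"
        )

        // Perfectly correct on one server: every request sees the same map.
        // With two instances behind a load balancer, each process keeps its
        // own copy of 'visits', so counts quietly diverge depending on which
        // instance a request happens to hit.
        var (
            mu     sync.Mutex
            visits = map[string]int{}
        )

        func handler(w http.ResponseWriter, r *http.Request) {
            mu.Lock()
            visits[r.RemoteAddr]++
            n := visits[r.RemoteAddr]
            mu.Unlock()
            fmt.Fprintf(w, "visit number %d from %s\n", n, r.RemoteAddr)
        }

        func main() {
            http.HandleFunc("/", handler)
            log.Fatal(http.ListenAndServe(":8080", nil))
        }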

    replies(2): >>43198610 #>>43228522 #
    4. dekhn ◴[] No.43197740[source]
    I am always happy when I can take a system that is based on distributed computing, and convert it to a stateless single machine job that runs just as quickly but does not have the complexity associated with distributed computing.

    Recently I was going to do a fairly big download of a dataset (45 TB), and when I first looked at it, I figured I could shard the file list and run a bunch of parallel loaders on our cluster.

    Instead, I made a VM with 120 TB of storage (on AWS, using FSx) and ran a single instance of git clone for several days (unattended; just periodically checking in to make sure that git was still running). The storage was more than 2X the dataset size because git LFS requires 2X the disk space. A single multithreaded git process was able to download at 350 MB/sec and it finished at the predicted time (about 3 days). Then I used 'aws s3 sync' to copy the data back to S3, writing at over 1 GB/sec. When I copied the data between two buckets, the rate was 3 GB/sec.
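    (Back-of-the-envelope: 45 TB at a sustained 350 MB/sec is roughly 45e12 / 350e6 ≈ 128,000 seconds, or about a day and a half of raw transfer, so a ~3-day wall clock seems plausible once LFS's duplicated writes and the checkout are added on top.)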

    That said, there are things we simply can't do without distributed computing because there are strong limits on how many CPUs and local storage can be connected to a single memory address space.

    replies(1): >>43198773 #
    5. th0ma5 ◴[] No.43198610[source]
    I wish I could upvote you again. The complexity balloons when you try to adapt something that wasn't distributed, and often things can be way simpler and more robust if you start with a distributed concept.
    replies(1): >>43207352 #
    6. achierius ◴[] No.43198773[source]
    My wheelhouse is lower on the stack, so I'm curious as to what you mean by "stateless single machine job" -- do you just mean that it runs from start to end, without options for suspension/migration/resumption/etc.?
    replies(1): >>43199143 #
    7. whstl ◴[] No.43199134[source]
    I share your opinions, and really enjoyed your rant.

    But it's funny. The transition to distributed/cloud feels like the rush to OOP early in my career. All of a sudden there were certain developers who would claim it was impossible to ship features in procedural codebases, and then proceed to make a fucking mess out of everything using classes, completely misunderstanding what they were selling.

    It is also not unlike what Web-MVC felt like in the mid-2000s. Suddenly everything that came before was considered complete trash by some people that started appearing around me. Then the same people disparaging the old ways started building super rigid CRUD apps with mountains of boilerplate.

    (Probably the only thing I was immediately on board with was the transition from desktop to web, because it actually solved more problems than it created. IMO, IME and YMMV)

    Later we also had React and Docker.

    I'm not salty or anything: I also tried and became proficient in all of those things. Including microservices and the cloud. But it was more out of market pressure than out of personal preference. And like you said, it has a place when it's strictly necessary.

    But now I finally do mostly procedural programming, in Go, on single servers.

    replies(1): >>43199926 #
    8. dekhn ◴[] No.43199143{3}[source]
    it's a pretty generic term but in my mind I was thinking of a job that ran on a machine with remote attached storage (EBS, S3, etc); the state I meant was local storage.
    9. FpUser ◴[] No.43199560[source]
    This is part of what I do for a living: C++ backend software running on real hardware, which is currently insanely powerful. There is of course a spare standby in case things go south. It works like a charm, and I have yet to have a client that came anywhere close to overloading the server.

    I understand that it cannot deal with FAANG-scale problems, but those are relevant only to a small subset of businesses.

    replies(1): >>43200244 #
    10. sakesun ◴[] No.43199926[source]
    Your comment inspires me to brush up my Delphi skills.
    11. intelVISA ◴[] No.43200244[source]
    The highly profitable, self-inflicted problem of using 200 QPS Python frameworks everywhere.
    12. tayo42 ◴[] No.43201628[source]
    This rant misses two things that people always miss

    On distribution: QPS scaling isn't the only reason, and I suspect it's rarely the reason. It's mostly driven by availability needs.

    It's also driven by organizational structure and teams. Two teams don't need to be fighting over the same server to deploy their code, so it gets broken out into services with clear API boundaries.

    And SSH-ing to servers might be fine for you. But systems and access are designed to protect the bottom tier of employees who will mess things up when they tweak things manually. And tweaking things by hand isn't reproducible when things break.

    replies(2): >>43201784 #>>43202839 #
    13. Karrot_Kream ◴[] No.43201784[source]
    Horizontal scaling is also a huge cost savings. If you can run your application with a tiny VM most of the time and scale it up when things get hot, then you save money. If you know your service is used during business hours you can provision extra capacity during business hours and release that capacity during off hours.
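    (Illustrative, made-up numbers: if peak load needs 10 instances for the ~45 business hours in a week but a single instance covers the other ~123 hours, you pay for 10×45 + 1×123 = 573 instance-hours instead of the 10×168 = 1,680 you'd pay keeping peak capacity up all week, i.e. roughly a third of the cost.)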
    14. motorest ◴[] No.43202751[source]
    > I've found the rush to distributed computing when it's not strictly necessary kinda baffling.

    I'm not entirely sure you understand the problem domain, or even the high-level problem. There is not, nor ever was, a "rush" to distributed computing.

    What you actually have is this global epiphany that having multiple computers communicating over a network to do something actually has a name, and it's called distributed computing.

    This means that we had (and still have) guys like you who look at distributed systems and somehow do not understand they are looking at distributed systems. They don't understand that mundane things like a mobile app supporting authentication or someone opening a webpage or email are distributed systems. They don't understand that the discussion on monolith vs microservices is orthogonal to the topic of distributed systems.

    So the people railing against distributed systems are essentially complaining about their own ignorance and failure to actually understand the high-level problem.

    You have two options: acknowledge that, unless you're writing a desktop app that does nothing over a network, odds are every single application you touch is a node in a distributed system, or keep fooling yourself into believing it isn't. I mean, if a webpage fails to load then you just hit F5, right? And if your app just fails to fetch something from a service you just restart it, right? That can't possibly be a distributed system, and those scenarios can't possibly be mitigated by basic distributed computing strategies, right?

    Everything is simple to those who do not understand the problem, and those who do are just making things up.

    replies(1): >>43206886 #
    15. ◴[] No.43202839[source]
    16. ahartmetz ◴[] No.43204072[source]
    >It's like the whole industry went collectively insane at the same time.

    Welcome to computing.

    - OOP will solve all of our problems

    - P2P will solve all of our problems

    - XML will solve all of our problems

    - SOAP will solve all of our problems

    - VMs will solve all of our problems

    - Ruby on Rails and by extension dynamically typed languages will solve all of our problems

    - Docker [etc...]

    - Functional programming

    - node.js

    - Cloud

    - Kubernetes

    - Statically typed languages

    - "Serverless"

    - Rust?

    - AI

    Some have more merit (IMO notably FP, static typing and Rust), some less (notably XML and SOAP)...

    17. lucyjojo ◴[] No.43206886[source]
    you and the guy you are answering to are not talking the same language (technically yes, but you are attaching different meanings to the same words).

    this would lead to a pointless conversation, if it were to ever happen.

    replies(1): >>43217147 #
    18. CogitoCogito ◴[] No.43207352{3}[source]
    I couldn't disagree more. My principle is to write systems extremely simply and then distribute portions of them as it becomes necessary. It almost never becomes necessary, and in the rare cases where it does, it is entirely straightforward to do so unless you have an over-complicated design. I don't think I've ever seen it done well when done in the opposite direction. It's always cost more in time and effort and resulted in something worse.
    replies(1): >>43216746 #
    19. th0ma5 ◴[] No.43216746{4}[source]
    Tons of vendors offer cloud-first, distributed deployments. Erlang is distributed by default. Spark is distributed by default. Most databases are distributed by default.
    replies(1): >>43228529 #
    20. motorest ◴[] No.43217147{3}[source]
    > you and the guy you are answering too are not talking the same language (technically yes but you are putting different meanings to the same words).

    That's the point, isn't it? It's simply wrong to assert that there's a rush to distributed systems when they are already ubiquitous in the real world, even if this comes as a surprise to people like OP. Get acquainted with the definition of distributed computing, and look at reality.

    The only epiphany taking place is people looking at distributed systems and thinking that, yes, perhaps they should be treated as distributed systems. Perhaps the interfaces between multiple microservices are points of failure, but replacing them with a monolith does not make it less of a distributed system. Worse, taking down your monolith is also a failure mode, one with higher severity. How do you mitigate that failure mode? Well, educate yourself about distributed computing.

    If you look at a distributed system and call it something other than distributed system, are you really speaking a different language, or are you simply misguided?

    21. icedchai ◴[] No.43225726[source]
    I've seen this as well. A relatively simple application becomes a mess of Terraform configuration for CloudFront, Lambda, API Gateway, S3, RDS, and a half dozen other lesser services because someone had an obsession with "serverless." And performance is worse. And there's as much Terraform as there is actual application code.