
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points by greghn | 23 comments
siliconc0w ◴[] No.39444011[source]
Core count plus modern nvme actually make a great case for moving away from the cloud- before it was, "your data probably fits into memory". These are so fast that they're close enough to memory so it's "your data surely fits on disk". This reduces the complexity of a lot of workloads so you can just buy a beefy server and do pretty insane caching/calculation/serving with just a single box or two for redundancy.
replies(3): >>39444040 #>>39444175 #>>39444225 #
malfist ◴[] No.39444175[source]
I keep hearing that, but it's simply not true. SSDs are fast, but they're still several orders of magnitude slower than RAM, which is in turn orders of magnitude slower than CPU cache.

A Samsung 990 Pro 2TB has a latency of about 40 μs.

DDR4-2133 with CAS latency 15 has a latency of about 14 nanoseconds.

DDR4 latency is 0.035% of that of one of the fastest SSDs; put another way, DDR4 is roughly 2,857x faster than the SSD.

L1 cache is typically accessible in 4 clock cycles; on a 4.8 GHz CPU like the i7-10700, that's sub-1 ns latency.
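A quick sanity check of those ratios, using the figures quoted above:

```python
# Back-of-envelope check of the latency figures quoted above.
ssd_latency_ns = 40_000   # Samsung 990 Pro: ~40 us read latency
dram_latency_ns = 14      # DDR4-2133 CL15: ~14 ns

ratio = ssd_latency_ns / dram_latency_ns
print(f"DRAM is ~{ratio:.0f}x faster than the SSD")                    # ~2857x
print(f"DRAM latency is {dram_latency_ns / ssd_latency_ns:.3%} of SSD latency")  # 0.035%

# L1 cache: 4 clock cycles at 4.8 GHz
l1_ns = 4 / 4.8
print(f"L1 latency: ~{l1_ns:.2f} ns")                                  # ~0.83 ns
```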

replies(5): >>39444275 #>>39444384 #>>39447096 #>>39448236 #>>39453512 #
1. LeifCarrotson ◴[] No.39444384[source]
I wonder how many people have built failed businesses that never had enough customer data to exceed the DDR4 in the average developer laptop, and never had so many simultaneous queries that a single core running SQLite couldn't handle them, but built the software architecture on a distributed cloud system just in case it eventually scaled to hundreds of terabytes and billions of simultaneous queries.
replies(5): >>39444867 #>>39444883 #>>39445536 #>>39445790 #>>39448007 #
2. Repulsion9513 ◴[] No.39444867[source]
A LOT... especially here.
3. malfist ◴[] No.39444883[source]
I totally hear you. I work at a FAANG, on a service that has to be capable of sending 1.6M text messages in less than 10 minutes.

The amount of complexity the architecture has because of those constraints is insane.

At my previous job, management kept asking for that scale of design at less than 1/1000th of the throughput, and I was constantly pushing back. There are real costs to building for more scale than you need. It's not as simple as just tweaking a few things.

To me there's a couple of big breakpoints in scale:

* When you can run on a single server

* When you need to run on a single server, but with HA redundancies

* When you have to scale beyond a single server

* When you have to adapt your design to the limits of a distributed system, e.g. designing around DynamoDB's partition limits.

Each step in that chain adds irrevocable complexity, adds to OE, and adds to the cost to run and the cost to build. Be sure you have to take those steps before you decide to.

replies(3): >>39446187 #>>39446459 #>>39446823 #
4. icedchai ◴[] No.39445536[source]
Many. I regularly see systems built for "big data", built for scale using "serverless" and some proprietary cloud database (like DynamoDB), storing a few hundred megabytes total. 20 years ago we would've built this on PHP and MySQL and called it a day.
5. Szpadel ◴[] No.39445790[source]
In my day job I often see the opposite, especially with database queries: developers test on a local machine with hundreds of records and everything is quick and snappy, but in production, with mere millions of records, I often see queries take minutes up to an hour, just because some developer didn't see the need to create indexes, or wrote the query in a way that no index could ever help it.
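A toy illustration of the effect, using SQLite and a hypothetical `orders` table: the same query goes from a full table scan to an index lookup once the filtered column is indexed.

```python
import sqlite3
import time

# Hypothetical table with a million rows; each customer has ~100 orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 10_000, i * 0.01) for i in range(1_000_000)],
)

def timed(query):
    t0 = time.perf_counter()
    rows = conn.execute(query).fetchall()
    return time.perf_counter() - t0, rows

# Without an index: SQLite scans all 1M rows.
slow, rows_scan = timed("SELECT * FROM orders WHERE customer_id = 1234")

# With an index: a direct lookup of the ~100 matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
fast, rows_idx = timed("SELECT * FROM orders WHERE customer_id = 1234")

print(f"no index: {slow*1000:.1f} ms, with index: {fast*1000:.1f} ms")
```

The gap only widens as the table grows, which is why a query that felt fine with hundreds of local rows can take minutes in production.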
replies(2): >>39446670 #>>39452118 #
6. disqard ◴[] No.39446187[source]
I'm trying to guess what "OE" stands for... over engineering? operating expenditure? I'd love to know what you meant :)
replies(2): >>39446438 #>>39447039 #
7. madisp ◴[] No.39446438{3}[source]
probably operating expenses
8. kuschku ◴[] No.39446459[source]
Maybe I'm misunderstanding something, but that's about 2,700 messages a second, or about 3 Mbps.
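The back-of-envelope math, assuming SMS-sized (~140 byte) payloads:

```python
# 1.6M messages in a 10-minute window, as stated upthread.
messages = 1_600_000
window_s = 10 * 60

rate = messages / window_s
print(f"~{rate:.0f} messages/second")   # ~2667/s

bytes_per_msg = 140                     # assumed SMS-sized payload
mbps = rate * bytes_per_msg * 8 / 1e6
print(f"~{mbps:.1f} Mbps")              # ~3.0 Mbps
```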

Even a very unoptimized application running on a dev laptop can serve 1Gbps nowadays without issues.

So what are the constraints that demand a complex architecture?

replies(1): >>39448463 #
9. layer8 ◴[] No.39446670[source]
That’s true, but has little to do with distributed cloud architecture vs. single local instance.
10. goguy ◴[] No.39446823[source]
That really doesn't require that much complexity.

I used to send something like 250k a minute, complete with delivery-report processing, from a single machine that was also running a bunch of other services, about 10 years ago.

replies(1): >>39448654 #
11. malfist ◴[] No.39447039{3}[source]
Sorry, I thought it was a common term. Operational Excellence: all the effort and time it takes to keep a service online, on-call included.
12. kristopolous ◴[] No.39448007[source]
You're not considered serious if you don't. Kinda stupid.
replies(1): >>39448616 #
13. rdoherty ◴[] No.39448463{3}[source]
I'm not the OP but a few things:

* Reading/fetching the data - usernames, phone number, message, etc.

* Generating the content for each message - it might be custom per person

* This is using a 3rd party API that might take anywhere from 100ms to 2s to respond, and you need to leave a connection open.

* Retries on errors, rescheduling, backoffs

* At least once or at most once sends? Each has tradeoffs

* Stopping/starting that many messages at any time

* Rate limits on some services you might be using alongside your service (network gateway, database, etc)

* Recordkeeping - did the message send? When?
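Several of those bullets (retries, backoffs, rescheduling, at-least-once sends) tend to collapse into one core loop. A minimal sketch, assuming a hypothetical `send` callable that wraps the third-party API:

```python
import random
import time

def send_with_retries(send, payload, max_attempts=5, base_delay=0.5):
    """Retry `send` with exponential backoff and full jitter.

    `send` is a hypothetical callable wrapping the third-party API.
    This gives at-least-once semantics: a timeout after a successful
    send can still cause a duplicate, which is exactly the
    at-least-once vs. at-most-once tradeoff mentioned above.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error for rescheduling
            # Exponential backoff with full jitter to avoid thundering herds.
            delay = random.uniform(0, base_delay * 2 ** attempt)
            time.sleep(delay)
```

Even this sketch ignores recordkeeping, rate limits on neighboring services, and stop/start control, each of which adds its own machinery at 1.6M-message scale.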

replies(2): >>39452358 #>>39452892 #
14. nine_k ◴[] No.39448616[source]
In the startup world, this is correct.

The success that VCs are after is when your customer base doubles every month. Better yet, every week. Having a reasonably scalable infra at the start ensures that a success won't kill you.

Of course, the chances of a runaway success like this are slim, so 99% or more of startups overbuild given their resulting customer base. But it's like parachutes: 99% or more of pilots who put one on never end up using it; the whole point is the small minority who do, and you never know.

For a stable, predictable, medium-scale business it may make total sense to have a few dedicated physical boxes and run the whole operation from them comfortably, for a fraction of cloud costs. But starting that way is more expensive than starting in the cloud, because you immediately need an SRE, or two.

replies(1): >>39449301 #
15. nine_k ◴[] No.39448654{3}[source]
Nice.

But average throughput is not the whole picture; tail latency is. For good tail latency and handling of spikes, you have to keep a sizable untapped reserve of performance.
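A toy illustration of why the average hides the tail, with made-up latency samples:

```python
import statistics

# Made-up latency samples (ms): 98% fast requests, 2% slow outliers.
samples = [10] * 980 + [500] * 20

mean = statistics.mean(samples)
p99 = sorted(samples)[int(len(samples) * 0.99)]  # nearest-rank 99th percentile
print(f"mean: {mean:.1f} ms, p99: {p99} ms")     # mean: 19.8 ms, p99: 500 ms
```

The mean looks healthy while 1 in 50 requests is 25x slower, which is why capacity planning off the average alone falls over during spikes.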

16. kristopolous ◴[] No.39449301{3}[source]
You aren't going to get there. The risks and complexity of a startup are high to begin with. Adding artificial roadblocks because of aspirational fantasies is going to hold you back.

Look at the big successes - YouTube, Twitter, Facebook, Airbnb, Lyft, Google, Yahoo - exactly zero of them did this preventatively. Even AltaVista and Babel Fish, built by DEC and running on Alphas, which they had plenty of, had to be redone multiple times due to growth. Heck, look at the first 5 years of Amazon. AWS was initially ideated in a contract job for Target.

Address the immediate and real needs and business cases, not pie-in-the-sky aspirations of global dominance - wait until scale becomes a need, and then do it.

The chances of getting there are only reasonable if you move instead of plan; otherwise you'll miss the window and the product opportunity.

I know it ruffles your engineering feathers - that's one of the reasons most attempts at building these things fail. The best ways feel wrong, are counterintuitive, and are often executed by young college kids who don't know any better. It's why successful tech founders tend to be inexperienced; it can actually be advantageous if they make the right "mistakes".

Forget about any supposedly inevitable disaster until it's actually affecting your numbers. I know it's hard but the most controllable difference between success and failure in the startup space is in the behavioral patterns of the stakeholders.

replies(1): >>39449864 #
17. esafak ◴[] No.39449864{4}[source]
Do you remember the companies that did not scale? Friendster did well until it failed to scale, and Facebook took over.

So the converse argument might be: don't bungle it up because you failed to plan. Provision for at least 10x growth with every (re-)implementation.

https://highscalability.com/friendster-lost-lead-because-of-...

replies(1): >>39450384 #
18. kristopolous ◴[] No.39450384{5}[source]
Hold on... You think Facebook took over from Friendster because of scaling problems?!

MySpace was the one that took the lead over Friendster, and it withered after it was acquired by News Corp for $500 million, because that was the liquidity event. That's when Facebook gained ground. Your timeline is wrong.

The switch to MySpace was because of themes and other features the users found more appealing. Twitter had similar crashes with its fail whale for a long time and survived them fine. The teen exodus from Friendster wasn't because of TTLB waterfall graphs.

Also, MySpace did everything on cheap Microsoft IIS 6 servers in ASP 2.0, after switching from ColdFusion in Macromedia HomeSite; they weren't geniuses. It was a knockoff created by amateurs with a couple of new twists. (A modern clone has 2.5 million users, still mostly teenagers: see https://spacehey.com/browse)

Besides, by the time the final Friendster holdout, the Asian market, went into exponential decline in 2008, the scaling problems of 5 years earlier had long been fixed. Faster load times did not make up for a product consumers no longer found compelling.

Also, Facebook initially ran literally out of Mark's dorm room. In 2007, after they had won the war, their code got leaked because their deploy process shipped the .svn directory. Their code was widely mocked. So there we are again.

I don't care if you can find someone who agrees with you on the Friendster scaling thing; almost every collapsed startup has someone who says "we were just too successful and couldn't keep up", because thinking you were too awesome is gentler on the ego than realizing a bunch of scrappy hackers gave people more of what they wanted, and either you didn't realize it or you thought your lack of adaptation was a virtue.

replies(1): >>39451064 #
19. esafak ◴[] No.39451064{6}[source]
How sure are you that they switched because of themes? Did you see user research? I left because of its poor performance, and MySpace was no substitute for Friendster; it targeted an artsy demographic. But Facebook was.
replies(1): >>39451267 #
20. kristopolous ◴[] No.39451267{7}[source]
Yes. I worked in social networks 15 years ago. It was a heavy research topic for me.

You're a highly technical user. Non-technical people are weird - part of the MySpace exodus was driven by the belief that it spread "computer viruses", really.

There was more to the switches, but I'd have to dredge it up, probably through archive sites these days. The reasons the surveys supported seemed ridiculous to me, but that doesn't matter - it's better to understand consumer behavior than to fight it, because we can't easily change it.

Especially these days. It wasn't possible to be a teenager with high-speed Wi-Fi when I was one 30 years ago. I have near-zero understanding of the modern consumer youth market or what they think. Against all my expectations, I've become an old person.

Anyway, the freeform HTML was a major driver - it was GeoCities with less effort. GeoCities had also exited through a liquidity event, and likewise has a modern clone: https://neocities.org/browse

21. pooper ◴[] No.39452118[source]
This is a different topic, and not always a skills issue. The relentless push for "productivity" and "velocity" means you have to cut corners.

Also, sometimes it's poor communication. Just yesterday I saw code that requests a new auth token before every request, even though each bearer token comes with an expires_in of about twelve hours.
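Caching the token until just before it expires would avoid that. A minimal sketch, with `fetch_token` as a hypothetical callable for the provider's auth endpoint:

```python
import time

class TokenCache:
    """Cache a bearer token until shortly before it expires.

    `fetch_token` is a hypothetical callable for the auth endpoint,
    assumed to return a dict with "access_token" and "expires_in"
    (lifetime in seconds), as in a typical OAuth2 token response.
    """

    def __init__(self, fetch_token, safety_margin_s=60):
        self._fetch = fetch_token
        self._margin = safety_margin_s
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Only hit the auth endpoint when the cached token is near expiry.
        if self._token is None or time.monotonic() >= self._expires_at:
            resp = self._fetch()
            self._token = resp["access_token"]
            self._expires_at = time.monotonic() + resp["expires_in"] - self._margin
        return self._token
```

With a twelve-hour lifetime, this turns thousands of auth calls into two per day.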

22. joshstrange ◴[] No.39452358{4}[source]
I literally spent the last week speccing out a system just like this and you are completely correct. You’ve touched on almost every single thing we ran into.
23. kuschku ◴[] No.39452892{4}[source]
Oh, I absolutely agree that the complexity is in these topics. I'm just sceptical that they're enough to turn a task that could run on a laptop into one that requires an entire cluster of machines.

The third party API is the part that has the potential to turn this straightforward task into a byzantine mess, though, so I suspect that's the missing piece of information.

I'm comparing this to my own experience with IRC, where handling the same or larger streams of messages is common - receiving them in real time, storing them, matching and potentially reacting to them, all while running on a Raspberry Pi.