The Samsung 990 Pro 2TB has a latency of about 40 μs.
DDR4-2133 with a CAS latency of 15 has a latency of about 14 nanoseconds.
DDR4 latency is about 0.035% of one of the fastest SSDs; to put it another way, DDR4 is roughly 2,857x faster than the SSD.
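A quick back-of-the-envelope script (Python, just dividing the numbers quoted above, not measurements) showing where the 14 ns and ~2,857x figures come from:

```python
# Sanity-check the DRAM vs SSD numbers above (values as stated, not measured).
DDR4_2133_IO_CLOCK_MHZ = 2133 / 2   # DDR transfers twice per clock, so the I/O clock is ~1066.5 MHz
CAS_CYCLES = 15
SSD_LATENCY_NS = 40_000             # 40 us, as quoted for the 990 Pro

dram_latency_ns = CAS_CYCLES / DDR4_2133_IO_CLOCK_MHZ * 1000   # cycles / MHz gives us, *1000 gives ns
print(f"DDR4-2133 CL15 CAS latency: {dram_latency_ns:.1f} ns")                 # ~14.1 ns
print(f"DRAM latency as % of SSD:   {dram_latency_ns / SSD_LATENCY_NS:.3%}")   # ~0.035%
print(f"SSD / DRAM latency ratio:   {SSD_LATENCY_NS / dram_latency_ns:,.0f}x") # ~2,800x (2,857x if you round to 14 ns exactly)
```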
L1 cache is typically accessible in 4 clock cycles; on a 4.8 GHz CPU like the i7-10700, that puts L1 latency under 1 ns.
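And the same arithmetic for the L1 number, plus the full ladder (again just dividing the quoted figures):

```python
# L1 latency in wall-clock time at a given core frequency (4 cycles is a typical L1d load-to-use latency).
L1_CYCLES = 4
CORE_CLOCK_GHZ = 4.8     # i7-10700 max turbo

l1_latency_ns = L1_CYCLES / CORE_CLOCK_GHZ    # cycles / (cycles per ns) = ns
print(f"L1 latency: {l1_latency_ns:.2f} ns")  # ~0.83 ns

# Rough latency ladder using the figures from this thread:
print(f"DRAM is ~{14 / l1_latency_ns:.0f}x slower than L1")       # ~17x
print(f"SSD is ~{40_000 / l1_latency_ns:,.0f}x slower than L1")   # ~48,000x
```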
The amount of complexity those constraints force into the architecture is insane.
At my previous job, management kept asking for that scale of design for less than 1/1000 of the throughput, and I was constantly pushing back. There are real costs to building for more scale than you need. It's not as simple as just tweaking a few things.
To me there are a few big breakpoints in scale:
* When you can run on a single server
* When you can still run on a single server, but need HA redundancy
* When you have to scale beyond a single server
* When you have to adapt your design to the limits of a distributed system, e.g. designing around DynamoDB's partition limits (there's a sketch of that at the end of this comment)
Each step in that chain adds irrevocable complexity, adds OE, and adds to both the cost to run and the cost to build. Be sure you have to take those steps before you decide to.
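To make the last bullet concrete, here's a minimal sketch (Python; the helper names are mine, not any library's API) of the kind of workaround DynamoDB's per-partition throughput limits push you into, write-sharding a hot partition key:

```python
import random

# DynamoDB caps each physical partition at roughly 3,000 RCU / 1,000 WCU, so a hot
# partition key has to be "write-sharded" by spreading items across N suffixed keys.
# shard_key / all_shard_keys / SHARD_COUNT are made up for this example.
SHARD_COUNT = 10  # chosen so peak writes per logical key / SHARD_COUNT stays under the per-partition cap

def shard_key(logical_key: str) -> str:
    """Spread writes for one logical key across SHARD_COUNT partition keys."""
    return f"{logical_key}#{random.randrange(SHARD_COUNT)}"

def all_shard_keys(logical_key: str) -> list[str]:
    """Reads now have to fan out to every shard and merge the results."""
    return [f"{logical_key}#{i}" for i in range(SHARD_COUNT)]

# Writes become e.g. put_item(Item={"pk": shard_key("orders-2024-06-01"), ...}),
# and every read turns into SHARD_COUNT queries plus a merge -- the kind of
# irrevocable complexity each step up the chain buys you.
```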