
84 points by rbanffy | 1 comment
froh ◴[] No.41852945[source]
Jacobi is one of 70 IBM Fellows (think IBM-internal professors: free rein over a research budget; you earn the title through technical prowess plus business acumen)

at the heart of the Mainframe success is this:

> I’d say high-availability and resiliency means many things, but in particular, two things. It means you have to catch any error that happens in the system - either because a transistor breaks down due to wear over the lifetime, or you get particle injections, or whatever can happen. You detect the stuff and then you have mechanisms to recover. You can't just add this on top after the design is done, you have to be really thinking about it from the get-go.

and then he goes into detail about how that is achieved. the article nicely covers the specifics.

oh and combine the 99.9999999% "nine nines" availability with insane throughput. as in real-time phone-wiretapping throughput, or real-time mass financial transactions, of course.

or a web server for an online image service.

or "your personal web server in a mouse click", sharing 10,000 such virtual machines on a single physical machine, with a shared read-only /ist partition mounted into all guests. not containers, no, virtual machines, in ca. 2006...

"don't trust a computer you can lift"

replies(3): >>41853129 #>>41861040 #>>41878681 #
wolf550e ◴[] No.41853129[source]
The amount of throughput you can get out of AMD EPYC Zen 5 servers for the price of a basic mainframe is insane. Even if IBM wins on single-core performance with an absurd amount of cache and an absurd cooling solution, total rack throughput is definitely won by "commodity" hardware.
replies(2): >>41853447 #>>41853880 #
neverartful ◴[] No.41853880[source]
These comments always come up with every mainframe post. It's not only about performance; if it were, it would be x86 or pSystems (AIX/POWER). The reason customers buy mainframes is RAS (reliability, availability, serviceability). Notice that performance is not part of RAS.
replies(1): >>41854043 #
jiggawatts ◴[] No.41854043[source]
You and the parent are both "missing the point", which is sadly not talked about by the manufacturer either (IBM).

I used to work for Citrix, which is "software that turns Windows into a mainframe OS". Basically, you get remote thin terminals the same as you would with an IBM mainframe, but instead of showing you green text you get a Windows desktop.

Citrix used to sell this as a "cost saving" solution that inevitably cost 2-3x as much as traditional desktops.

The real benefit for both IBM mainframes and Citrix is: latency.

You can't avoid the speed of light, but centralising data and compute into "one box" or as close as you can get it (one rack, one data centre, etc...) provides enormous benefits to most kinds of applications.

If you have some complex business workflow that needs to talk to dozens of tables in multiple logical databases, then having all of that unfold in a single mainframe will be faster than if it has to bounce around a network in a "modern" architecture.

In real enterprise environments (i.e.: not a FAANG) any traffic that has to traverse between servers will typically use 10 Gbps NICs at best (not 100 Gbps!), have no topology optimisation of any kind, and flow through at a minimum one load balancer, one firewall, one router, and multiple switches.

Within a mainframe you might see low double-digit-microsecond latencies between processes or LPARs; across an enterprise network, between services on independent servers, it's not unusual to see well over one millisecond -- one hundred times slower.
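The effect compounds with chattiness. A back-of-the-envelope sketch (the hop count and per-hop latencies below are illustrative assumptions, not measurements from any real system):

```python
def workflow_wait_ms(round_trips: int, latency_us: float) -> float:
    """Total time a chatty workflow spends just waiting on sequential
    round trips, ignoring compute: round_trips hops at latency_us each."""
    return round_trips * latency_us / 1000.0

# A business workflow touching dozens of tables can easily make
# hundreds of sequential round trips (assumed figure):
hops = 500
print(workflow_wait_ms(hops, 20))     # intra-mainframe, ~20 us/hop  -> 10.0 ms
print(workflow_wait_ms(hops, 1000))   # enterprise LAN, ~1 ms/hop    -> 500.0 ms
```

The same workload goes from imperceptible to half a second of pure waiting, even though no single hop looks slow in isolation.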

This is why mainframes are still king for many orgs: They're the ultimate solution for dealing with speed-of-light delays.

PS: I've seen multiple attempts to convert mainframe solutions to modern "racks of boxes" and it was hilarious to watch the architects be totally mystified as to why everything was running like slow treacle when on paper the total compute throughput was an order of magnitude higher than the original mainframe had. They neglected latency in their performance modelling, that's why!

replies(3): >>41854112 #>>41854634 #>>41854691 #
le-mark ◴[] No.41854691[source]
I’d love to read more about these projects. In particular, were they rewrites, or “rehosting”? What domain and what was the avg transaction count? Real-time or batch?
replies(1): >>41855037 #
jiggawatts ◴[] No.41855037[source]
Citrix is almost always used to re-host existing applications. I've only ever seen very small utility apps that were purpose designed for Citrix and always as a part of a larger solution that mostly revolved around existing applications.

Note that Citrix or any similar Windows "terminal services" or "virtual desktop" product fills the same niche as ordinary web applications, except that Win32 GUI apps are supported instead of requiring a rewrite to HTML. The entire point is that existing apps can be hosted with the same kind of properties as a web app, minus the rewrite.

replies(1): >>41858952 #
le-mark ◴[] No.41858952[source]
I was referring to mainframe migrations, sorry that wasn’t clear.
replies(1): >>41863851 #
jiggawatts ◴[] No.41863851{3}[source]
I watched two such mainframe to “modern” architecture transitions recently, one at a telco and one at a medical insurance company. Both replaced what were billing and ERP systems. Both used Java on Linux virtual machines using an n-tier service oriented architecture. Both had a mix of batch and interactive modules.

Both suffered from the same issue, which is actually very common but nobody seems to know: power efficiency throttling of CPU speeds.

The irony was that the new compute platform had such a huge capacity compared to the old mainframe (20x or more) that the CPUs were only about 1% utilised. The default setting on all such servers is to turn cores off or put them into low-power modes as slow as 400 MHz. This murders performance and especially slows down the network because of the added latency of cores having to wake up from deep sleep when a packet arrives.
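On Linux hosts this behavior usually traces back to the cpufreq scaling governor. A minimal sketch for spotting it (the sysfs paths are the standard Linux cpufreq interface, but what is exposed varies by kernel and BIOS settings):

```python
from pathlib import Path

def governors(base: str = "/sys/devices/system/cpu") -> dict:
    """Return {cpu_name: governor} for each CPU exposing cpufreq via sysfs.
    Yields an empty dict on systems without this interface."""
    result = {}
    for gov_file in sorted(Path(base).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        result[gov_file.parent.parent.name] = gov_file.read_text().strip()
    return result

def throttle_suspects(govs: dict) -> list:
    """CPUs whose governor favors power saving over wake-up latency."""
    return [cpu for cpu, g in sorted(govs.items())
            if g in ("powersave", "ondemand", "conservative")]

# Usage on a real Linux host:
#   print(throttle_suspects(governors()))
```

The usual fix is switching the governor or the server's BIOS power profile to performance (e.g. `cpupower frequency-set -g performance`), rather than papering over it in software.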

It was one of those situations where running a busy-loop script on each server would speed up the application because it keeps everything “awake”.
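That kind of keep-awake script can be as crude as the following (a hypothetical sketch: the sleep interval is an assumption that would need tuning against the platform's idle-state entry thresholds, and in practice you would pin one such thread per core):

```python
import threading
import time

def keep_awake(stop: threading.Event, interval_s: float = 0.0005) -> None:
    """Wake up every half millisecond so the core never sits idle long
    enough to drop into a deep low-power state."""
    while not stop.is_set():
        time.sleep(interval_s)

stop = threading.Event()
t = threading.Thread(target=keep_awake, args=(stop,), daemon=True)
t.start()
time.sleep(0.01)   # ... latency-sensitive workload would run here ...
stop.set()
t.join()
```

It burns a little CPU to buy predictable wake-up latency, which is exactly the trade the power-saving defaults refuse to make.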

The telco doubled their capacity as an attempt to fix the issues but this took them to 0.5% utilisation and things got worse.

The health insurer also overcomplicated their network, building a ~100 server cluster as if it was the public cloud. They had nested VLANs, address translation, software defined networking, firewalls between everything, etc… Latency was predictably atrocious and the whole thing ran like it was in slow motion. They too had the CPU throttling issue until I told them about it but the network was so bad it didn’t fix the overall issue.