Why some DVLA digital services don't work at night

(dafyddvaughan.uk)

124 points edent | 2 comments | 12 Jan 25 20:20 UTC

mike_hearn ◴[16 Jan 25 15:35 UTC] No.42726661[source]▶

tl;dr: the same reason other services go offline at night. Concurrency is hard and many computations aren't thread safe, so they need to run serially against stable snapshots of the data. If you don't have a database that can provide that efficiently, you have no choice but to stop the flow of inbound transactions entirely.

Sounds like Dafydd did the right thing in pushing them to deliver some value now and not try to rebuild everything right away. A common mistake I've seen some people make is assuming that overnight batch jobs that have to shut down the service are some side effect of using mainframes, and any new system that uses newer tech won't have that problem.

In reality getting rid of those kinds of batch jobs is often a hard engineering project that requires a redesign of the algorithms or changes to business processes. A classic example is in banking where the ordering of these jobs can change real world outcomes (e.g. are interest payments made first and then cheques processed, or vice-versa?).
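To make the ordering point concrete, here is a minimal sketch with made-up numbers. The function names, the flat 1% rate and the amounts are all hypothetical, not taken from any real banking system; it only shows that the same two jobs give different closing balances depending on which runs first.

```python
from decimal import Decimal


def apply_interest(balance: Decimal, rate: Decimal) -> Decimal:
    """Credit interest on the current balance (hypothetical flat rate)."""
    return balance + balance * rate


def process_cheque(balance: Decimal, amount: Decimal) -> Decimal:
    """Debit a cheque from the balance."""
    return balance - amount


# Illustrative values only.
opening = Decimal("1000.00")
rate = Decimal("0.01")
cheque = Decimal("400.00")

# Job order A: interest is credited on the full balance, then the cheque clears.
interest_first = process_cheque(apply_interest(opening, rate), cheque)

# Job order B: the cheque clears first, so interest accrues on a smaller balance.
cheque_first = apply_interest(process_cheque(opening, cheque), rate)

print(interest_first, cheque_first)
```

With these numbers the customer ends up 4.00 better off when interest runs first, which is why the job order is effectively part of the business rules and can't just be parallelised away.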

In other cases it's often easier for users to understand a system that shuts down overnight. If the rule is "things submitted by 9pm will be processed by the next day" then it's easy to explain. If the rule is "you can submit at any time and it might be processed by the next day, depending on whether or not it happens to intersect the snapshot taken at the start of that particular batch job", then that can be more frustrating than helpful.
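The simple cutoff rule is easy to express in code too. This is only a sketch: the 9pm cutoff comes from the example above, while the function name and the assumption that late submissions roll over to the following night's batch are mine.

```python
from datetime import date, datetime, time, timedelta

# Cutoff from the example rule above; everything else here is an assumption.
CUTOFF = time(21, 0)


def processed_by(submitted: datetime) -> date:
    """Return the date by which a submission will have been processed.

    Submissions made by the cutoff go into that night's batch and are done
    by the next day. Later submissions are assumed to wait for the following
    night's batch.
    """
    days = 1 if submitted.time() <= CUTOFF else 2
    return submitted.date() + timedelta(days=days)


print(processed_by(datetime(2025, 1, 16, 20, 59)))  # before the cutoff
print(processed_by(datetime(2025, 1, 16, 21, 1)))   # after the cutoff
```

The whole rule fits in one comparison against a fixed time, which is roughly why it is easy to explain to users.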

Sometimes the jobs are batch just because of mainframe limitations and not for any other reason; those can be made incremental more easily if you can get off the mainframe platform to begin with. But that requires rewriting huge amounts of code, hence the popularity of emulators and code transpilers.

replies(3): >>42726889 #>>42726950 #>>42735550 #

abigail95 ◴[16 Jan 25 15:50 UTC] No.42726889[source]▶

>>42726661 #

Do you know why the downtime window hasn't been decreasing over time as it gets deployed onto faster hardware over the years?

Nobody would care or notice if this thing had 99.5% availability and went read only for a few minutes per day.
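For reference, a quick back-of-the-envelope check of what 99.5% availability works out to per day (the helper name is just for illustration):

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes_per_day(availability: float) -> float:
    """Daily downtime budget implied by an availability fraction (0 to 1)."""
    return (1.0 - availability) * MINUTES_PER_DAY


# 99.5% availability leaves roughly 7.2 minutes of downtime per day.
print(round(downtime_minutes_per_day(0.995), 1))
```

So "a few minutes per day" and 99.5% are about the same budget.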

replies(4): >>42727036 #>>42727102 #>>42733233 #>>42736529 #

mike_hearn ◴[16 Jan 25 15:59 UTC] No.42727036[source]▶

>>42726889 #

It doesn't get deployed onto faster hardware. Mainframes haven't really got faster.

replies(3): >>42727094 #>>42727131 #>>42730160 #

abigail95 ◴[16 Jan 25 16:05 UTC] No.42727131[source]▶

>>42727036 #

It must be. Maintaining the original hardware would be more expensive than upgrading to compatible but faster systems.

replies(1): >>42727515 #

mike_hearn ◴[16 Jan 25 16:33 UTC] No.42727515[source]▶

>>42727131 #

What compatible systems? Mainframes are maintained in more or less their original state by teams from IBM. They are designed to be single machines that scale vertically and never shut down. Every component can be hot-swapped, including CPUs, but IBM charge a lot for CPU capacity if I recall correctly. Given that nighttime doesn't get shorter, the DVLA probably don't see much reason to pay a lot more for a slightly smaller window.

And mainframes from the 80s are slow. It sounds like they're running on the original.

replies(1): >>42729075 #

ndriscoll ◴[16 Jan 25 18:36 UTC] No.42729075[source]▶

>>42727515 #

Newer mainframes are still faster than older mainframes, and can have hundreds of cores and tens of TB of RAM. A big part of IBM's draw is that they make modern systems that will continue to run your software forever with no modifications. I had an older guy there tell me a story about them changing a default in some ISPF panel, and customers complained enough that they had to change it back. Their storage systems have a virtualization layer for old programs that send commands to move the heads of a drive that hasn't been manufactured for 55 years or whatever; it translates those commands to use storage backed by a modern RAID with normal disks. The engineers in the mainframe groups know who their customer base is and what they want.

It's unlikely that they're literally using 40-year-old hardware, since the replacement parts for that would be a nightmare to find and almost certainly more expensive than a compatible new machine.
