Bad News: the system can't recover from an error in an individual flight plan, bringing the whole system down with it (along with the backup system since it was running the same code).
> This forced controllers to revert to manual processing, leading to more than 1,500 flight cancellations and delaying hundreds of services which did operate.
Seems "reject individual flight plan" might be a better system response than "down hard to prevent corruption"
The bad assumption that any failure to interpret a plan must be a serious coding error seems to be the root cause, but it's hard to say for sure.
https://news.ycombinator.com/item?id=37461695 ("UK air traffic control meltdown (jameshaydon.github.io)")
From the day of:
https://news.ycombinator.com/item?id=37292406 - 33 points by woodylondon on Aug 28, 2023 (23 comments)
Discussions after:
https://news.ycombinator.com/item?id=37401864 - 22 points by bigjump on Sept 6, 2023 (19 comments)
https://news.ycombinator.com/item?id=37402766 - 24 points by orobinson on Sept 6, 2023 (20 comments)
https://news.ycombinator.com/item?id=37430384 - 34 points by simonjgreen on Sept 8, 2023 (68 comments)
https://news.ycombinator.com/item?id=37461695 ("UK air traffic control meltdown (jameshaydon.github.io)", 446 comments)
I guess we now need a "Falsehoods Programmers Believe About Aviation Data" site :)
I expect that out of any random sample of 500 million literate and mentally healthy English speakers, more than 450 million of them are totally unaccustomed to thinking about nautical miles, ever. Even people in science, or who dealt with nanometers in school, do not typically think about nautical miles unless they are sailors or airplane pilots.
When automated systems are first put in place, for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan. After all, literally yesterday they were all functioning without the automated system, if it doesn't seem to be working right better switch back to the manual process we were all using yesterday, instead of risk a catastrophe.
In that situation, switching back to yesterday's workflow is something that won't interrupt much.
A couple decades -- or honestly even just a couple years -- later, that same fault-handling system, left in place without much consideration because it is rarely triggered, is itself catastrophic: switching back to a rarely used and much more inefficient manual process is extremely disruptive, and itself raises the risk of catastrophic mistakes.
The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault handling, but there may be other cases) that was totally reasonable when put in place, but has not aged well, and nobody has noticed or realized it; even if they did, it might be hard to convince anyone it's a priority to improve, and the longer you wait the more expensive it gets.
It's pretty readable and quite interesting.
From the system's POV maybe this is the right way to resolve the problem. Could masking the failure by obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights? If so, maybe it's truly urgent enough to bring down the system and force the humans to resolve the discrepancy.
The systems outside of the scope of this one failed to preserve a uniqueness guarantee that was depended on by this system. Was that dependency correctly identified as one that was the job of System X and not System Y?
Related: The editorialized HN title uses nanometers (nm) when they possibly mean nautical miles (nmi). What would a flight control system make of that?
So, yes, people who have worked in a related field get it. Still annoying though.
I've never had cause to see the abbreviated form of nautical miles, but I know nanometers. Also, given purely the title I could see it being some kind of data collision due to precision errors between two locations that should be the same airport but perhaps two different sensors.
'nm' and 'NM' are the accepted abbreviations for nautical miles in the aviation industry, whether official or not.
But this is Hacker News, not Aviation News, and there are plenty of people, like me, who might find this interesting but aren't in aviation. I also thought it meant nanometers.
The article title is talking about location data and computers, I've seen many people forget floating point precision when comparing and getting bit by tiny differences at the 10^-9 or smaller. That seems just as obvious on the outset as non-unique location designations in what the average person would assume to be a dataset that's intentionally unique and unambiguous.
Exactly, which makes the headline particularly intriguing—how could a 3600 nanometer difference matter? The standard resolution, which I pursued, is to read the article to find out, but it doesn't mention the distance at all.
I'm aware that closing "future investigation" tickets when the issue no longer seems to be a problem is common. But it shouldn't be.
I'd rather deal with designing tables to properly represent names.
Increased number of injuries but not deaths could be, for example, (purely making things up off the top of my head here) due to higher levels of distractedness among average drivers due to fear of terrorism, which results in more low-speed, surface-street collisions, while there’s no change in high speed collisions because a short spell of distractedness on the highway is less likely to result in an accident.
CORRECTING the flight plan, by first promoting the exit/entry points for each autonomous region along the route, validating only the entry/exit list, and then the arcs within, would be the least error-prone method.
They are understood not to be. They are generally known to be regionally unique.
The "DVL" code is unique within FAA/Transport Canada control, and the other "DVL" is unique within EASA space.
There are pre-defined three-letter codes:
* https://en.wikipedia.org/wiki/IATA_airport_code
And pre-defined four-letter codes:
* https://en.wikipedia.org/wiki/ICAO_airport_code
There are also five-letter names for major route points:
* https://data.icao.int/icads/Product/View/98
* https://ruk.ca/content/icao-icard-and-5lnc-how-those-5-lette...
If there are duplicates there is a resolution process:
* https://www.icao.int/WACAF/Documents/Meetings/2014/ICARD/ICA...
Let's look at point 2.28: "Several factors made the identification and rectification of the failure more protracted than it might otherwise have been. These include:
• The Level 2 engineer was rostered on-call and therefore was not available on site at the time of the failure. Having exhausted remote intervention options, it took 1.5 hours for the individual to arrive on-site to perform the necessary full system re-start which was not possible remotely.
• The engineer team followed escalation protocols which resulted in the assistance of the Level 3 engineer not being sought for more than 3 hours after the initial event.
• The Level 3 engineer was unfamiliar with the specific fault message recorded in the FPRSA-R fault log and required the assistance of Frequentis Comsoft to interpret it.
• The assistance of Frequentis Comsoft, which had a unique level of knowledge of the AMS-UK and FPRSA-R interface, was not sought for more than 4 hours after the initial event.
• The joint decision-making model used by NERL for incident management meant there was no single post-holder with accountability for overall management of the incident, such as a senior Incident Manager.
• The status of the data within the AMS-UK during the period of the incident was not clearly understood.
• There was a lack of clear documentation identifying system connectivity.
• The password login details of the Level 2 engineer could not be readily verified due to the architecture of the system."
WHAT DOES "PASSWORD LOGIN DETAILS ... COULD NOT BE READILY VERIFIED" MEAN?
EDIT: Per NATS Major Incident Investigation Final Report - Flight Plan Reception Suite Automated (FPRSA-R) Sub-system Incident 28th August 2023 https://www.caa.co.uk/publication/download/23340 (PDF) ... "There was a 26-minute delay between the AMS-UK system being ready for use and FPRSA-R being enabled. This was in part caused by a password login issue for the Level 2 Engineer. At this point, the system was brought back up on one server, which did not contain the password database. When the engineer entered the correct password, it could not be verified by the server. "
Here’s a previous thread where someone thought it was absurd that there could exist native English speakers who don’t regularly go shopping, and treated that supposed impossibility as a huge “checkmate”!
Software can (maybe) be perfect, or it can be relevant to a large user base. It cannot be both.
With an enormous budget and a strictly controlled scope (spacecraft) it may be possible to achieve defect-free software.
In most cases it is not. There are always finite resources, and almost always more ideas than there is time to implement.
If you are trying to make money, is it worth chasing down issues that affect a minuscule fraction of users and take eng time which could be spent on architectural improvements, features, or bugs affecting more people?
If you are an open source or passion project, is it worth your contributors' limited hours, and will trying to insist people chase down everything drive your contributors away?
The reality in any sufficiently large project is that the bug database will only grow over time. If you leave open every old request and report at P3, users will grow just as disillusioned as if you were honest and closed them as "won't fix". Having thousands of open issues that will never be worked on pollutes the database and makes it harder to keep track of the issues which DO matter.
The solution to this is to trigger all functionality periodically and randomly to ensure it remains tested. If you don't test your backups, you don't have any.
I assumed IATA messed up, now I'm wondering how that even happens. It's not even easy to discover the local codes of remote aviation authorities.
In fact, if you say "miles", you mean nautical miles. You have to use "sm" to mean statute miles if you're using that unit, which is often used for measuring visibility.
From Sept 2023 (flightglobal.com):
- Comments: https://news.ycombinator.com/item?id=37430384
Also some more detailed analysis:
- https://jameshaydon.github.io/nats-fail/
- Comments: https://news.ycombinator.com/item?id=37461695
I've worked on a (medical, not aviation) system where we tried as much as possible to recover from subsystem failures or at least gracefully reduce functionality until it was safe to shut everything down.
However, there were certain classes of failure where the safest course of action was to shut the entire system down immediately. This was generally the case where continuing to run could have made matters worse, putting patient safety at risk. I suspect that the designers of this system ran into the same problem.
https://www.ecfr.gov/current/title-14/chapter-I/subchapter-A...
In fact, I can't see how it follows from the rest.
Software can have defects, true. There are finite resources, true. So keep the tickets open. Eventually someone will fix them.
Closing something for spurious psychological reasons seems detrimental to actual engineering and it doesn't actually avoid any real problem.
Let me repeat that: ignoring a problem doesn't make it disappear.
Keep the tickets open.
Anything else is supporting a lie.
> Investigators probing the serious UK air traffic control system failure in August last year [...]
=3
Inches of mercury, magnetic bearings (the magnetic poles move! but they put up with that) and gallons of fuel, all just accepted.
Got a safety-of-life emergency on an ocean liner, oil tanker or whatever? Everywhere in the entire world mandates GMDSS which includes Digital Selective Calling, the boring but complicated problems with radio communication are solved by a machine, you just need to know who you want to talk to (for Mayday calls it's everyone) and what you want to tell them (where you are, that you need urgent assistance and maybe the nature of the emergency)
On a big plane? Well good luck, they only have analogue radio and it's your problem to cope with the extensive troubles as a result.
I'm actually impressed that COSPAS/SARSAT wasn't obliged to keep the analogue plane transmitters working, despite obsoleting (and no longer providing rescue for) analogue boat or personal transmitters. But on that, at least, they were able to say no, if you don't want to spend a few grand on the upgrade for your million dollar plane we don't plan to spend billions of dollars to maintain the satellites just so you can keep your worse system limping along.
That would be roughly consistent with the title and not a totally absurd thing to happen in the world.
/* This should never happen */
if (waypoints.matchcount > 2) {
Best I can see (using Rust) is a hashmap on UTF-8 string keys and every code in existence gets inserted into the hash map with an enum struct based on the code type. So you are forced to switch over each enum case and handle each case no matter what region code type.
It becomes apparent that the problem must be handled with app logic earlier in the system; to query a database of codes, you must also know which code and "what type" of code it is. Users are going to want to give the code only, so there's some interesting misdirection introduced; the system has to somehow fuzzy-match the best code for the itinerary. Correct me if I'm wrong, but the above seems like a mandatory step in solving the problem which would have caught the exception.
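A minimal sketch of that hashmap-plus-enum idea (Rust, with invented names; not how FPRSA-R actually works): every code maps to a list of typed entries, and lookup hands back all matches, so the caller is forced to decide what "ambiguous" means instead of silently taking the first hit.

    use std::collections::HashMap;

    // Hypothetical code types; a real database would distinguish many more.
    #[derive(Debug, Clone)]
    enum WaypointCode {
        IcaoAirport,
        NavaidFaa,   // e.g. "DVL" = Devils Lake VOR (FAA/Transport Canada)
        NavaidEasa,  // e.g. "DVL" = Deauville VOR (EASA space)
    }

    fn lookup<'a>(db: &'a HashMap<String, Vec<WaypointCode>>, code: &str) -> &'a [WaypointCode] {
        db.get(code).map(Vec::as_slice).unwrap_or(&[])
    }

    fn main() {
        let mut db: HashMap<String, Vec<WaypointCode>> = HashMap::new();
        db.entry("DVL".to_string())
            .or_default()
            .extend([WaypointCode::NavaidFaa, WaypointCode::NavaidEasa]);

        // The caller cannot pretend the code is unique: it gets every match back
        // and has to disambiguate, e.g. by which airspace the route passes through.
        match lookup(&db, "DVL") {
            [] => println!("unknown code: reject the plan, don't crash"),
            [only] => println!("unambiguous: {:?}", only),
            many => println!("ambiguous ({} matches): need route context", many.len()),
        }
    }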
I echo other comments that say that there's probably 60% more work involved than your manager realizes.
Look, when you're barking orders at the guys in the trenches who, understandably in fear for their jobs, do the stupid "business-smart" thing, then it is entirely the fault of management.
I can't tell you how many times just in the last year I've been blamed-by-proxy for doing something that was decreed upon me by some moron in a corner office. Everything is an emergency, everything needs to be done yesterday, everything is changing all the time because King Shit and his merry band of boot-licking middle managers decide it should be.
Software engineers, especially ones with significant experience, are almost surely more right than middle managers. "Shouldn't we consider this case?" is almost always met with some parable about "overengineering" and followed up by a healthy dose of "that's not AGILE". I have grown so tired of this and thanks to the massive crater in job mobility most of us just do as we are told.
It's the power imbalance. In this light, all blame should fall on the manager unless it can be explicitly shown to be a developer problem. The adage "those who can, do, and those who can't, teach" applies equally to management.
When it's my f@#$U neck on the line and the only option to keep my job is do the stupid thing you can bet I'll do the stupid thing. Thank god there's no malpractice law in software.
Poor you - only one of our jobs is getting shipped overseas.
There are a bunch of ways FPRSA-R can already interpret data like this correctly, but there were a combination of 6 specific criteria that hadn’t been foreseen (e.g. the duplicate waypoints, the waypoints both being outside UK airspace, the exit from UK airspace being implicit on the plan as filed, etc).
I've also seen "DANGER!! 12000000 μVolts!!!" on tiny little model railroad signs.
Flights are tracked by radar and by transponder. The appropriate thing to do is just flag the flight with a discontinuity error but otherwise operate normally. This happens with other statuses like "radio failure" or "emergency aircraft."
It's not something you'd see on a commercial flight, but on a private IFR flight (one with a flight plan) you can actually cancel your IFR plan mid-flight and revert to VFR (visual flight rules) instead.
Some flights take off without an IFR clearance as a VFR flight, but once airborne, they call up ATC and request an IFR clearance already en route.
The system is vouchsafing where it does not need to.
Then why aren’t they namespaced? Attach to each code its issuing authority, so it is obvious to the code that DVL@FAA and DVL@EASA are two different things?
Maybe for backward compatibility/ human factors reasons, the code needs to be displayed without the namespace to pilots and air traffic controllers, but it should be a field in the data formats.
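A rough sketch of the namespacing idea (Rust, with invented field names and illustrative coordinates): make the issuing authority part of the key in the data model, and strip it only at the display layer for pilots and controllers.

    use std::collections::HashMap;

    // Hypothetical namespaced identifier: the authority is part of the key,
    // so DVL@FAA and DVL@EASA can never collide in the data model.
    #[derive(Debug, Clone, PartialEq, Eq, Hash)]
    struct CodeId {
        authority: &'static str, // e.g. "FAA", "EASA"
        code: String,            // e.g. "DVL"
    }

    impl CodeId {
        // What pilots and controllers see: just the bare code, for compatibility.
        fn display(&self) -> &str {
            &self.code
        }
    }

    fn main() {
        let mut coords: HashMap<CodeId, (f64, f64)> = HashMap::new();
        // Illustrative coordinates only.
        coords.insert(CodeId { authority: "FAA", code: "DVL".into() }, (48.1, -98.9));
        coords.insert(CodeId { authority: "EASA", code: "DVL".into() }, (49.3, 0.3));

        // Two distinct entries, even though both display as "DVL".
        for (id, pos) in &coords {
            println!("{} ({}) -> {:?}", id.display(), id.authority, pos);
        }
    }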
This isn't IATA. IATA manages codes used for passenger and cargo bookings, which are distinct from the codes used by pilots and air traffic control that we are talking about here, ultimately overseen by ICAO. These codes include a lot of stuff which is irrelevant to passengers/freight, such as navigation waypoints and military airbases (which normally would never accept a civilian flight, but could still be used for an emergency landing; plus civilian and military ATC coordinate with each other to avoid conflicts).
Flagging the error is absolutely the right way to go. It should have rejected the flight plan, however. There could be issues if the flight was allowed to proceed and you now have an aircraft you didn't expect showing up.
Crashing is not the way to handle it.
I was three days in my jeans at business meetings. My bag came back through Lima, Peru and Houston. My bag was having more fun than me.
In general, the point where a problem first becomes apparent is not a guideline to its scope.
Air traffic control is inherently a coordination problem dependent on common data, rules and procedures, which would seem to limit the degree to which subsystems can be siloed. Multiple implementations would not have helped in this case, either.
If you are superstitious about bugs, it's time to triage. I disagree completely with your direction here.
Bugs seem to scale log-linearly with code complexity. If it’s exponential you’re doing it wrong.
If one bad flight plan came in, what are the chances other unnoticed errors may be getting through?
Given the huge danger involved with being wrong, shutting down with a "stuff doesn't add up, no confidence in safe operation" error may be the best approach.
Aug 2023: “UK air traffic woes caused by 'invalid flight plan data'”
https://www.theregister.com/2023/08/30/uk_air_traffic_woes_i... --
(-11 down votes and counting)
1. The bug tracker is there to document and prioritize the list of bugs that we know about, whether or not they will ever be fixed. In this world, if it's a real issue, it's tracked and kept while it exists in the software, even though it might be trivial, difficult, or just not worth fixing. There's no such thing as closing the bug as "Won't Fix" or "Too Old". Further, there's no expectation that any particular bug is being worked on or will ever be fixed. Teams might run through the bug list periodically to close issues that no longer reproduce.
2. The bug tracker tracks engineering load: the working set of bugs that are worthy of being fixed and have a chance to be fixed. Just because the issue is real, doesn't mean it's going to be fixed. So file the bug, but it may be closed if it is not going to be worked on. It also may be closed if it gets old and it's obvious it will never get worked on. In this model, every bug in the tracker is expected to be resolved at some point. Teams will run through the bug list periodically to close issues that we've lived with for a long time and just won't be fixed ever.
I think both are valid, but as a software organization, you need to agree on which model you're using.
Originally 1 meter was one ten-millionth of the distance over the surface of the earth from the equator to the pole.
One nautical mile is the length of one arc-minute of latitude along a meridian. (About 1.85 km).
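As a quick sanity check on that figure (using the original metre definition, so the meridional circumference is roughly 40,000 km): a full circle is 360 × 60 = 21,600 arc-minutes, and 40,000 km / 21,600 ≈ 1.852 km per arc-minute.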
Everything dies including (probably) the universe, and shortly before that, our software. So you're right, the number of bugs in a specific application is ultimately finite. But most of even the oldest software still in use is still getting regular revisions, and if app code is still being written, it's safe to assume bugs are still being created by the fallible minds that conceived it. So practically speaking, for an application still in-development, the number of bugs, number of features, number of lines of code, etc. are dynamic, not finite, and mostly ever-increasing.
The UK is part of the IFPS Zone, centrally managed by EUROCONTROL using ATFM. IFPS can accept/reject IFR flight plans, but the software at NATS can't. By the time NATS gets the flight plan, it has already been accepted. All their software can do is work out which parts enter the UK's airspace. If it's a long route, the plane has already taken off.
NATS aren't even thinking of a mixed-mode approach (for IFR flight plans) where they have both automated processing and manual processing of things the automated processing can't handle. They don't have a system or processes capable of that. And until this one flight, they'd never had a flight plan the automated system couldn't handle.
The failures here were:
1) a very unlikely edge case whose processing was specified, but wasn't implemented correctly, in the vendor's processing software
2) no test case for the unlikely edge case because it was really _that_ unlikely, all experts involved in designing the spec did not imagine this could happen
3) they had the same vendor's software on both primary and secondary systems, so failover failed too; a second implementation might have succeeded where the first failed, but no guarantees
4) they had a series of incident management failures that meant they failed to fix the broken system within 4 hours, meaning NATS had to switch to manual processing of flight plans
That all programming languages, down to statically typed assembly, don’t support something as simple to validate as unit consistency says something strange about how the science of replacing unreliable manual processes with automated systems is really bad at the practice of replacing its own risky manual processes with automated systems.
If numeric types just required a given unit, without even supporting automated conversions, it would make incorrectly unit-ed/scaled literals vastly less likely.
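A minimal sketch of that in Rust (types invented for illustration, not a real units library): distinct newtypes per unit, arithmetic only within a unit, and no implicit conversion.

    // Mixing nautical miles and nanometres becomes a compile error,
    // without any automatic conversion machinery.
    #[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
    struct NauticalMiles(f64);

    #[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
    struct Nanometres(f64);

    impl std::ops::Sub for NauticalMiles {
        type Output = NauticalMiles;
        fn sub(self, rhs: Self) -> Self::Output {
            NauticalMiles(self.0 - rhs.0)
        }
    }

    fn main() {
        let leg = NauticalMiles(3600.0) - NauticalMiles(36.0); // fine
        println!("{:?}", leg);
        // let wrong = NauticalMiles(3600.0) - Nanometres(36.0); // does not compile
    }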
waypoint = waypointsMatches[0]
Without even mentioning that waypointsMatches might have multiple elements. This is why I always consider [0] to be a code smell. It doesn't have a name afaik, but it should.
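One way to give it a name and make the "how many matches?" question impossible to skip (a sketch with invented names, nothing to do with the actual FPRSA-R code):

    // Classify the result of a lookup instead of blindly indexing [0].
    #[derive(Debug)]
    enum Match<T> {
        None,
        Unique(T),
        Ambiguous(Vec<T>),
    }

    fn classify<T>(mut matches: Vec<T>) -> Match<T> {
        match matches.len() {
            0 => Match::None,
            1 => Match::Unique(matches.remove(0)),
            _ => Match::Ambiguous(matches),
        }
    }

    fn main() {
        // The caller has to say what happens when the match isn't unique.
        match classify(vec!["DVL (FAA)", "DVL (EASA)"]) {
            Match::None => println!("no such waypoint: reject the flight plan"),
            Match::Unique(w) => println!("using {}", w),
            Match::Ambiguous(ws) => println!("ambiguous ({} candidates): escalate", ws.len()),
        }
    }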
This is hack-on-hack stuff, but I am wondering if there is a low-cost fix for a design behaviour which can't be altered without every airline, and every other airline system worldwide, accommodating the changes to remove 3-letter code collisions.
Gate the problem. Require routing for TLA collisions to be done by hand, or be fixed in post into two paths which avoid the collision (introduce an intermediate waypoint).
Also, old bugs can get fixed by accident / the environment changing / the whole subsystem getting replaced, and if most of your long tail of bugs is already fixed then it wastes people's time triaging it.
I agree no sexy languages have it, and almost all languages have terrible support or anti-support for correctness in numerical programming. It's very strange.
(By anti-support I mean things that waste your time and make it harder. For instance, a lot of languages think "static typing" means they need to prevent you from doing `int a,b; short c = a * b;` even if this is totally well-defined.)
"we fall over far less often now"
Thank you for neg votes kind strangers. Remember ATC is rife with historical kludges, including using whiteout on the giant green screens to make the phosphor get ignored by the light gun. This is an industry addicted to backwards compatibility to the point that you can buy dongle adapters for dot-matrix printers at every gate, such that they don't have to replace the printer but can back-end the faster network into it.
C/F rebooting a 787 inside the maximum-days-without-a-reboot
https://chaos.social/@russss/111048524540643971
Time to tick that "repeat incident?" box in the incident management system, guys.
Swift has units as part of the standard library. In the sense that matters here, Rust and C++ could also have units. It requires a level of expressiveness in the type system that most modern languages do have, if you put it to use.
int/short should be thought of as storage size optimizations for memory. They're very bad ways to specify the correct range of values for a variable.
(Ada has explicitly ranged integers though!)
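A rough approximation of ranged integers in a language that lacks them (Rust here, with a made-up type): validate the range at construction instead of leaning on int/short for correctness.

    // A heading in whole degrees, restricted to 0..360 at construction time.
    #[derive(Debug, Clone, Copy)]
    struct HeadingDeg(u16);

    impl HeadingDeg {
        fn new(deg: u16) -> Result<Self, String> {
            if deg < 360 {
                Ok(HeadingDeg(deg))
            } else {
                Err(format!("heading out of range: {}", deg))
            }
        }
    }

    fn main() {
        println!("{:?}", HeadingDeg::new(275)); // Ok(HeadingDeg(275))
        println!("{:?}", HeadingDeg::new(400)); // Err("heading out of range: 400")
    }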
Catastrophe is most likely to strike when you try to fix a small mistake: pushing a hot-fix that takes down the server; burning yourself trying to take overdone cookies from the oven; offending someone you are trying to apologize to.
Realistically, no, they won't. If the rate of new P0-P2 bugs is higher than the rate of fixing being done, then the P3 bugs will never be fixed. Certainly by the time someone gets around to trying to fix the bug, the ticket will be far enough out of date that that person will not be able to trust it. There is zero value in keeping the ticket around.
> Anything else is supporting a lie.
Now who's prioritising "spurious psychological reasons" over the things that actually matter? Closing the ticket as wontfix isn't denying that the bug exists, it's acknowledging that the bug won't be fixed. Which is much less of a lie than leaving it open.
Put another way: you're working on a new version of your product. There are 900 issues in the tracker. Is this an urgent emergency where you need to shut down feature work and stabilize?
If you keep a clean work tracker where things that are open mean work that should get done: absolutely
If you just track everything ever and started this release with 1100 issues: no, not necessarily.
But wait, of those 900 issues are there any that should block the release? Now you have 900 to go through and determine. And unless you won't fix some of them you'll have the same thing in a few months.
Work planning, triage, and other team tasks are not magic and voodoo, but in my experience the same engineers who object to the idea of "won't fix" are the ones who want to just code all day and never have to deal with the impact of a huge messy issue database on team and product.
> Originally 1 meter was one ten-millionth of the distance over the surface of the earth from the equator to the pole.
Even more originally, they wanted to use the length of a pendulum that takes one second to swing. But they discovered that this varies from place to place. So they came up with the newer definition based on the size of the earth. And just like with all the subsequent redefinitions (like the one based on the speed of light etc), the new length of the metre matches the old length of the metre:
> [The] length of the string will be approximately 993.6 millimetres, i.e. less than a centimetre short of one metre everywhere on Earth. This is because the value of g, expressed in m/s^2, is very close to π^2.
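To spell out the arithmetic: a pendulum's period is T = 2π√(L/g), and a seconds pendulum has T = 2 s (one second per swing), so L = g·(T/2π)² = g/π² ≈ 9.81/9.87 ≈ 0.994 m, which is the 993.6 mm quoted above.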
The definitions matching is by design, not an accident.
See https://en.wikipedia.org/wiki/History_of_the_metre and https://en.wikipedia.org/wiki/Seconds_pendulum
If you want something less arbitrary, you can pick 'Natural Units': https://en.wikipedia.org/wiki/Natural_units
Many programming languages are flexible and strong enough to support this. We just don't do it by default, and you'd need libraries.
Btw, units by themselves are useful, but not enough. Eg angular momentum and energy have the same units of Newton * metre, but adding them up is not recommended.
When used like this it just confuses the reader with rhetoric. In this case Netflix is just bad at live streaming; they clearly haven't done the necessary engineering work on it.
That's not a remotely plausible model though. There are recorded cases of e.g. 1.6 seconds of distractedness at high speed causing a fatal collision. Anything that increases road injuries is almost certainly also increasing deaths in something close to proportion, but a given study size obviously has a lot more power to detect injuries than deaths.
Chubby-SRE added quarterly synthetic downtime of the global cell (iff the downtime SLA had not already been exceeded).
You keep the record of the bug, someone searching for the symptoms can find the wontfix bug. Ideally you put it in the program documentation as a known issue. You just don't keep it open, because it's never going to be worked on.
> Like what's wrong with having 1000 open bugs?
Noise, and creating misleading expectations.
Perhaps. But the triage to separate the "good bugs accurately describing real things" from the chaff isn't free either.
Pretty sure this is exactly what happened with Cruise in San Francisco, cars would just stop and await instructions causing traffic jams. The city got mad, so they added a "pullover" mechanism. Except now, the "pullover" mechanism ended up dragging someone who had been "flung" into the car's path by someone who had hit and run a pedestrian.
The real world will break all your test cases.
The decision between feature work vs. maintenance work in a company is driven by business needs, not by the number of bugs open in the issue tracker. If anything, keeping real bugs open helps business leaders actually determine their business needs more effectively. Closing them unfixed is the equivalent of putting your head in the sand.
That's quite a big assumption. Every company I've worked at where that was the case had terrible culture and constantly shipped buggy crap. Not really the kind of environment that I'd use to set policy or best practices.
So the rule is "Free parking on Sundays", and the exception that proves it is "Free parking on Sundays"? That's a post-hoc (circular) argument that does not convince me at all.
I read a different explanation of this phrase on HN recently: the "prove" in "exception proves the rule" has the same meaning as the "prove" (or "proof") in "50% proof alcohol".
AIUI, in this context "proof" means "tests". The exception that tests the rule simply shows where the limits of the rules actually are.
Well, that's how I understood it, anyway. Made sense to me at the time I read the explanation, but I'm open to being convinced otherwise with sufficiently persuasive logic :-)
Alternatively then, perhaps safety developments in cars made them safer to drive around the same time? Or maybe advances in medicine made fatal crashes less likely? Or perhaps there’s some other explanation that doesn’t immediately spring to mind, it’s irrelevant.
The only point I’m really making is that the data OP referred to does not show an increase in excess deaths, and in fact specifically fails to find this.
Now when considering priorities, consider not just impact and age, but also times checked.
It's less expensive than going and deleting stuff. Unless you're automating deletions? In which case... I don't think I can continue this discussion.
https://static.googleusercontent.com/media/research.google.c... [pdf]
Some random blog post: https://medium.com/coinmonks/chubby-a-centralized-lock-servi...
You can run multiple copies/instances of chubby at the same time (like you could run two separate zookeepers). You usually run an odd number of them, typically 5. A group of chubby processes all managing the same namespace is a “cell”.
A while ago, nearly everything at Google had at least an indirect dependency on chubby being available (for service discovery etc), so part of the standard bringup for a datacenter was setting up a dc-specific chubby cell. You could have multiple SRE-managed chubby cells per datacenter/cluster if there was some reason for it. Anybody could run their own, but chubby-sre wasn’t responsible for anybody else’s, I think.
Finally, there was a global cell. It was both distributed across multiple datacenters and also contained endpoint information for the per-dc chubby cells, so if a brand new process woke up somewhere and all it knew how to access was the global chubby cell, it could bootstrap from that to talking to chubby in any datacenter and thence to any other process anywhere, more or less.
^ there’s a lot in there that I’m fuzzy about, maybe processes wake up and only know how to access local chubby, but that cell has endpoint info for the global one? I don’t think any part of this process used dns; service discovery (including how to discover the service discovery service) was done through chubby.
That data absolutely does show an increase in deaths, it's right there in the table. It fails to find a statistically significant increase in deaths. The most plausible explanation for that is that the study is underpowered because of the sample size, not that some mystical effect increased injuries without increasing deaths.
50% proof wouldn't be 25% ABV?
Why would you have a distributed lock service that (if I read right) has multiple redundant processes that can tolerate failures... and then require clients to tolerate outages? Isn't the purpose of this kind of architecture so that each client doesn't have to deal with outages?
Okay, keep deleting bug tickets.
Second, a global system being always available definitely doesn't mean it is always available everywhere. A single datacenter or even a larger region will experience both outages and network splits. It means that whatever you design on top of the super-available global system will have to deal with the global system being unavailable anyway.
TLDR is that the clients will have to tolerate outages (or at least frequent cut offs from the "global" state") anyway so it's better not to give them false promises.
Here in Europe we use hectopascals for pressure, as does pretty much everywhere else. It’s important to have a magnetic bearing in case your glass dies and you’re reliant on a paper map and compass, if you didn’t plan with magnetic bearings you’d be screwed if this happened in an area of high magnetic variation.
There are other fail safe methods of course all the way up to TCAS, but it’s not great for an oceanic flight to be outside of the system.
if -- well-defined case
else
    Scream
    while true do
        Sleep(forever)
Same in a switch: the default case.
Basically, for every known unknown it's better to halt and let humans drive the fragile machine back into safe parameters, or expand the program.
PS: Yes, the else: you know what the else is, it's the set of !(well-defined conditions). And it's ever-changing, if the well-defined if condition changes.
Another solution that is very foreign to us in sweng, but is common practice in, say, aviation, is to have that fallback plan in a big thick book, and to have a light that says "Oh it's time to use the fallback plan", rather than require users to diagnose the issue and remember the fallback.
This was one of the key ideas in the design of critical systems*: Instead of automating the execution of a big branching plan, it is often preferable to automate just the detection of the next desirable state, then let the users execute the transition. This is because, if there is time, it allows all users to be fully cognizant of the inner state of the system and the reasons for that state, in case they need to take over.
The worst of both worlds is to automate yourself into a corner, gunk everything up, and then require the user to come in and do a back-breaking cleanup just to get to the point where they can diagnose this. My factorio experiences mirror this last case perfectly.
* "Joint Cognitive Systems" - Hollnagel & Woods
Please be careful about removing the word 'excess' here. The word 'excess' is important, as it implies statistical significance (and is commonly understood to mean that - https://en.wikipedia.org/wiki/Excess_mortality).
I didn't argue that there was no change in the number of deaths, and I did not say that the table does not show any change in the number of deaths.
If your contention is that the sample size is too small, we can actually look at the full population data: https://en.wikipedia.org/wiki/Motor_vehicle_fatality_rate_in...
Notably, although 2002 had a higher number of fatalities, the number of miles traveled by road also increased. However, it represents a continuation of a growing trend since 1980 which continued until 2007, rather than being an exceptional increase in distance travelled.
Also, while 2002 was the worst year since 1990 for total fatalities, 2005 was worse.
Fatalities per 100 000 population in 2002 was 14.93, which was around a 1% worsening from the previous year. But 2002 does not really stand out, similar worsenings happened in 1988, 1993, 1994, 1995, 2005, 2012, 2015, and 2016 (to varying degrees).
One other observation is that in 2000, there was a population increase of 10 million (from 272 million to 282 million), while other years on either side are pretty consistently around 3 million. I'm not sure why this is the case, but if there's a change in the denominator in the previous year, this is also interesting if we're making a comparison (e.g. maybe the birth rate was much higher due to parents wanting 'millennial babies', none of whom are driving and so less likely to be killed in a crash; again, just a random thought, not a real argument from my side that I would try to defend).
The reason 'statistical significance' is important is because it allows us to look at it and say 'is there a specific reason for this, or is this within the bounds of what we would expect to see given normal variation?'. The data don't support a conclusion that there's anything special about 2001/2 that caused the variation.
Only partway into the article did it dawn on me that "nm" could stand for something else, and I guessed it was "nautical miles". Live and learn...
Still, it turned out to be an interesting read)
https://en.wikipedia.org/wiki/Exception_that_proves_the_rule
Seems like both interpretations are used widely.
"As designed" here sounds like a big PR move to hide the fact that they let an uncaught exception crash the entire system ...
How about : don't trust your inputs guys ?
Except that I've spent a good amount of time fixing bugs originally marked as `won't fix` because they actually became uh "will fix" (a decade later; lol).
> Put another way: you're working on a new version of your product. There are 900 issues in the tracker. Is this an urgent emergency where you need to shut down feature work and stabilize?
Do you not prioritize your bugs?
If the tracker is full of low-priority bugs then it doesn't block the release. One thing we do: even if the bug would be high priority, if it's not new (as in, it occurs in older releases too) it doesn't block the next release (by default).
> But wait, of those 900 issues are there any that should block the release? Now you have 900 to go through and determine. And unless you won't fix some of them you'll have the same thing in a few months.
You should only need to triage the bug once. It should be the same amount of work to triage a bug into low-priority as it is to mark it as `won't fix`. With (again), the big difference between that if a user searches for the bug they can find it and ideally keep updating the original bug instead of making a dozen new ones that need to be de-duplicated and triaged which is _more work_ not _less work_ for triagers.
> Work planning, triage, and other team tasks are not magic and voodoo, but on my experience the same engineers who object to the idea of "won't fix" are the ones who want to just code all day and never have to deal with the impact of a huge messy issue database on team and product.
If your idea of "ready for release" is zero bugs filed, then that's something you're going to want to change. All software gets released with bugs; often known bugs.
I will concede that if you "stop the count" or "stop testing" then yeah you'll have no issues reported. Doesn't make it the truth.
When the foundation of a technology stack has a failure, there are two different axis of failure.
1. How well do things keep working without the root service? Does every service that can be provided without it still keep going?
2. How automatically does the system recover when the root service is restored? Do you need to bring down the entire system and restore it in a precise order of dependencies?
It's nice if your system can tolerate the missing service and keep chugging along, but it is essential that your system not deadlock on the root service disappearing and stay deadlocked after the service is restored. At best, that turns a downtime of minutes into a downtime of hours, as you carefully turn down every service and bring them back up in a carefully prescribed order. At worst, you discover that your system that hasn't gone down in three years has acquired circular dependencies among its services, and you need to devise new fixes and work-arounds to allow it to be brought back up at all.
Having ambiguous names can likewise lead to disaster, as seen here, even if this incident had only mild consequences. (Having worked on place name ambiguity academically, I met people who flew to the wrong country due to city name ambiguity and more.)
At least artificial technical names/labels should be globally unambiguous.
EDIT: I mean the point of abbreviations is to facilitate communication. However with the world wide web connecting multiple countries, languages and fields of endeavour there are simply too many (for example) three letter acronyms in use. There are too many sources of ambiguity and confusion. Better to embrace long-form writing.
I'm sure there'd be a better way to handle this, but it sounds to me like the system failed in a graceful way and acted as specified.
Which shows that sometimes, remote isn't a viable option. If you have very critical infrastructure, it's advisable to have people physically very close to the data center so that they can access the servers if all other options fail. That's valid for aviation as well as for health care, banks, etc. Remote staff just isn't enough in these situations.
That's quite a DoS vulnerability...
AIUI, most VORs (and in the past, NDBs) are located at airports. And there is an airport at Deauville (DOL/LFRG):
* https://en.wikipedia.org/wiki/Deauville–Normandie_Airport
But the beacon is elsewhere, apparently in some random(?) field:
* https://www.google.com/maps/place/49°18'38.0%22N+0°18'45.0%2...
* This is also where we get terms like bulletproof - in the early days of firearms people wanted armor that would stop bullets from the relatively weak weapons, so armor smiths would shoot their work to prove them against bullets, and those that passed the test were bullet proof. Likewise alcohol proof rating comes from a test used to prove alcohol in the 1500s.
The unit of angular momentum is kg.m^2.s^-1, you're thinking of torque. Although even then we distinguish the Newton meter (Nm) from the Joule (J) even if they have the same dimensionality.
Maybe it's years of reading The Old New Thing and similar, maybe it's a career spent supporting "enterprise" software, but my personal experience is that fixing old bugs causing new bugs happens occasionally, but far more often it's that fixing old bugs often reveals many more old bugs that always existed but were never previously triggered because the software was "bug compatible" with the host OS, assumptions were made that because old versions never went outside of a certain range no newer versions ever would, and/or software just straight up tinkered with internal structures it never should have been touching which were legitimately changed.
Over my career I have chased down dozens of compatibility issues between software packages my clients used and new versions of their respective operating systems. Literally 100% of those, in the end, were the software vendor doing something that was not only wrong for the new OS but was well documented as wrong for multiple previous releases. A lot of blatant wrongness was unfortunately tolerated for far too long by far too many operating systems, browsers, and other software platforms.
Windows Vista came out in 2006 and every single thing that triggered a UAC prompt was a thing that normal user-level applications were NEVER supposed to be doing on a NT system and for the most part shouldn't have been doing on a 9x system either. As recently as 2022 I have had a software vendor (I forget the name but it was a trucking load board app) tell me that I needed to disable UAC during installs and upgrades for their software to work properly. In reality, I just needed to mount the appropriate network drive from an admin command prompt so the admin session saw it the same way as the user session. I had been telling the vendor the actual solution for years, but they refused to acknowledge it and fix their installer. That client got bought out so I haven't seen how it works in 2024 but I'd be shocked if anything had changed. I have multiple other clients using a popular dental software package where the vendor (famous for suing security researchers) still insists that everyone needs local admin to run it properly. Obviously I'm not an idiot and they have NEVER had local admin in decades of me supporting this package but the vendor's support still gets annoyed about it half the time we report problems.
As you might guess, I am not particularly favorable on Postel's Law w/r/t anything "big picture". I don't necessarily want XHTML style "a single missing close tag means the entire document is invalid" but I also don't want bad data or bad software to persist without everyone being aware of its badness. There is a middle ground where warnings are issued that make it clear that something is wrong and who's at fault without preventing the rest of the system from working. Call out the broken software aggressively.
tl;dr: If software B depends on a bug or unenforced boundary in software A, and software A fixing that bug or enforcing that boundary causes software B to stop working, that is 100% software B's problem and software A should in no way ever be expected to care about it. Place the blame where it belongs, software B was broken from the beginning we just hadn't been able to notice it yet.
I did not say that injuries did not go up. I did not say that the rate of deaths did not go up. I did not say that the rate of deaths did not “show in the stats” (unless by that you mean “was not statistically significant”).
I don’t need a model for the injuries question because that isn’t the point we’re arguing, but I might suggest something like “after 9/11, people took out more comprehensive health insurance policies, and so made claims for smaller injuries than they would have in previous years”.
A suitable model for explaining excess deaths might be something like “after 9/11, people chose to drive instead of fly, and driving is more dangerous per mile than flying is, resulting in excess deaths”. I’m not sure if that is your exact model, but it’s typically what people mean, happy for you to correct.
The problem with that model is that there’s no statistically significant increase in miles driven either. I can’t think of a model which would explain a higher fatality/injury rate per mile driven.
Out of interest, what would be your model for explaining why there were more injuries per mile?
If you found a single story of someone deciding to drive instead of take a plane in October 2001 because of 9/11, and that person died in a car crash, would that be enough for you to be satisfied that I am wrong?
> Out of interest, what would be your model for explaining why there were more injuries per mile?
Well, is there a statistically significant difference in the injuries per mile? Or even a difference at all? That the difference in injuries was statistically different and the difference in miles driven wasn't does not imply that the former changed by a larger proportion than the latter.
Pretty much everyone including all of the papers we've talked about assumes there was an increase in driving. Do you actually think driving didn't increase? Or is this just another area where these concepts of "statistically significant" and "excess" are obscuring things rather than enlightening us?
> If you found a single story of someone deciding to drive instead of take a plane in October 2001 because of 9/11, and that person died in a car crash, would that be enough for you to be satisfied that I am wrong?
I'm interested in knowing whether deaths actually increased and by how much; for me statistical significance or not is a means to an end, the goal is understanding the world. If we believe that driving did increase, then I don't think it's a reasonable null hypothesis to say that deaths did not increase, given what we know about the dangers of driving. Yes, it's conceivable that somehow driving safety increased by exactly the right amount to offset the increase in driving - but if that was my theory, I would want to actually have that theory! I can't fathom being satisfied with the idea that injuries somehow increased without increasing deaths and incurious not only about the mechanism, but about whether there really was a difference or not.
All of these things - miles driven, injuries, deaths - should be closely correlated. If there's evidence that that correlation actually comes apart here, I'm interested. If the correlations hold up, but the vagaries of the statistics are such that the changes in one or two of them were statistically significant and the other wasn't, meh - that happens all the time and doesn't really mean anything.
To respond to your question (admittedly not to answer directly), nationally there was indeed an increase in miles driven, an additional 59 billion miles in 2002 compared to 2001, and indeed there was an increase in deaths. I would also expect an increase in injuries as well.
Looking at this in isolation, you can say “oh so because of 9/11, people drove 59 billion more miles which resulted in more deaths and injuries”, but in my opinion real question if you want to understand the world better is “how many more miles would folks have driven in 2002 compared to 2001 in case 9/11 never happened”.
We obviously can’t know that, but we can look at data from other years. For example, from 1999 to 2000, the increase in miles driven was 56 billion, from 2000 to 2001, the increase was 50 billion and from 2002 to 2003 the increase was 34 billion, from 2003-2004 the increase was 75 billion.
Miles driven, injuries, deaths, are indeed all closely correlated. But so is population size, the price of oil, and hundreds of other factors. If your question is “did more people die on the roads in 2002 than in 2001”, the answer is yes. Again, I assume that the same is also true of injuries although I can’t support that with data.
That wasn’t OP’s assertion though, OP’s assertion was that closing down airspace does not lead to zero excess deaths. My argument is that the statistics do not support that conclusion, and that the additional deaths in 2002 cannot rigorously be shown even to be unusually high, let alone caused by 9/11.
The way this is supposed to work is that downstream systems should accept valid flight plans.
I would say that it’s not the upstream system’s responsibility to reject valid flight plans because of implementation details on a downstream system.
Specifically you have:
torque = force * distance
energy = force * distance
The only difference being that in the former the force is perpendicular to the distance, and in the latter it's in line with the distance.
A vector based system could distinguish the two, but you don't always want to deal with vectors in your computations. (And I'm fairly sure there are problems where even using vectors ain't enough to avoid this problem.)
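A small sketch of the vector version (plain 3-vectors, invented for illustration): work is a dot product and torque a cross product, so the two can't be conflated even though their units look alike.

    #[derive(Debug, Clone, Copy)]
    struct Vec3 { x: f64, y: f64, z: f64 }

    // Work (a scalar, joules): force dotted with displacement.
    fn work(force: Vec3, displacement: Vec3) -> f64 {
        force.x * displacement.x + force.y * displacement.y + force.z * displacement.z
    }

    // Torque (a vector, newton-metres): lever arm crossed with force.
    fn torque(arm: Vec3, force: Vec3) -> Vec3 {
        Vec3 {
            x: arm.y * force.z - arm.z * force.y,
            y: arm.z * force.x - arm.x * force.z,
            z: arm.x * force.y - arm.y * force.x,
        }
    }

    fn main() {
        let f = Vec3 { x: 0.0, y: 10.0, z: 0.0 };
        let r = Vec3 { x: 2.0, y: 0.0, z: 0.0 };
        // Force perpendicular to the arm: nonzero torque, zero dot product.
        println!("dot  = {} (work-like)", work(f, r));
        println!("cross = {:?} (torque-like)", torque(r, f));
    }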
What we can show rigorously and directly is a tiny subset of what we know. If your standard for saying that event x caused deaths is that we have a statistically significant direct correlation between event x and excess deaths that year, you're going to find most things "don't cause deaths". Practically every dangerous food contaminant is dangerous on the basis that it causes an increase in condition x, and condition x is known to be deadly, not because we can show directly that people died from eating x. Hell, even something like mass shootings probably aren't enough deaths to show up directly in death numbers for the year.
I think it's reasonable to say that something we reasonably believe causes deaths that would not have occurred otherwise causes excess deaths. If you actually think the causal chain breaks down - that 9/11 didn't actually lead to fewer people flying, or that actually didn't lead to more people driving, or that extra driving didn't actually lead to more deaths - then that's worth discussing. But I don't see any value in applying an unreasonably high standard of statistical proof when our best available model of the world suggests there was an increase in deaths and it would actually be far more surprising (and warrant more study) if there wasn't such an increase.
I work with great QAs all day, and if one of them heard that there are duplicate area codes, there would be a bunch of test cases appearing with all the possible combinations
It isn't.
> Hell, even something like mass shootings probably aren't enough deaths to show up directly in death numbers for the year.
Yes, there's (probably) no statistically significant link between mass shootings and excess deaths (maybe with the exception of school shootings and excess deaths in the population of school children). But you can directly link 'a shooting' and 'a death', so you don't need to look at statistics to work out if it's true that shootings cause deaths. Maybe if mass shootings started to show up in excess deaths numbers, we'd see a different approach to gun ownership, but that's a separate discussion. You can't do the same with 'closing airspace' and 'deaths on the road'.
When you're looking at a big event like this, there's a lot of other things that can happen. People could be too scared to take trips they might otherwise have taken (meaning a reduction in mileage as folks are not driving to the airport), or the opposite, 9/11 might have inspired people to visit family that they otherwise might not have visited (meaning in increase in mileage). Between 2000 and 2003, the price of gas went down, which might have encouraged people to drive more in general, or to choose to drive rather than fly for financial reasons (although if you wanted to mark that down as 'caused by 9/11' that's probably an argument you could win). You can throw ideas out all day long. The way we validate which ideas have legs is by looking at statistical significance.
Here's some more numbers for you, in 2000, there were 692 billion revenue passenger miles flown in the US. In 2002, it was 642 billion. So we can roughly say that there were 50 billion fewer miles flown in 2002. But the actual number of miles driven in 2002 was 100 billion higher than in 2000 (and note, this is vehicle miles, not passenger miles, whereas the airline numbers are counting passengers). So clearly something else is at play, you can't (only) attribute the increase in driving to people driving instead of flying.
> If you actually think the causal chain breaks down - that 9/11 didn't actually lead to fewer people flying, or that actually didn't lead to more people driving, or that extra driving didn't actually lead to more deaths - then that's worth discussing
I believe that the causal chain does exist but it's weakened at every step. Yes, I think 9/11 led to fewer people flying, but I think that only a proportion of those journeys were substituted for driving, and some smallish percentage of that is further offset by fewer journeys to and from the airport. I think that the extra driving probably did lead to a small number of additional deaths, but again this question of 'is one additional death enough for you to think I'm wrong' comes back.
If I throw aside all scientific ways of looking at it, my belief is that in terms of direct causation, probably, in the 12 months following 9/11, somewhere between 50 and 500 people died on the roads who would not have died on the roads if 9/11 had not happened. But a lot of those were travelling when the airspace was not closed.
If we look at the number of people who died because they made a trip by car that they would have otherwise made by plane but couldn't because US airspace was closed (i.e. on 9/11 itself and during the ground stop on the 12th), you're looking at what I believe to be a very, very small number of people, maybe even zero.