
158 points kenjackson | 4 comments
roblabla No.41031699
This is some very poor journalism. The Linux issues are so, so very different from the Windows BSOD issue.

The Red Hat kernel panics were caused by a bug in the kernel eBPF implementation, likely a regression introduced by a RHEL-specific patch. Blaming CrowdStrike for this is stupid (just like blaming Microsoft for the CrowdStrike BSOD is stupid).

For background, I also work on a product using eBPF, and have had kernel updates cause kernel panics in my eBPF probes.

In my case, the panic happened because the kernel decided to change an LSM hook interface, adding a new argument in front of the others. When the probe gets loaded, the kernel doesn’t typecheck the arguments, and so doesn’t realise the probe isn’t compatible with the new kernel. When the probe runs, shit happens and you end up with a kernel panic.
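
To make the failure mode above concrete, here is a toy userspace model, not real eBPF or kernel code. It assumes a probe that reads hook arguments by position; every name in it (`probe_old_layout`, `call_new_hook`, `load_with_check`) is hypothetical, and the load-time check is only a sketch of the kind of typecheck that was missing, not something the kernel provides under that name.

```python
# Toy userspace model, NOT real eBPF or kernel code.
# All names here are hypothetical and exist only for illustration.

def probe_old_layout(args):
    """Probe built against the old hook signature: hook(path, mode)."""
    path = args[0]        # assumes slot 0 holds the path
    return path.upper()   # stands in for dereferencing a pointer

def call_old_hook(probe):
    # Old kernel passes (path, mode).
    return probe(("/etc/passwd", 0o644))

def call_new_hook(probe):
    # New kernel inserts an extra argument in front: (flags, path, mode).
    return probe((0x10, "/etc/passwd", 0o644))

def load_with_check(expected_types, hook_types):
    """Hypothetical load-time check: refuse a probe whose expected
    argument types do not match what the hook actually passes."""
    if list(expected_types) != list(hook_types):
        raise TypeError("probe does not match hook signature")
    return True
```

With the old hook the probe works; with the new hook it treats an integer as a path and fails at run time, which is the moral equivalent of the panic. A check like `load_with_check` would reject the probe at load time instead.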

eBPF probes causing kernel panics are almost always an indication of a kernel bug, not a bug in the eBPF vendor's code. There are exceptions of course (such as an eBPF program denying access to a resource and causing pid1 to crash). But they’re very few.

replies(4): >>41031896 >>41032164 >>41032610 >>41034621
1. watt No.41032610
There should not be any software-caused crashes during operation of software. Every NPE that is not caused by a hardware issue is a null pointer not properly handled. Software needs to handle its null checks. Missing a check (or precondition, or validation) is squarely on Microsoft.

> always indication of a kernel bug

but, before that:

> blaming microsoft for the crowdstrike bsod is stupid

And who owns the kernel in Windows land? Microsoft. How is it stupid to blame Microsoft for not making the kernel safe?

replies(3): >>41032783 >>41032802 >>41033041
2. Diggsey No.41032783
Microsoft don't own the kernel in that sense: anyone can write kernel drivers for Windows... While there are some things that the kernel can do to protect against a bad driver, it's not a security boundary, so ultimately bad code can cause crashes.

AIUI, Microsoft actually has good tooling for validating drivers before they are deployed, but it requires that you actually run the validation...

3. mebeim No.41032802
Let me give you an analogy: Volvo is known to manufacture very safe cars. Now let's say I drive a Volvo car with a box of dynamite on the passenger seat. I stop at a red light but hit the brake a bit too hard and the box of dynamite falls and causes an explosion, disintegrating everything in a 20-foot radius. So whose fault was it? Volvo?

> Missing a check (or precondition, or validation) is squarely on Microsoft.

Missing a check for presence of dynamite before allowing me to start the car is squarely on Volvo!

You see how silly that sounds?

Now, back to being serious: MS cannot possibly control and validate everything you decide to install and run on your system, especially if the things you install are kernel drivers. It is simply impossible. If you install a kernel driver developed by a third-party company, and that driver crashes your system because the devs at that company forgot to perform proper validation of data, well... that's on them. Even if MS wanted to, they wouldn't be able to verify the soundness of any arbitrary piece of code that is installed as a driver and runs with kernel-level privileges. That'd require solving the halting problem.

4. roblabla No.41033041
@watt there's a big difference here.

eBPF is a bytecode that the kernel verifies and then interprets or JIT-compiles, with the explicit goal of allowing code to execute at the kernel level in a safe way. Any kernel panic (again, short of pid1 kills) is considered a bug, and could even potentially be exploited to gain capabilities in some cases. Here, the kernel explicitly says "this is safe", so any problem within is a bug in the kernel.

In contrast, a kernel module/driver is just some third-party code that is loaded in the kernel. Here, all bets are off: it is up to the third-party to do their job properly and make sure their code is correct.
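
A rough way to picture the difference, as a toy model only: the "program" format, `CTX_SIZE`, and the function names below are all made up for illustration and have nothing to do with the real eBPF verifier or any real driver loader. The sandboxed path proves the program safe before running it; the module path just runs it.

```python
# Toy model of "verified bytecode" vs "unverified module". Hypothetical names.
CTX_SIZE = 4  # assumption: the context exposes 4 readable slots

def verify(program):
    """Accept only programs made of in-bounds loads."""
    return all(op == "load" and 0 <= idx < CTX_SIZE for op, idx in program)

def run(program, ctx):
    """Execute loads with no runtime checks, as native code would."""
    return [ctx[idx] for _, idx in program]

def load_sandboxed(program, ctx):
    """eBPF-style path: verify first, refuse anything not proven safe."""
    if not verify(program):
        raise ValueError("program rejected by verifier")
    return run(program, ctx)

def load_module(program, ctx):
    """Module-style path: no verification, correctness is on the author."""
    return run(program, ctx)
```

If a program that passed `verify` still crashed inside `run`, that would be a bug in the verifier or runtime, which is the eBPF situation. A crash in `load_module` is a bug in whoever wrote the program.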

In this case, CrowdStrike explicitly opted into writing a kernel module, and then failed to, as you say, "handle their null check". The segfault wasn't in Windows code, it was in CrowdStrike code that lives in the kernel. CrowdStrike should have handled their null check, failed to, and that leads to a BSOD.

To be clear: the only way Microsoft could make the kernel safer here is by disallowing kernel modules entirely. While there is an argument to be made that this could be a good idea, it is a bit beside the point.