
158 points kenjackson | 1 comments
roblabla No.41031699
This is some very poor journalism. The Linux issues are so, so very different from the Windows BSOD issue.

The Red Hat kernel panics were caused by a bug in the kernel's eBPF implementation, likely a regression introduced by a RHEL-specific patch. Blaming CrowdStrike for this is stupid (just like blaming Microsoft for the CrowdStrike BSOD is stupid).

For background, I also work on a product that uses eBPF, and I have had kernel updates cause kernel panics in my eBPF probes.

In my case, the panic happened because the kernel changed an LSM hook interface, adding a new argument in front of the others. When the probe gets loaded, the kernel doesn’t typecheck the arguments, and so doesn’t realise the probe isn’t compatible with the new kernel. When the probe runs, it reads the wrong arguments, shit happens, and you end up with a kernel panic.
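To make the failure mode concrete, here is a toy model of it in plain Python. It is only a sketch: it is not eBPF or kernel code, and the hook layout, argument names, and values are made up for illustration. It shows a probe that reads its arguments by position, a "kernel" that prepends an argument, and the kind of load-time layout check whose absence lets the mismatch through.

```python
# Toy user-space model of the mismatch; this is NOT real eBPF or kernel code.
# Hook names, argument names, and values are all hypothetical.

OLD_HOOK_ARGS = ("path", "mode")          # layout the probe was written against
NEW_HOOK_ARGS = ("cred", "path", "mode")  # newer kernel: argument added in front

VALUES = {"cred": 1000, "path": "/etc/passwd", "mode": 0}


def probe(ctx):
    """Probe built for OLD_HOOK_ARGS: reads its arguments purely by position."""
    path, mode = ctx[0], ctx[1]
    return f"{path}:{mode}"


def call_hook(hook_args, probe_fn):
    """Kernel side: passes a bare tuple, with no names or types attached."""
    ctx = tuple(VALUES[name] for name in hook_args)
    return probe_fn(ctx)


def load_checked(probe_args, hook_args):
    """A load-time check of the kind that was missing: compare the two layouts."""
    if tuple(probe_args) != tuple(hook_args):
        raise TypeError(f"probe expects {probe_args}, hook provides {hook_args}")
    return True


old_result = call_hook(OLD_HOOK_ARGS, probe)  # probe sees what it expects
new_result = call_hook(NEW_HOOK_ARGS, probe)  # probe silently reads shifted slots
```

With the old layout the probe gets the path and mode. With the new layout it treats the credential as the path and the path as the mode, and nothing complains. In the toy that only produces a wrong string; in a real probe the misread slot is typically used as a pointer, which is how you get from "wrong argument" to a panic.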

eBPF probes causing kernel panics are almost always an indication of a kernel bug, not of a bug in the eBPF vendor's code. There are exceptions of course (such as an eBPF program denying access to a resource, causing PID 1 to crash). But they’re very few.

replies(4): >>41031896 #>>41032164 #>>41032610 #>>41034621 #
xyzzy123 No.41034621
It's not clear to me they are so different but maybe I am not "sufficiently smart".

To me this feels like a complicated question - both Linux and Windows organisations are quite good at kernel reliability engineering even though quite different organisational structures and engineering approaches are involved.

Yes "the wrong people were trusted" but I don't see how we can completely solve this with engineering.

replies(1): >>41040545 #
roblabla No.41040545
> It's not clear to me they are so different but maybe I am not "sufficiently smart".

They're different because Linux promises "eBPF programs are safe and cannot crash the kernel", and failed to deliver on that, while Microsoft says "drivers are all-powerful and as such must be written with care", and CrowdStrike did not heed this warning.

> Yes "the wrong people were trusted" but I don't see how we can completely solve this with engineering.

I mean, we could solve the "third party software fucks the kernel up" problem easily with engineering, by providing userspace APIs to do stuff that currently needs kernelspace access. There's no inherent reason security products (or, really, any products) need to live in the kernel, it's just that there are no APIs to do this job, so security products have to go there. If Microsoft provided a good API doing what the custom drivers currently do, most security products would drop their driver in a heartbeat.

For instance, macOS fixed this exact issue a couple of years ago by introducing the Endpoint Security framework, a userspace API that allows watching a bunch of events and deciding whether each one should be allowed or blocked. It's a well-designed API that should obsolete the need for kernelspace access in security products.

replies(1): >>41043586 #
j2bryson No.41043586
So what happened with the Linux bug? Presumably people fixed the OS-side problem straight away?
replies(1): >>41058707 #
1. roblabla No.41058707
kernel-5.14.0-427.13.1.el9_4 broke it. It was released on April 30, 2024, with RHEL 9.4 (this was the RHEL 9.4 release kernel).

According to the comments on https://access.redhat.com/solutions/7068083, Red Hat became aware of the issue on May 3, 2024.

A workaround (configuring CrowdStrike to use the kernel module backend instead of the eBPF backend) was identified on May 9, 2024.

Red Hat then fixed it in kernel-5.14.0-427.18.1.el9_4, on May 23, 2024.
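If you want to check a machine against those two versions, something like the following could work. It is a rough sketch, not an official tool: the helper names are made up, and it assumes every build from the one that broke it up to (but not including) the fixed one is affected, which the timeline above implies but does not prove.

```python
import re

# Versions taken from the timeline above.
BROKEN_FROM = "5.14.0-427.13.1.el9_4"
FIXED_IN = "5.14.0-427.18.1.el9_4"


def release_key(version):
    """Turn e.g. 'kernel-5.14.0-427.13.1.el9_4.x86_64' into a tuple of ints."""
    if version.startswith("kernel-"):
        version = version[len("kernel-"):]
    numeric = version.split(".el")[0]  # keeps '5.14.0-427.13.1'
    return tuple(int(part) for part in re.split(r"[.-]", numeric))


def is_affected(version):
    """Assumption: affected means BROKEN_FROM <= version < FIXED_IN."""
    return release_key(BROKEN_FROM) <= release_key(version) < release_key(FIXED_IN)
```

For example, `is_affected("5.14.0-427.13.1.el9_4.x86_64")` is true and `is_affected("5.14.0-427.18.1.el9_4")` is false. The comparison only looks at the numeric part before `.el`, so it is meant for RHEL 9 style version strings and nothing else.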

So the bug was fixed in ~20 days from the moment it was reported.
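For anyone who wants to check the arithmetic, the intervals can be computed from the four dates above. The variable names are just for illustration.

```python
from datetime import date

# Dates from the timeline above.
released = date(2024, 4, 30)    # broken kernel ships with RHEL 9.4
reported = date(2024, 5, 3)     # Red Hat becomes aware of the issue
workaround = date(2024, 5, 9)   # kernel module backend workaround identified
fixed = date(2024, 5, 23)       # fixed kernel released

days_report_to_fix = (fixed - reported).days
days_report_to_workaround = (workaround - reported).days
days_release_to_fix = (fixed - released).days

print(days_report_to_fix, days_report_to_workaround, days_release_to_fix)
# prints: 20 6 23
```

So it was 20 days from report to fix, 6 days from report to a workaround, and 23 days during which the broken kernel was the newest one available.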

It's unclear whether this issue was caused by a RHEL-specific backport/patch or was also present in mainline kernels.