
    156 points ahlCVA | 46 comments
    1. andutu ◴[] No.45305324[source]
    Pretty cool, sounds similar to what Barrelfish OS enabled (https://barrelfish.org/).
    replies(1): >>45310875 #
    2. perching_aix ◴[] No.45305508[source]
    Does this mean resiliency against kernel panics?
    replies(1): >>45307257 #
    3. rwmj ◴[] No.45305550[source]
    Sounds similar to CoLinux where you could run a "cooperative Linux" alongside Windows http://www.colinux.org/
    replies(2): >>45307844 #>>45309805 #
    4. vaastav ◴[] No.45305630[source]
    How is this different from/similar to Barrelfish?
    replies(1): >>45306011 #
    5. zokier ◴[] No.45305951[source]
    Interestingly the author has a startup revolving around this technology. Their webpage has some info: https://multikernel.io/
    replies(2): >>45310008 #>>45310423 #
    6. exe34 ◴[] No.45306011[source]
    mainline vs abandoned.
    7. 9cb14c1ec0 ◴[] No.45306246[source]
    It would be interesting to see a detailed security assessment of this. Would it provide security improvements over docker?
    replies(2): >>45307627 #>>45310967 #
    8. messe ◴[] No.45306286[source]
    Reminds me of exokernel architectures[0.5][1.5][2.5]. How is non-CPU resource multiplexing handled, or planned to be handled?

    [0.5]: https://en.wikipedia.org/wiki/Exokernel

    [1.5]: https://wiki.osdev.org/Exokernel

    [2.5]: "Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration." — Stan Kelly-Bootle

    9. duendefm ◴[] No.45306817[source]
    Would this allow running both Linux and BSD kernels?
    replies(1): >>45307653 #
    10. ch_123 ◴[] No.45307041[source]
    Reminds me of OpenVMS Galaxy on DEC Alpha systems, which allowed multiple instances of the OS to run side by side on the same hardware without virtualization.

    https://www.digiater.nl/openvms/doc/alpha-v8.3/83final/aa_re...

    replies(1): >>45308023 #
    11. sedatk ◴[] No.45307257[source]
    > - Improved fault isolation between different workloads

    Yes.

    replies(1): >>45307371 #
    12. ATechGuy ◴[] No.45307371{3}[source]
    That's what the author is claiming. Practically, VM-level strong fault isolation cannot be achieved without isolation support from the hardware aka virtualization.
    replies(1): >>45307618 #
    13. eqvinox ◴[] No.45307618{4}[source]
    Hardware without something like SR-IOV is straight up going to be unshareable for the foreseeable future; things like ring buffers would need a whole bunch of coordination between kernels to share. SR-IOV (or equivalent) makes it workable, an IOMMU (or equivalent) then provides isolation.
    replies(1): >>45308047 #
    14. eqvinox ◴[] No.45307627[source]
    Docker is the wrong thing to compare against, especially considering it is an application and not a technology; the technology would be containerization. This competes against hardware virtualization support, if anything.
    15. tremon ◴[] No.45307653[source]
    It should be possible in theory, as long as both use the same communication interface. In practice, I think getting it to work on just one kernel is already a huge amount of work.
    replies(1): >>45309722 #
    16. tremon ◴[] No.45307792[source]
    "while sharing the underlying hardware resources"? At the risk of sounding too positive, my guess is that hell will freeze over before that will work reliably. Alternating access between the running kernels is probably the "easy" part (DMA and command queues solve a lot of this for free), but I'm thinking more of all the hardware that relies on state-keeping and serialization in the driver. There's no way that e.g. the average usb or bluetooth vendor has "multiple interleaved command sequences" in their test setup.

    I think Linux will have to move to a microkernel architecture before this can work. Once you have separate "processes" for hardware drivers, running two userlands side-by-side should be a piece of cake (at least compared to the earlier task of converting the rest of the kernel).

    Will be interesting to see where this goes. I like the idea, but if I were to go in that direction, I would choose something like a Genode kernel to supervise multiple Linux kernels.

    replies(2): >>45307924 #>>45307985 #
    17. brcmthrowaway ◴[] No.45307844[source]
    This was underrated!
    18. elteto ◴[] No.45307924[source]
    You just don't share certain devices, like Bluetooth. The "main" kernel will probably own the boot process and manage some devices exclusively. I think the real advantage is running certain applications isolated within a CPU subset, protected/contained behind a dedicated kernel. You don't have the slowdown of VMs, or have to fight against the isolation sieve that is docker.
    replies(1): >>45309251 #
    19. vlovich123 ◴[] No.45307985[source]
    Is there anything that says that multiple kernels will be responsible for owning the drivers for HW? It could be that one kernel owns the hardware while the rest speak to the main kernel using a communication channel. That's also presumably why KHO is a thing because you have to hand over when shutting down the kernel responsible for managing the driver.
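
    Roughly, such a channel could look something like the sketch below. Everything in it is made up (the names, the layout, the doorbell), just to illustrate the "one kernel owns the device, the others forward requests" shape; it is not the proposal's actual interface.

      /* Made-up sketch: a secondary kernel forwards block I/O to the kernel
       * that owns the device, over a ring in shared memory. All names and
       * the layout are illustrative; nothing here is from the actual patches. */
      #include <stdint.h>

      #define MK_RING_SLOTS 256

      struct mk_request {
          uint64_t dev_id;     /* device as enumerated by the owning kernel */
          uint64_t lba;        /* starting sector */
          uint32_t nsectors;   /* transfer length */
          uint32_t write;      /* 0 = read, 1 = write */
          uint64_t buf_phys;   /* physical address of the data buffer */
      };

      struct mk_ring {
          volatile uint32_t head;              /* producer (client kernel) */
          volatile uint32_t tail;              /* consumer (owning kernel) */
          struct mk_request req[MK_RING_SLOTS];
      };

      /* Hypothetical cross-kernel doorbell, e.g. an IPI to the owner's CPU. */
      static void mk_kick_owner(void) { /* platform-specific */ }

      /* Client side: enqueue a request and notify the owning kernel. */
      static int mk_submit(struct mk_ring *ring, const struct mk_request *r)
      {
          uint32_t head = ring->head;

          if ((head + 1) % MK_RING_SLOTS == ring->tail)
              return -1;                      /* ring is full */
          ring->req[head] = *r;
          __sync_synchronize();               /* publish the request first */
          ring->head = (head + 1) % MK_RING_SLOTS;
          mk_kick_owner();
          return 0;
      }
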
    20. IAmLiterallyAB ◴[] No.45308000[source]
    What's preventing a compromised kernel on one core from hijacking the other cores? This doesn't seem like much of a security boundary
    replies(2): >>45308133 #>>45308489 #
    21. skissane ◴[] No.45308023[source]
    IBM mainframes and Power servers have “partitions” (LPARs). My understanding is that they are actually software-based virtualisation, but the hypervisor lives in the system firmware, not the OS. And some of the firmware is loaded from disk at boot-up, making it even closer to something like Xen; labelling it as “hardware” rather than “software” is more about marketing (and which internal teams own it within IBM) than technical reality. Their mainframe partitioning system, PR/SM, apparently began life as a stripped-down version of VM/CMS, although I’m not sure how close the relationship between PR/SM and z/VM is in current releases.

    This sounds like running multiple kernels in a shared security domain, which reduces the performance cost of transitions and sharing, but you lose the reliability and security advantages that a proper VM gives you. It reminds me of coLinux (essentially, a Linux kernel as a Windows NT device driver)

    Does anyone have more details on how OpenVMS Galaxy was actually implemented? I believe it was available for both Alpha and Itanium, but not yet x86-64 (and probably never…)

    22. skissane ◴[] No.45308047{5}[source]
    You could have a “nanokernel” which owns the ring buffers and the other kernels act as its clients… or go for a “primary kernel” which owns the ring buffers and exposes an API the other kernels could call. If different devices have different ring buffers, the “primary kernel” could be different for each one.
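
    For instance, a per-device ownership table could express the "different primary kernel per device" idea. This is purely illustrative, not anything from the patches:

      /* Purely illustrative: each device's ring buffers are owned by exactly
       * one kernel; the others look up the owner and route requests to it. */
      #include <stdint.h>

      struct mk_device_owner {
          uint16_t pci_bdf;    /* bus/device/function, e.g. 0x0300 = 03:00.0 */
          uint8_t  owner_id;   /* which kernel services this device */
      };

      static const struct mk_device_owner owner_table[] = {
          { 0x0300, 0 },       /* NIC  -> kernel 0 */
          { 0x0400, 1 },       /* NVMe -> kernel 1 */
      };

      static int mk_owner_of(uint16_t bdf)
      {
          for (unsigned i = 0; i < sizeof(owner_table) / sizeof(owner_table[0]); i++)
              if (owner_table[i].pci_bdf == bdf)
                  return owner_table[i].owner_id;
          return -1;           /* unknown device: fall back to a default owner */
      }
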
    23. viraptor ◴[] No.45308133[source]
    Nothing prevents it if you achieve code execution. But where it helps is in scenarios like syscall / memory-mapping exploits, where a user process can only affect resources attached to its current kernel. For example https://dirtycow.ninja/ would have a limited scope.
    24. loeg ◴[] No.45308483[source]
    Insane idea, but very cool.
    25. ◴[] No.45308489[source]
    26. yalogin ◴[] No.45308846[source]
    It’s not clear to me: do these kernels run directly on the hardware? If so, how are they able to talk to each other, DMA? That could open up some security flaws; hopefully they thought that through.
    replies(1): >>45309225 #
    27. agentkilo ◴[] No.45309225[source]
    IIUC, yes, all the kernels involved run directly on the hardware, in a "cooperative" way, i.e. they must agree not to touch each other's memory regions.

    I think the architecture assumes all loaded kernels are trusted, and imposes no isolation other than having them running on different CPUs.

    Given the (relative) simplicity of the PoC, it could be really performant.
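
    As a sketch of what such a cooperative agreement might amount to, assuming a completely static split decided at boot (made-up layout, not the actual PoC):

      /* Made-up example of a static partition every kernel is booted with;
       * each kernel only maps and allocates from its own slice. */
      struct mk_partition {
          unsigned int  first_cpu, ncpus;      /* CPUs owned by this kernel */
          unsigned long mem_base, mem_size;    /* physical RAM owned by it  */
      };

      static const struct mk_partition layout[] = {
          { 0, 4, 0x000100000UL, 0x0fff00000UL },  /* kernel 0: CPUs 0-3, ~4 GiB */
          { 4, 4, 0x100000000UL, 0x100000000UL },  /* kernel 1: CPUs 4-7,  4 GiB */
      };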

    replies(2): >>45309255 #>>45309270 #
    28. yjftsjthsd-h ◴[] No.45309251{3}[source]
    That's fine for

      - Enhanced security through kernel-level separation
      - Better resource utilization than traditional VM (KVM, Xen etc.)
    
    but I don't think it works for

      - Improved fault isolation between different workloads
      - Potential zero-down kernel update with KHO (Kernel Hand Over)
    
    since if the "main" kernel crashes or is supposed to get upgraded then you have to hand hardware back to it.
    replies(2): >>45309440 #>>45311469 #
    29. yalogin ◴[] No.45309255{3}[source]
    Wonder what the use cases are. Doesn’t feel like the kernels are hotswappable, so why is it preferred over VMs?
    replies(1): >>45310511 #
    30. yjftsjthsd-h ◴[] No.45309270{3}[source]
    Can't the kernel set up hardware-backed memory maps to partially blind itself to other memory regions? (Only "partially" because even then I expect it could just change the mappings, but it's still a protection against accidental corruption)
    31. raron ◴[] No.45309440{4}[source]
    > since if the "main" kernel crashes or is supposed to get upgraded then you have to hand hardware back to it.

    Isn't that similar to resuming from hibernate-to-disk? Basically all of your peripherals are powered off and so probably cannot keep their state.

    Also, you can actually stop a disk (a member of a RAID device), remove the PCIe-SATA HBA card it is attached to, replace it with a different one, and connect it all back together without any user-space application noticing.

    32. viraptor ◴[] No.45309722{3}[source]
    It's been done with crazier setups already, though: http://www.colinux.org/ win+lin
    33. joseph2024 ◴[] No.45309805[source]
    HP printers are similar. They run Linux on two cores and an RTOS on the other.
    34. sargun ◴[] No.45310008[source]
    The author (Cong Wang) is building all sorts of neat stuff. Recently, they built kernelscript: https://github.com/multikernel/kernelscript -- another DSL for BPF that's much more powerful than the C alternatives, without the complexity of C BPF. Previously, they were at Bytedance, so there's a lot of hope that they understand the complexities of "production".
    35. rurban ◴[] No.45310423[source]
    I see. Even better than Xen, but it needs much more memory than the equivalent KVM instances. And from what I hear, memory is the real constraint for mass hosters, not speed, so I am sceptical. I also don't understand how it handles concurrent writes and the state of shared hardware. Seems like a lot of overhead compared to KVM or Xen.
    36. josemanuel ◴[] No.45310429[source]
    How are IOMMUs managed?
    37. yxhuvud ◴[] No.45310511{4}[source]
    If nothing else, it is a path to making them hotswappable.
    38. pabs3 ◴[] No.45310746[source]
    You also used to be able to get the opposite: one Linux kernel with a unified userspace distributed across a cluster.

    https://sourceforge.net/projects/kerrighed/

    replies(2): >>45311252 #>>45311619 #
    39. intermerda ◴[] No.45310875[source]
    Tim Roscoe gave an interesting Keynote at OSDI '21 titled "It's Time for Operating Systems to Rediscover Hardware" - https://www.youtube.com/watch?v=36myc8wQhLo. He was involved with the Barrelfish project.
    40. esseph ◴[] No.45310967[source]
    If you want some security improvements, move from docker to podman rootless + distroless containers.

    If you need more security/isolation, go to a VM or bare metal.

    41. rwmj ◴[] No.45311252[source]
    That's cool! A similar idea is running a single large VM across multiple hosts. There have been several iterations of that idea, the latest being a presentation at this year's KVM Forum: "GiantVM: A Many-to-one Virtualization System Built Atop the QEMU/KVM Hypervisor" by Songtao Xue, Xiong Tianlei, and Muliang Shou. https://kvm-forum.qemu.org/2025/
    42. samus ◴[] No.45311469{4}[source]
    The old kernel boots the new kernel, possibly in a "passive" mode, performs a few sanity checks of the new instance, hands over control, and finally shuts itself down.
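
    In rough C-flavoured pseudocode, the sequence might look like this; none of these function names are the real KHO interface, they just mirror the steps above:

      /* Rough shape of a hand-over style kernel update; pseudocode only,
       * every function name here is invented. */
      int update_kernel(const char *new_image)
      {
          load_new_kernel(new_image);      /* stage the new kernel on spare CPUs */
          start_passive(new_image);        /* boot it without handing it devices */
          if (!sanity_checks_pass())
              return abort_update();       /* old kernel just keeps running      */
          serialize_device_state();        /* package up driver/hardware state   */
          transfer_ownership();            /* new kernel takes CPUs and devices  */
          shutdown_old_kernel();
          return 0;
      }
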
    43. samus ◴[] No.45311502[source]
    This could open up ways to run Linux as a guest kernel of proper microkernel operating systems to aid with hardware compatibility.
    44. PhilipRoman ◴[] No.45311619[source]
    I wonder if modern NUMA-aware software could take advantage of this, if the Linux APIs report the correct topology.
    45. da-x ◴[] No.45311766[source]
    There are various hardware singletons that need to be managed for this to work properly. This raises many questions.

    Which of the kernels does the PCI enumeration, for instance, and how is it determined which kernel gets ownership of a PCI device? How about ACPI? Serial ports?

    How does this architecture transfer ownership of RAM between kernels, or is it a fixed configuration? How about NUMA awareness? (Likely you would want to partition systems so that RAM stays with the CPUs of the same NUMA node.)

    Looks to me like one kernel would need to have 'hypervisor'-like behavior in order to divvy up resources to the other kernels. I think PVM (https://lwn.net/Articles/963718/) would be a preferred solution in this case, because the existing software stack for managing hypervisor resources can already be reused with it.
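
    To make those questions concrete, the kind of per-kernel grant a resource-owning kernel would have to hand out might contain something like the following. The format is invented, not taken from the patch set:

      /* Invented format for a per-kernel resource grant; illustrative only. */
      #include <stdint.h>

      struct mk_grant {
          unsigned long mem_base, mem_size;  /* RAM slice, ideally one NUMA node */
          unsigned int  first_cpu, ncpus;    /* CPUs from that same node         */
          uint16_t      pci_bdfs[8];         /* devices this kernel may own      */
          int           owns_acpi;           /* exactly one kernel handles ACPI  */
          int           serial_port;         /* -1 if no serial port assigned    */
      };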