The long road to lazy preemption in the Linux CPU scheduler

1. hamilyon2 ◴[19 Oct 24 12:32 UTC] No.41887445[source]▶

How tightly is the scheduler coupled to the rest of the kernel code?

If one wanted to drastically simplify the scheduler, for example for some scientific application that doesn't care about preemption at all, could it be done in a clean, modular way? And would there be any benefit?

replies(4): >>41887725 #>>41887861 #>>41888151 #>>41888319 #

2. p_l ◴[19 Oct 24 13:42 UTC] No.41887725[source]▶

>>41887445 (TP) #

If you want to run a set of processes with as little preemption as possible, for example in an HPC setting, your most powerful option is to reboot the system with a selection of cores (the exact number will depend on your needs) set as isolated CPUs and manually put your tasks there with taskset. But then you really do need to allocate tasks to CPUs manually; it's trivial to end up with all tasks on the wrong CPU.
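As a rough illustration of the manual placement described above, here is a minimal Python sketch that does for a process roughly what `taskset -c` does, using `os.sched_setaffinity` (Linux only). The helper name `pin_to_cpu` is made up for this example, and it does not isolate anything by itself; it only restricts where the task may run.

```python
import os

def pin_to_cpu(cpu, pid=0):
    """Restrict `pid` (0 = the caller) to a single CPU and return the new mask.

    Hypothetical helper for illustration. The kernel rejects CPUs that do not
    exist or are not permitted, in which case this raises OSError.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)

if __name__ == "__main__":
    # Pick a CPU we are currently allowed to run on; on a system booted with
    # isolated CPUs you would pass one of the isolated CPU numbers instead.
    target = min(os.sched_getaffinity(0))
    print("now restricted to:", pin_to_cpu(target))
```

Because nothing rebalances tasks across isolated CPUs for you, every task needs its own explicit CPU number, which is exactly where the "all tasks on the wrong CPU" mistake comes from.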

The standard way is to set interrupt masks so that interrupts don't go to the "work" CPUs, and to use cpusets so that only a specific cgroup is allowed to execute on a given cpuset.
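A minimal sketch of the cpuset half of that, assuming cgroup v2 mounted at /sys/fs/cgroup, the cpuset controller enabled for the parent group, and root privileges. The function name and the group name "hpc" are hypothetical; `cpuset.cpus` and `cgroup.procs` are the cgroup v2 interface files.

```python
from pathlib import Path

def confine_to_cpuset(cgroup_dir, cpus, pid):
    """Limit a cgroup to `cpus` (a list string such as "2-3") and move `pid` into it.

    Hypothetical helper for illustration; `cgroup_dir` is the directory of the
    group, e.g. /sys/fs/cgroup/hpc on a typical cgroup v2 system (assumption).
    """
    group = Path(cgroup_dir)
    group.mkdir(parents=True, exist_ok=True)
    # CPUs that tasks in this group may run on.
    (group / "cpuset.cpus").write_text(cpus + "\n")
    # Writing a PID here migrates that process into the group.
    (group / "cgroup.procs").write_text(str(pid) + "\n")
    return group

if __name__ == "__main__":
    import os
    # Needs root and an enabled cpuset controller; expect an error otherwise.
    confine_to_cpuset("/sys/fs/cgroup/hpc", "2-3", os.getpid())
```

Steering interrupts away from those CPUs is a separate step and is not shown here.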

3. kevin_thibedeau ◴[19 Oct 24 14:11 UTC] No.41887861[source]▶

>>41887445 (TP) #

I'd just use RT Linux. It has its own basic scheduler, with the kernel scheduler running as the idle task. Real-time tasks get priority over everything else.

4. toast0 ◴[19 Oct 24 14:59 UTC] No.41888151[source]▶

>>41887445 (TP) #

You can get 95% of the way there by running a clean system with nearly no daemons, and your application set up to run with one OS thread per CPU thread, with CPU pinning so they don't move.
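The one-thread-per-CPU layout might look something like the sketch below (Linux only). The function name `run_pinned_workers` is invented for this example. Python threads share the GIL, so this only demonstrates the pinning pattern; a real compute job would use native threads or separate processes.

```python
import os
import threading

def run_pinned_workers(work):
    """Start one thread per allowed CPU, pin each thread to its CPU, collect results.

    Hypothetical helper for illustration. Returns {cpu: (affinity_mask, result)}.
    """
    cpus = sorted(os.sched_getaffinity(0))
    results = {}

    def worker(cpu):
        # On Linux, pid 0 means "the calling thread", so only this worker is pinned.
        os.sched_setaffinity(0, {cpu})
        results[cpu] = (os.sched_getaffinity(0), work(cpu))

    threads = [threading.Thread(target=worker, args=(cpu,)) for cpu in cpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    for cpu, (mask, value) in sorted(run_pinned_workers(lambda c: c * 2).items()):
        print(cpu, mask, value)
```

With exactly one runnable thread per CPU and nothing else competing, each run queue holds a single task, which is why the scheduler has so little left to do.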

Whatever the scheduler does should be pretty low impact, because the runlist will be very short. If your application doesn't do much I/O, you won't get many interrupts either. If you can run a tickless kernel (is that still a thing, or is it normal now?), you might not get any interrupts for long periods.

5. marcosdumay ◴[19 Oct 24 15:20 UTC] No.41888319[source]▶

>>41887445 (TP) #

Last time I looked, it was surprisingly decoupled.

But the reason for drastically simplifying it would be to avoid bugs; there isn't much performance to gain compared to a well-set default one (there are plenty of settings, though). And there haven't been many bugs there. With most naive simplifications you will lose performance, not gain it.

If you are running a non-interactive system, the easiest change to make is to increase the size of the process time quantum.