Most active commenters

pbhjpbhj(3)

Popular/hot comments

>>42165540 #
>>42164502 #

←back to thread

Bpftune uses BPF to auto-tune Linux systems

(github.com)

Show context

gausswho ◴[17 Nov 24 14:33 UTC] No.42164371[source]▶

>>42163597 (OP) #

With this tool I am wary that I'll encounter system issues that are dramatically more difficult to diagnose and troubleshoot because I'll have drifted from a standard distro configuration. And in ways I'm unaware of. Is this a reasonable hesitation?

replies(6): >>42164481 #>>42164533 #>>42164535 #>>42164760 #>>42164990 #>>42168400 #

1. pbhjpbhj ◴[17 Nov 24 14:54 UTC] No.42164481[source]▶

>>42164371 #

>"bpftune logs to syslog so /var/log/messages will contain details of any tuning carried out." (from OP GitHub readme)

The rmem example seems to allay fears that it will make changes one can't reverse.

replies(1): >>42164502 #

2. admax88qqq ◴[17 Nov 24 14:57 UTC] No.42164502[source]▶

>>42164481 (TP) #

It’s not a questions of being able to reverse. It’s a question of being able to diagnose that one of these changes even was the problem and if so which one.

replies(3): >>42165004 #>>42166373 #>>42168709 #

3. nehal3m ◴[17 Nov 24 16:25 UTC] No.42165004[source]▶

>>42164502 #

If they can be reversed individually you can simply deduce by rolling back changes one by one, no?

replies(2): >>42165128 #>>42165540 #

4. jstanley ◴[17 Nov 24 16:45 UTC] No.42165128{3}[source]▶

>>42165004 #

Only if you already suspect that this tool caused the problem.

5. spenczar5 ◴[17 Nov 24 17:40 UTC] No.42165540{3}[source]▶

>>42165004 #

Suppose you run a fleet of a thousand machines. They all autotune. They are, lets say, serving cached video, or something.

You notice that your aggregate error rate been drifting upwards since using bpftune. It turns out, in reality, there is some complex interaction between the tuning and your routers, or your TOR switches, or whatever - there is feedback that causes oscillations in a tuned value, swinging between too high and too low.

Can you see how this is not a matter of simple deduction and rollbacks?

This scenario is plausible. Autotuning generally has issues with feedback, since the overall system lacks control theoretic structure. And the premise here is that you use this to tune a large number of machines where individual admin is infeasible.

replies(5): >>42166437 #>>42166446 #>>42166449 #>>42167131 #>>42167792 #

6. pbhjpbhj ◴[17 Nov 24 19:28 UTC] No.42166373[source]▶

>>42164502 #

I focused primarily on guesswho's "in ways I am unaware of".

Your issue appears to be true for any system change. Although, risk will of course vary.

7. pbhjpbhj ◴[17 Nov 24 19:35 UTC] No.42166437{4}[source]▶

>>42165540 #

>not only can we observe the system and tune appropriately, we can also observe the effect of that tuning and re-tune if necessary. //

Does sound like a potential way to implement literal chaos.

Surely it's like anything else, you do pre-release testing and balance the benefits for you against the risks?

8. Modified3019 ◴[17 Nov 24 19:37 UTC] No.42166446{4}[source]▶

>>42165540 #

Sounds like you have your answer of “don’t use it” then.

9. pstuart ◴[17 Nov 24 19:37 UTC] No.42166449{4}[source]▶

>>42165540 #

In that scenario you could run it on a couple servers, compare and contrast, and then apply globally via whatever management tool you use.

10. KennyBlanken ◴[17 Nov 24 20:52 UTC] No.42167131{4}[source]▶

>>42165540 #

Presumably one would use autotune to find optimized parameters, and then roll those out via change control, either one parameter at a time, or a mix of parameters across the systems.

Alternatively: if you have a fleet of thousands of machines you can very easily do a binary search with them to a)establish the problem with the auto-tuner and then b)which of the changes it settled on are causing your problems.

I get the impression you've never actually managed a "fleet" of systems, because these techniques would have immediately occurred to you.

replies(1): >>42167272 #

11. spenczar5 ◴[17 Nov 24 21:09 UTC] No.42167272{5}[source]▶

>>42167131 #

Certainly when we managed Twitch’s ~10,000 boxes of video servers, neither of the tasks you describe would have been simple. We underinvested in tools, for sure. Even so, I don’t think you can really argue that dynamically changing configs like this are going to make life easier!

12. toast0 ◴[17 Nov 24 22:10 UTC] No.42167792{4}[source]▶

>>42165540 #

When you have a thousand machines, you can usually get feedback pretty quick, in my experience.

Run the tune on one machine. Looks good? Put it on ten. Looks good? Put it on one hundred. Looks good? Put it on everyone.

Find an issue a week later, and want to dig into it? Run 100 machines back on the old tune, and 100 machines with half the difference. See what happens.

13. yourapostasy ◴[18 Nov 24 00:51 UTC] No.42168709[source]▶

>>42164502 #

Record changes in git and then git bisect issues, maybe?

Without change capture, solid regression testing, or observability, it seems difficult to manage these changes. I’d like to how others are managing these kinds of changes to readily troubleshoot them, without lots of regression testing or observability, if anyone has successes to share.

↑