Faster Argmin on Floats

(algorithmiker.github.io)

19 points return_to_monke | 5 comments | 18 Sep 25 16:20 UTC

TheDudeMan ◴[20 Sep 25 04:36 UTC] No.45310426[source]▶

How fast if you write a for loop and keep track of the index and value of the smallest (possibly treating them as ints)?
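For reference, the loop being asked about might look like the following Rust sketch. The name `argmin_loop` is hypothetical, and the integer comparison assumes every input is non-negative (no -0.0) and not NaN, in which case IEEE-754 bit patterns order the same way as the floats themselves.

```rust
/// Index of the smallest element, found with an explicit loop.
/// Assumption: all inputs are non-negative and not NaN, so comparing the
/// raw bit patterns as u32 gives the same order as comparing the floats.
fn argmin_loop(xs: &[f32]) -> Option<usize> {
    // Returns None for an empty slice.
    let mut best_bits = xs.first()?.to_bits();
    let mut best_idx = 0usize;
    for (i, x) in xs.iter().enumerate().skip(1) {
        let bits = x.to_bits();
        // Strict comparison keeps the first occurrence on ties.
        if bits < best_bits {
            best_bits = bits;
            best_idx = i;
        }
    }
    Some(best_idx)
}
```

Calling `argmin_loop(&[3.0, 1.0, 2.0])` returns the index of `1.0`.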

replies(1): >>45310547 #

1. nine_k ◴[20 Sep 25 05:04 UTC] No.45310547[source]▶

>>45310426 #

I'd hazard a guess that it would be the same, because the compiler would produce a loop out of .iter(), would expose the loop index via .enumerate(), and would keep track of that index in .min_by(). I suppose the lambda would be inlined, maybe even along with the comparisons.
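The iterator chain described above is presumably something like the sketch below. The blog's exact code is not quoted in this thread, so the function name `argmin_iter` and the use of `f32::total_cmp` as the comparator are assumptions made for illustration.

```rust
/// Argmin written as an iterator chain: enumerate attaches the index,
/// min_by picks the pair with the smallest value, map keeps the index.
/// Assumption: f32::total_cmp is used as the comparison function.
fn argmin_iter(xs: &[f32]) -> Option<usize> {
    xs.iter()
        .enumerate()
        .min_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

`Iterator::min_by` returns the first element when several are equally minimal, so ties resolve to the lowest index.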

I wonder whether that could be made faster with AVX instructions; they can find the minimum value among several u32 values, but they do not directly give its index.

replies(3): >>45311141 #>>45311388 #>>45311983 #

2. anonymoushn ◴[20 Sep 25 07:05 UTC] No.45311141[source]▶

>>45310547 (TP) #

you can have some vector registers n_acc, ns, idx_acc, idxs, then you can do

  // (initialize ns and idxs by reading from the array
  //  and adding the appropriate constant to the old value of idxs)
  n_acc = min(n_acc, ns);
  const is_new_min = eq(n_acc, ns);
  idx_acc = blend(idx_acc, idxs, is_new_min);

Edit: I wrote this with min, eq, blend but you can actually use cmpgt, min, blend to avoid having a dependency chain through all three instructions. I am just used to using min, eq, blend because of working on unsigned values that don't have cmpgt
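A portable way to see the cmpgt, min, blend pattern is to let plain arrays stand in for the vector registers, as in the Rust sketch below. This is not real intrinsics code: `LANES`, `argmin_lanes`, and the final horizontal reduction are assumptions added for illustration, and a real SIMD version would keep the indices in u32 lanes rather than `usize`.

```rust
const LANES: usize = 8; // stand-in for one 256-bit register of u32 lanes

/// Argmin over u32 values with one (value, index) accumulator per lane.
/// Plain arrays emulate the registers n_acc and idx_acc from the comment.
fn argmin_lanes(xs: &[u32]) -> Option<usize> {
    if xs.is_empty() {
        return None;
    }
    let mut n_acc = [u32::MAX; LANES];
    let mut idx_acc = [0usize; LANES];
    let mut chunks = xs.chunks_exact(LANES);
    let mut base = 0usize; // index of lane 0 in the current chunk
    for ns in chunks.by_ref() {
        for lane in 0..LANES {
            // cmpgt: computed from the old n_acc, so it does not wait on min
            let is_new_min = n_acc[lane] > ns[lane];
            // min: update the per-lane running minimum
            n_acc[lane] = n_acc[lane].min(ns[lane]);
            // blend: take the new index only where the mask is set
            if is_new_min {
                idx_acc[lane] = base + lane;
            }
        }
        base += LANES;
    }
    // Horizontal reduction: smallest value wins, ties go to the lower index.
    let mut best = (n_acc[0], idx_acc[0]);
    for lane in 1..LANES {
        let cand = (n_acc[lane], idx_acc[lane]);
        if cand < best {
            best = cand;
        }
    }
    // Scalar pass over the elements that did not fill a whole chunk.
    for (i, &n) in chunks.remainder().iter().enumerate() {
        if n < best.0 {
            best = (n, base + i);
        }
    }
    Some(best.1)
}
```

Each lane only ever compares against its own accumulator, so the lanes are independent until the final reduction.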

you can consult the list of toys here: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

3. shoo ◴[20 Sep 25 08:00 UTC] No.45311388[source]▶

>>45310547 (TP) #

Even without AVX it seems possible to do better than a naive C-style for-loop argmax by manually unrolling the loop a bit and maintaining multiple accumulators.

e.g. using 4 accumulators instead of 1 accumulator in the naive for loop gives me around a 15%-20% speedup (Not using rust, extremely scalar terrible naive C code via g++ with -funroll-all-loops -march=native -O3)

If we're expressing argmax via the obvious naive C-style for loop, or a functional reduce, with a single accumulator, we're forcing a dependency chain that isn't really part of the problem. But if we don't care which argmax-ing index we get (if there are multiple minimal elements in the array), then instead of evaluating the reductions in a single rigid chain bound by a single accumulator, we can break the chain and get our hardware to do more work in parallel, even if we're only single-threaded.
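A minimal sketch of the multiple-accumulator idea, written in Rust and for argmin to match the article rather than the C argmax described above. The name `argmin_4acc` is hypothetical, inputs are assumed to contain no NaN, and no particular speedup is claimed for this version.

```rust
/// Argmin with four independent (value, index) accumulators.
/// Accumulator k only sees elements at positions k, k+4, k+8, ..., so the
/// four comparison chains do not depend on each other.
/// Assumption: no NaN in the input. On ties, any minimal index may win.
fn argmin_4acc(xs: &[f32]) -> Option<usize> {
    if xs.is_empty() {
        return None;
    }
    let mut best_val = [f32::INFINITY; 4];
    let mut best_idx = [0usize; 4];
    let mut chunks = xs.chunks_exact(4);
    let mut base = 0usize; // index of the first element of the current chunk
    for c in chunks.by_ref() {
        for k in 0..4 {
            if c[k] < best_val[k] {
                best_val[k] = c[k];
                best_idx[k] = base + k;
            }
        }
        base += 4;
    }
    // Merge the four accumulators into one result.
    let mut val = best_val[0];
    let mut idx = best_idx[0];
    for k in 1..4 {
        if best_val[k] < val {
            val = best_val[k];
            idx = best_idx[k];
        }
    }
    // Handle the up to three leftover elements.
    for (i, &x) in chunks.remainder().iter().enumerate() {
        if x < val {
            val = x;
            idx = base + i;
        }
    }
    Some(idx)
}
```

The merge step is the only place where the four chains meet, which is the dependency break the comment describes.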

anonymoushn is doing something much cleverer again using intrinsics but there's still that idea of "how do we break the dependency chain between different operations so the cpu can kick them off in parallel"

4. TinkersW ◴[20 Sep 25 10:10 UTC] No.45311983[source]▶

>>45310547 (TP) #

Yes, this is fairly easy to write in AVX, and you can track the index as well. Honestly, the code is cleaner and nicer to read than this mildly obfuscated Rust.

replies(1): >>45312039 #

5. imtringued ◴[20 Sep 25 10:25 UTC] No.45312039[source]▶

>>45311983 #

You're referring to nothing and nothing. What exactly are you talking about? It certainly can't be the trivial-to-understand one-liners in the blog.
