I wonder whether that could be made faster using AVX instructions; they can find the minimum value among several u32 values, but they don't directly give you its index.
e.g. using 4 accumulators instead of 1 in the naive for loop gives me around a 15%-20% speedup (not using Rust; this is terribly naive scalar C code compiled via g++ with -funroll-all-loops -march=native -O3)
if we're expressing argmin via the obvious naive C-style for loop, or a functional reduce, with a single accumulator, we're forcing a dependency chain that isn't really part of the problem: every comparison has to wait for the result of the previous one. but if we don't care which index we get back when there are multiple minimal elements in the array, then instead of evaluating the reduction as a single rigid chain bound to a single accumulator, we can break the chain and get the hardware to do more work in parallel, even if we're only single threaded.
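to make that concrete, here's a rough sketch of the 4-accumulator idea in plain scalar C. this is not the code I benchmarked; the function names `argmin_1acc` and `argmin_4acc` are made up for illustration, and it assumes a non-empty array of u32 values. each accumulator only ever looks at elements where i % 4 matches its lane, so the four compare chains are independent of each other until the final merge.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: a single accumulator, so each compare depends on the last.
   Assumes n > 0. Returns the first index of the minimum. */
size_t argmin_1acc(const uint32_t *a, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (a[i] < a[best]) best = i;
    return best;
}

/* Hypothetical 4-accumulator variant: four independent (value, index)
   pairs, merged at the end. Assumes n > 0. On ties it returns *some*
   index of the minimum, not necessarily the first one. */
size_t argmin_4acc(const uint32_t *a, size_t n) {
    if (n < 4) return argmin_1acc(a, n);

    uint32_t v0 = a[0], v1 = a[1], v2 = a[2], v3 = a[3];
    size_t i0 = 0, i1 = 1, i2 = 2, i3 = 3;

    size_t i = 4;
    /* Main loop: the four updates do not depend on one another. */
    for (; i + 4 <= n; i += 4) {
        if (a[i]     < v0) { v0 = a[i];     i0 = i;     }
        if (a[i + 1] < v1) { v1 = a[i + 1]; i1 = i + 1; }
        if (a[i + 2] < v2) { v2 = a[i + 2]; i2 = i + 2; }
        if (a[i + 3] < v3) { v3 = a[i + 3]; i3 = i + 3; }
    }
    /* Tail: fold the 0..3 leftover elements into the first accumulator. */
    for (; i < n; i++)
        if (a[i] < v0) { v0 = a[i]; i0 = i; }

    /* Merge the four partial results pairwise. */
    if (v1 < v0) { v0 = v1; i0 = i1; }
    if (v3 < v2) { v2 = v3; i2 = i3; }
    if (v2 < v0) { v0 = v2; i0 = i2; }
    return i0;
}
```

note that the two versions can disagree on the index when the minimum appears more than once, which is exactly the "we don't care which index" trade-off. whether this is actually faster will depend on the compiler, flags and CPU, so measure before trusting it.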
anonymoushn is doing something much cleverer again using intrinsics, but there's still that idea of "how do we break the dependency chain between different operations so the CPU can kick them off in parallel?"