
49 points by melenaboija | 1 comment
QuadmasterXLII No.41852483
I'm really surprised by the performance of the plain C++ version. Is automatic vectorization turned off? Frankly, this task is so common that I would half expect compilers to have a hard-coded special case specifically for fast dot products.

Edit: Yeah, when I compile the "plain C++" version with clang, the main loop is 8 vmovups, 16 vfmadd231ps, and an add/cmp/jne. OP forgot some flags.

replies(1): >>41853866 #
mshockwave No.41853866
Which flags did you use, and which compiler version?
replies(1): >>41853882 #
QuadmasterXLII No.41853882
clang 19, -O3 -ffast-math -march=native
replies(1): >>41853962 #
mshockwave No.41853962
Can confirm, fast math makes the biggest difference.
replies(2): >>41854114 #>>41854399 #
mkristiansen No.41854114
Fast math basically means "who cares about standards, just add in whatever order you want" :)