
49 points by melenaboija | 1 comment
QuadmasterXLII No.41852483
I'm really surprised by the performance of the plain C++ version. Is automatic vectorization turned off? Frankly, this task is so common that I would half expect compilers to have a hard-coded special case specifically for fast dot products.

Edit: Yeah, when I compile the "plain C++" version with clang, the main loop is 8 vmovups, 16 vfmadd231ps, and an add/cmp/jne. OP forgot some flags.

replies(1): >>41853866 #
mshockwave No.41853866
Which flags did you use, and which compiler version?
replies(1): >>41853882 #
QuadmasterXLII No.41853882
clang 19, -O3 -ffast-math -march=native
replies(1): >>41853962 #
mshockwave No.41853962
Can confirm, fast math makes the biggest difference.
replies(2): >>41854114 #>>41854399 #
mkristiansen No.41854114
Fast math basically means "who cares about standards, just add in whatever order you want" :)