(joseprupi.github.io)

66 points melenaboija | 3 comments | 14 Oct 24 21:43 UTC

QuadmasterXLII [15 Oct 24 19:57 UTC] No.41852483

I’m really surprised by the performance of the plain C++ version. Is automatic vectorization turned off? Frankly, this task is so common that I would half expect compilers to have a hard-coded special case specifically for fast dot products.

Edit: Yeah, when I compile the “plain C++” version with clang, the main loop is 8 vmovups, 16 vfmadd231ps, and an add/cmp/jne. OP forgot some flags.

replies(1): >>41853866 #

mshockwave [15 Oct 24 22:39 UTC] No.41853866

>>41852483 #

which flags did you use and which compiler version?

replies(1): >>41853882 #

QuadmasterXLII [15 Oct 24 22:42 UTC] No.41853882

>>41853866 #

clang 19, -O3 -ffast-math -march=native

replies(1): >>41853962 #

mshockwave [15 Oct 24 22:56 UTC] No.41853962

>>41853882 #

can confirm fast math makes the biggest difference
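
For anyone following along, here is a rough sketch of what a "plain C++" cosine similarity loop might look like. It is not the article's actual code, and the name `cosine_similarity` is just an assumption for illustration. Under strict IEEE semantics the compiler has to perform the three running sums in source order, because floating-point addition is not associative, so it generally cannot split them across SIMD lanes. Allowing reassociation, which -ffast-math does, is what lets the reductions be vectorized.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical "plain C++" version: one scalar loop, three reductions.
// Each += is a serial dependency chain unless the compiler is allowed
// to reorder floating-point additions (e.g. with -ffast-math).
float cosine_similarity(const float* a, const float* b, std::size_t n) {
    float dot = 0.0f;     // sum of a[i] * b[i]
    float norm_a = 0.0f;  // sum of a[i]^2
    float norm_b = 0.0f;  // sum of b[i]^2
    for (std::size_t i = 0; i < n; ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    // Caller is assumed to pass non-zero vectors; otherwise this divides by zero.
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Compiling something like this with and without the flags quoted above and comparing the generated assembly should show the difference in the main loop.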

replies(2): >>41854114 #>>41854399 #

1. QuadmasterXLII [16 Oct 24 00:14 UTC] No.41854399

>>41853962 #

I feel like I’m kinda being the bad aunt by encouraging -ffast-math. It can definitely break some things (e.g. https://pspdfkit.com/blog/2021/understanding-fast-math/ ) but I use it habitually and I’m fine so clearly it’s safe.

replies(1): >>41854704 #

2. magicalhippo [16 Oct 24 01:11 UTC] No.41854704

>>41854399 (TP) #

> It can definitely break some things

I recall it totally fudged up the ray/axis-aligned bounding box intersection routine in the raytracer I worked on. The routine relied on infinities being handled correctly, and -ffast-math broke that.

I see the linked article goes into that aspect in detail; I wish I'd had it back then.

IIRC we ended up disabling it for just that file, as it did speed up the rest by a fair bit.
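
To illustrate the kind of routine described above, here is a minimal sketch of the common "slab" ray/AABB test. It is not the raytracer's actual code; `Ray`, `Box`, `make_ray`, and `hits` are made-up names. When a ray is parallel to an axis, its direction component is 0 and the precomputed inverse becomes infinity, and the test relies on the comparisons with ±infinity behaving per IEEE 754. In GCC and Clang, -ffast-math implies -ffinite-math-only, which lets the compiler assume infinities and NaNs never occur, so code like this may be miscompiled.

```cpp
#include <algorithm>
#include <limits>

struct Ray {
    float origin[3];
    float inv_dir[3];  // 1 / direction, may be +/-infinity for axis-parallel rays
};

struct Box {
    float lo[3];
    float hi[3];
};

// Builds a ray; a zero direction component yields an infinite inverse
// (assumes IEEE 754 floats and a build without -ffast-math).
Ray make_ray(const float* origin, const float* dir) {
    Ray r{};
    for (int a = 0; a < 3; ++a) {
        r.origin[a] = origin[a];
        r.inv_dir[a] = 1.0f / dir[a];
    }
    return r;
}

// Slab test: intersect the ray's parameter interval with each axis slab.
bool hits(const Ray& r, const Box& b) {
    float tmin = 0.0f;
    float tmax = std::numeric_limits<float>::infinity();
    for (int a = 0; a < 3; ++a) {
        // For an axis-parallel ray these products are +/-infinity:
        // (-inf, +inf) if the origin lies inside the slab (no constraint),
        // or both the same sign if it lies outside (interval becomes empty).
        float t0 = (b.lo[a] - r.origin[a]) * r.inv_dir[a];
        float t1 = (b.hi[a] - r.origin[a]) * r.inv_dir[a];
        tmin = std::max(tmin, std::min(t0, t1));
        tmax = std::min(tmax, std::max(t0, t1));
    }
    return tmin <= tmax;
}
```

An origin lying exactly on a slab boundary produces 0 * infinity = NaN, which this sketch does not treat specially; production versions usually handle that case more carefully.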

replies(1): >>41859169 #

3. hansvm [16 Oct 24 14:05 UTC] No.41859169

>>41854704 #

I would love a fast-math implementation which handled inf correctly, but no language/compiler seems to care.

A not so fast implementation of cosine similarity in C++ and SIMD