I had a look at the generated assembly, both from your repo and from https://godbolt.org/z/76K1eacsG
The inner loop looks like this:
vmovups ymm3, ymmword ptr [rdi + 4*rcx]
vmovups ymm4, ymmword ptr [rsi + 4*rcx]
add rcx, 8
vfmadd231ps ymm2, ymm3, ymm4
vfmadd231ps ymm1, ymm3, ymm3
vfmadd231ps ymm0, ymm4, ymm4
cmp rcx, rax
jb .LBB0_10
jmp .LBB0_2
You are only using 5 of the 16 ymm registers available with AVX2 (ymm0 to ymm4; these are AVX registers rather than SSE2 ones) before creating a dependency on one of the accumulators (ymm0 to ymm2) that hold the results. I wonder if widening your step size to cover more than one 256-bit register per input might get you the speedup. Something like this (https://godbolt.org/z/GKExaoqcf) gets more of the ymm registers in your CPU doing work:
vmovups ymm6, ymmword ptr [rdi + 4*rcx]
vmovups ymm8, ymmword ptr [rsi + 4*rcx]
vmovups ymm7, ymmword ptr [rdi + 4*rcx + 32]
vmovups ymm9, ymmword ptr [rsi + 4*rcx + 32]
add rcx, 16
vfmadd231ps ymm5, ymm6, ymm8
vfmadd231ps ymm4, ymm7, ymm9
vfmadd231ps ymm3, ymm6, ymm6
vfmadd231ps ymm2, ymm7, ymm7
vfmadd231ps ymm1, ymm8, ymm8
vfmadd231ps ymm0, ymm9, ymm9
cmp rcx, rax
jb .LBB0_10
jmp .LBB0_2
This ends up using 10 of the registers, allowing 6 fused multiply-adds, rather than 3, before creating a dependency on a previous result. You might be able to make that list of independent operations even longer by adding more accumulators.
Again, this was a really interesting writeup :)
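P.S. In case it helps, here is a rough source-level sketch of the same idea in C. I am assuming the loop computes a dot product plus the two squared norms, which is what the three FMAs suggest. The names (`dot_and_norms`, `sums_t`, `LANES`, `UNROLL`) are made up for illustration and are not from your repo. Whether the compiler actually turns this into the assembly above depends on the compiler and flags, and splitting the sums across accumulators changes the order of the floating-point additions, so results can differ slightly in the last bits.

```c
#include <stddef.h>

#define LANES 8   /* floats per 256-bit register */
#define UNROLL 2  /* independent accumulator sets per iteration (hypothetical knob) */

/* Hypothetical result type: dot product and the two squared norms. */
typedef struct {
    float dot;
    float norm_a;
    float norm_b;
} sums_t;

sums_t dot_and_norms(const float *a, const float *b, size_t n) {
    /* One accumulator "register" per unrolled chunk and per sum, so the
       FMAs for chunk 0 do not have to wait on the FMAs for chunk 1. */
    float dot[UNROLL][LANES] = {{0}};
    float na[UNROLL][LANES] = {{0}};
    float nb[UNROLL][LANES] = {{0}};

    const size_t step = (size_t)UNROLL * LANES;
    size_t i = 0;

    /* Main loop: consume UNROLL * LANES floats from each input per iteration. */
    for (; i + step <= n; i += step) {
        for (size_t u = 0; u < UNROLL; u++) {
            for (size_t l = 0; l < LANES; l++) {
                const float x = a[i + u * LANES + l];
                const float y = b[i + u * LANES + l];
                dot[u][l] += x * y;
                na[u][l] += x * x;
                nb[u][l] += y * y;
            }
        }
    }

    /* Horizontal reduction of all accumulators into three scalars. */
    sums_t s = {0.0f, 0.0f, 0.0f};
    for (size_t u = 0; u < UNROLL; u++) {
        for (size_t l = 0; l < LANES; l++) {
            s.dot += dot[u][l];
            s.norm_a += na[u][l];
            s.norm_b += nb[u][l];
        }
    }

    /* Scalar tail for the elements that did not fill a whole step. */
    for (; i < n; i++) {
        s.dot += a[i] * b[i];
        s.norm_a += a[i] * a[i];
        s.norm_b += b[i] * b[i];
    }
    return s;
}
```

Bumping `UNROLL` to 3 or 4 would be the "longer list" version: each extra chunk adds three more accumulators, so there is a limit before the compiler starts spilling registers.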