I had a look at the generated assembly, both in your repo and from https://godbolt.org/z/76K1eacsG
The hot loop looks like this:
        vmovups ymm3, ymmword ptr [rdi + 4*rcx]
        vmovups ymm4, ymmword ptr [rsi + 4*rcx]
        add     rcx, 8
        vfmadd231ps     ymm2, ymm3, ymm4
        vfmadd231ps     ymm1, ymm3, ymm3
        vfmadd231ps     ymm0, ymm4, ymm4
        cmp     rcx, rax
        jb      .LBB0_10
        jmp     .LBB0_2
I wonder if widening your step size to cover more than one 256-bit register per iteration might get you the speed-up. Something like this (https://godbolt.org/z/GKExaoqcf) gets more of your CPU's AVX (ymm) registers doing work, with six independent accumulators instead of three:
        vmovups ymm6, ymmword ptr [rdi + 4*rcx]
        vmovups ymm8, ymmword ptr [rsi + 4*rcx]
        vmovups ymm7, ymmword ptr [rdi + 4*rcx + 32]
        vmovups ymm9, ymmword ptr [rsi + 4*rcx + 32]
        add     rcx, 16
        vfmadd231ps     ymm5, ymm6, ymm8
        vfmadd231ps     ymm4, ymm7, ymm9
        vfmadd231ps     ymm3, ymm6, ymm6
        vfmadd231ps     ymm2, ymm7, ymm7
        vfmadd231ps     ymm1, ymm8, ymm8
        vfmadd231ps     ymm0, ymm9, ymm9
        cmp     rcx, rax
        jb      .LBB0_10
        jmp     .LBB0_2
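For reference, here is a rough source-level sketch of what I mean by the wider step. It is written in Rust as an assumption, since I'm only going from the assembly; the function name `dot_and_norms` and the `LANES` constant are made up for illustration. Whether the compiler turns this into the exact loop above (including the fused multiply-adds) will depend on compiler version and flags, so treat it as a starting point to check on godbolt rather than a drop-in.

```rust
// Hypothetical lane count: 16 f32 values = two 256-bit registers per input.
const LANES: usize = 16;

/// Hypothetical helper: returns (dot(a, b), dot(a, a), dot(b, b)).
fn dot_and_norms(a: &[f32], b: &[f32]) -> (f32, f32, f32) {
    assert_eq!(a.len(), b.len());

    // One independent accumulator per lane, so consecutive iterations
    // do not all wait on a single running sum.
    let mut ab = [0.0f32; LANES];
    let mut aa = [0.0f32; LANES];
    let mut bb = [0.0f32; LANES];

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Elements left over when the length is not a multiple of LANES.
    let a_tail = a_chunks.remainder();
    let b_tail = b_chunks.remainder();

    for (ca, cb) in a_chunks.zip(b_chunks) {
        for i in 0..LANES {
            ab[i] += ca[i] * cb[i];
            aa[i] += ca[i] * ca[i];
            bb[i] += cb[i] * cb[i];
        }
    }

    // Horizontal reduction of the per-lane partial sums.
    let mut sum_ab: f32 = ab.iter().sum();
    let mut sum_aa: f32 = aa.iter().sum();
    let mut sum_bb: f32 = bb.iter().sum();

    // Scalar loop for the tail.
    for (x, y) in a_tail.iter().zip(b_tail) {
        sum_ab += x * y;
        sum_aa += x * x;
        sum_bb += y * y;
    }

    (sum_ab, sum_aa, sum_bb)
}
```

Note that the per-lane sums are added in a different order than a plain scalar loop would use, so results can differ in the last few bits for general floating-point input.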
Again, this was a really interesting writeup :)