Relative to the price of a standard node, FPGAs aren't magic: you have to find the parallelism in order to exploit it. As for custom silicon, anything on anything close to a modern process costs millions in NRE (non-recurring engineering) alone.
From a different perspective, think about supercomputers: many of them do indeed run relatively specific workloads (and I would assume some run custom hardware), but the magic is in the interconnects. Getting the data around efficiently is where the black magic is.
Also, if you aren't particularly time-bound, why bother? FPGAs require completely different types of engineers, and they are generally a bit of a pain to program, even ignoring how horrific some vendor tools are. Your GPU code won't fail timing, for example.