* first you do it in the cpu * then you do it in a dedicated card off the bus * then you find the FPGA or whatever too slow, so you make the card have it's own CPU * then you wind up recursing over the problem, implementing some logic in a special area of the cpu, to optimise its bus to the other bus to ...
I expect to come back in 10years and find there is a chiplet which took the rpi core, and implements a shrunk version which can be reprogrammed, into the chiplet on the offload card, so we can program terabit network drivers with a general purpose CPU model.