Whether the market segment that could use that much special sauce effectively enough to justify the software engineering would also be large enough to warrant the hardware design and bespoke silicon such a project entails is an open question…
I'd probably favor spending the silicon on scatter/gather, or on some way to span a large gap between calculating an address and using the value fetched from it, so a prefetch wouldn't need to either re-calculate the address (expensive) or tie up a GPR holding it (a precious resource). It could also let load atomicity happen anywhere between address provision (the prefetch request) and load completion (the write to the destination data register).
Prior art: Nvidia's recent async memcpy (iirc it came with Ampere/A100, as `cp.async`), which copies directly from global to "shared" memory (the user-managed partition of the L1 data cache), bypassing the register file.
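For reference, a minimal sketch of that mechanism via the cooperative-groups API (assumes CUDA 11+ and sm_80 or newer, where `memcpy_async` lowers to `cp.async`); the kernel name and tile size are made up:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

__global__ void staged_sum(const float *g_in, float *g_out) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Asynchronous copy global -> shared: on sm_80+ the data never
    // passes through the register file, and the hardware tracks
    // completion on the thread block's behalf.
    cg::memcpy_async(block, tile, g_in + blockIdx.x * 256,
                     sizeof(float) * 256);

    // ...independent work can overlap with the copy here...

    cg::wait(block);  // block only when the data is actually needed
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < 256; i++) s += tile[i];
        g_out[blockIdx.x] = s;
    }
}
```

It's the same address-early / data-late split: the address is provided up front, the data lands later at its destination, and nothing occupies a register in between.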