You see this most obviously, even visually, in places like game engines. In Unity, the difference between non-Burst and Burst-compiled code is dramatic. By comparison, the difference between single-core and multi-core execution in the job system is often irrelevant: if each job doesn't burn enough CPU time, the benefit of going multicore evaporates. Sending a job out to be run on the fleet carries real overhead, and it has to be worth that one-time 100x latency cost both ways.
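You can see the same tradeoff outside of Unity with nothing but plain C++ threads. This is a deliberately crude sketch, not Unity code: a real job system keeps a persistent worker pool, so its per-job overhead is far smaller than spawning a thread, but the shape of the argument is the same. The names and sizes here (`tiny_job`, `kCount`, `kJobs`) are made up for illustration.

```cpp
// Crude sketch: when per-job work is tiny, dispatch overhead wins.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Deliberately tiny per-job workload.
static void tiny_job(float* data, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    constexpr size_t kCount = 1 << 20;
    std::vector<float> data(kCount, 1.0f);

    // Single core: just run the loop.
    auto t0 = std::chrono::steady_clock::now();
    tiny_job(data.data(), 0, kCount);
    auto t1 = std::chrono::steady_clock::now();

    // "Multicore": chop the same work into many small jobs, each handed to a
    // fresh thread. The spawn/join cost is the overhead paid per job.
    constexpr size_t kJobs = 256;
    auto t2 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (size_t j = 0; j < kJobs; ++j) {
        size_t begin = j * (kCount / kJobs);
        size_t end = begin + (kCount / kJobs);
        workers.emplace_back(tiny_job, data.data(), begin, end);
    }
    for (auto& w : workers) w.join();
    auto t3 = std::chrono::steady_clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("single core:    %lld us\n", (long long)us(t0, t1));
    std::printf("many tiny jobs: %lld us\n", (long long)us(t2, t3));
}
```

On most machines the "multicore" version loses badly, because each job is a microsecond or two of arithmetic wrapped in tens of microseconds of dispatch.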
The GPU is the ultimate example of this. Some workloads benefit dramatically from its massive parallelism; others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm: some ML techniques run terribly on a GPU, yet we seem to have convinced ourselves that a GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute. Getting data in and out of a GPU takes an eternity compared to hitting L1 cache. And there are other fundamental problems with GPUs, like warp divergence, that preclude clever workarounds.
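To put a rough shape on that, here is a minimal CUDA sketch that times a full round trip (copy in, trivial kernel, copy out) for a deliberately small buffer. The kernel name and sizes are made up, and the numbers will vary by hardware; the point is only that the copies and the launch dominate when the arithmetic itself is tiny.

```cpp
// Minimal CUDA sketch: round-trip latency vs. a trivial amount of compute.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
    // If neighboring threads in a warp took different branches here, the warp
    // would execute both paths serially -- that's warp divergence.
}

int main() {
    const int n = 4096;                      // deliberately small workload
    const size_t bytes = n * sizeof(float);

    float* host = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // host -> device
    scale<<<(n + 255) / 256, 256>>>(dev, n);               // the "real" work
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // The copies and launch typically cost tens of microseconds, while 4096
    // fused multiply-adds on data already sitting in L1 would take nanoseconds.
    std::printf("round trip for %d floats: %.3f ms\n", n, ms);

    cudaFree(dev);
    free(host);
    return 0;
}
```

The comment inside the kernel is where the second problem lives: threads in a warp execute in lockstep, so naturally branchy code ends up running every taken path serially and never sees the throughput the spec sheet promises.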