This isn't actually true. You could do this 20 years ago on a consumer laptop, and you don't need the information you get for free from text moving under a filter either.
What you need is the ability to reproduce the conditions the image was generated and pixelated/blurred under. If the pixel radius only encompasses, say, 4 characters, then you only need to search for those 4 characters first. And then you can proceed to the next few characters represented under the next pixelated block.
You can think of pixelation as a bad hash which is very easy to find a preimage for.
No motion necessary. No AI necessary. No machine learning necessary.
The hard part is recreating the environment though, and AI just means you can skip having that effort and know-how.