How does this work? Isn't the WhatsApp data encrypted locally?
You have to decrypt data to process or display it, and as soon as you do that, the right OS-level APIs are enough to see whatever you want. Here the accessibility APIs are probably enough to read any text that you would be able to read on screen.
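To make that concrete, here is a toy sketch of the idea, not how WhatsApp or any real accessibility API works. Everything in it is hypothetical: `keystream_xor` is a stand-in cipher (not real cryptography), `UiNode` stands in for whatever UI tree the OS exposes, and `read_visible_text` plays the role of an accessibility-style reader. The point is only that the reader never needs the key, because the app has already decrypted the text in order to show it.

```python
import hashlib
from dataclasses import dataclass, field


def keystream_xor(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher (NOT real cryptography): XOR with a SHA-256 keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    # zip() stops at len(data), so the extra keystream bytes are ignored.
    return bytes(a ^ b for a, b in zip(data, stream))


@dataclass
class UiNode:
    """Hypothetical stand-in for a node in the UI tree the OS exposes."""
    text: str = ""
    children: list["UiNode"] = field(default_factory=list)


def render_chat(stored: list[bytes], key: bytes) -> UiNode:
    """The app decrypts each stored message so it can be displayed."""
    root = UiNode()
    for blob in stored:
        plaintext = keystream_xor(blob, key).decode("utf-8")
        root.children.append(UiNode(text=plaintext))
    return root


def read_visible_text(node: UiNode) -> list[str]:
    """Accessibility-style reader: walks the UI tree and never touches the key."""
    texts = [node.text] if node.text else []
    for child in node.children:
        texts.extend(read_visible_text(child))
    return texts


key = b"local-db-key"  # hypothetical key held only by the app
messages = ["see you at 8", "bring the keys"]

# What sits on disk: ciphertext only.
stored = [keystream_xor(m.encode("utf-8"), key) for m in messages]

# What the app puts on screen, and what a UI-tree reader can collect from it.
ui_tree = render_chat(stored, key)
recovered = read_visible_text(ui_tree)
```

The encryption at rest is intact the whole time; the text is exposed at the display step, which is the layer accessibility-style APIs operate on.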
Another person in the thread suggests it's working over a screen capture stream. But that's what I'm wondering: are they working from a video of the screen, or by integrating directly with the internals of the OS?