In the mug-scrubbing video, the person clearly pretends to wash the cup but does not seem to want to get their hands wet anyway. I'm curious as to when models can figure out that subtle thing.
It's all probabilistic, my guess. I.e. model produces probabilities for a set of actions from the same video. Even pretended action may look more like it than anything else. Thus getting higher probability.