One prompt I use for testing is: "Using three.js, render a spinning donut with gl.TRIANGLE_STRIP". The catch is that three.js doesn't support TRIANGLE_STRIP, for architectural reasons[1]. Before I knew this, I couldn't work out why all the AIs kept failing and gaslighting me about using TRIANGLE_STRIP. An AI that doesn't tell the user the task is impossible as stated has failed the test. So far, I haven't found one that recognizes the request as invalid.
[1] https://discourse.threejs.org/t/is-there-really-no-way-to-us...
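
For anyone wondering what a correct answer could offer instead: since the strip mode isn't available, the usual fallback is to expand the strip into a plain triangle list before handing it to three.js. Here is a rough sketch of that expansion in TypeScript. `stripToTriangles` is a made-up name for illustration, not a three.js API, and it assumes the standard OpenGL strip rule where every odd triangle swaps its first two vertices to keep the winding consistent. I believe three.js's BufferGeometryUtils has a `toTrianglesDrawMode` helper that does this conversion on a geometry, but check the docs rather than taking my word for it.

```typescript
// Hypothetical helper (not part of three.js): expands triangle-strip
// indices into an ordinary triangle list, three indices per triangle.
function stripToTriangles(strip: number[]): number[] {
  const triangles: number[] = [];
  for (let i = 0; i + 2 < strip.length; i++) {
    const a = strip[i];
    const b = strip[i + 1];
    const c = strip[i + 2];
    // Skip degenerate triangles, which strips often use as stitching.
    if (a === b || b === c || a === c) continue;
    if (i % 2 === 0) {
      triangles.push(a, b, c);
    } else {
      // Odd triangles swap the first two vertices to preserve winding.
      triangles.push(b, a, c);
    }
  }
  return triangles;
}
```

The result can then be used as the index of a regular indexed geometry, so the donut still renders, just not with TRIANGLE_STRIP.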