Instead of sending accessibility tree snapshots on every action, Claude just writes Playwright code and runs it. You get back screenshots and console output. That's it.
314 lines of instructions vs a persistent MCP server. Full API docs only load if Claude needs them.
Same browser automation, way less overhead. Works as a Claude Code plugin or manual install.
Token limit issue: https://github.com/microsoft/playwright-mcp/issues/889
Claude Skills docs: https://docs.claude.com/en/docs/claude-code/skills
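For illustration, here's a rough sketch of the kind of throwaway script Claude ends up writing and running under this approach: drive the page, dump console output to stdout, save a screenshot locally. The URL and selector are made up, not part of the skill.

```
# Hypothetical example of a script Claude might write and run under this skill.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Echo console output to stdout so it comes back in the command result.
    page.on("console", lambda msg: print(f"[console:{msg.type}] {msg.text}"))

    page.goto("http://localhost:3000")                # assumed dev server URL
    page.get_by_role("button", name="Save").click()   # assumed UI element

    # Screenshot stays local in /tmp; only stdout goes back to the model.
    page.screenshot(path="/tmp/after-save.png", full_page=True)
    browser.close()
```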
Edit: oops that’s what you did too. Yes most MCP shouldn’t be used.
This might be sufficient for an independent contractor or student. It shouldn't be used in a production agent.
And for privacy: screenshots stay local in /tmp, but console output and page content do go to Claude/Anthropic. It’s designed for dev environments with dummy data, not prod. Same deal as using Claude for any coding help.
1) The examples always seem very generic: "Test Login Functionality, check if search works, etc." Do these actually work well at all once you step outside of the basic smoke-test use cases?
2) How do you prevent proprietary data from being read when you are just foisting snapshots over to the AI provider? There's no way I'd be able to use this in any kind of real application where data privacy is a constraint.
1) Beyond basic tests: You're right to be skeptical. This is best for quick exploratory testing during local development ("does my new feature work?"), not replacing your test suite. Think "scriptable manual testing" - faster than writing Playwright manually, but not suitable for comprehensive CI/CD test coverage.
2) Data privacy: Screenshots stay local in /tmp, but console output and page content Claude writes tests against are sent to Anthropic. This is a local dev tool for testing with dummy data, not for production environments with real user data. Same privacy model as any AI coding assistant - if you wouldn't show Claude your production database, don't test against it with this.
Using Claude Code I'll often prompt something like this:
"Start a python -m http.server on port 8003 and then use Playwright Python to exercise this UI, there's a console error when you click the button, click it and then read that error and then fix it and demonstrate the fix"
This works really well even without adding an extra skill.
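In practice that prompt tends to produce a script along these lines (a rough sketch of what Claude typically writes; the button selector is hypothetical):

```
# Serve the app, click the offending button, and surface the console error.
from playwright.sync_api import sync_playwright

errors = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)

    page.goto("http://localhost:8003")   # the python -m http.server from the prompt
    page.click("#broken-button")         # hypothetical selector for the failing button
    page.wait_for_timeout(500)           # give the handler time to throw
    browser.close()

print("Console errors:", errors or "none")
```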
I think one of the hardest parts of skill development is figuring out what to put in the skill that produces better results than the model acting alone.
Have you tried iteratively testing the skill - building it up part by part and testing along the way to see if the different sections genuinely help improve the model's performance?
Related anecdote: some months ago I tried to coax the Playwright MCP into doing a full-page screenshot and it couldn't do it. Then I just told Claude Code to write a Playwright JS script to do that and it worked on the first try.
Taking into account all the tools crap that the Playwright MCP puts in your context window, and the final result, I think this is the way to go.
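For what it's worth, the full-page capture the MCP couldn't manage is a single option in the Playwright API (Python shown here; the anecdote used the JS API, where it's the same flag). The URL is a placeholder.

```
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")                        # placeholder URL
    page.screenshot(path="/tmp/full.png", full_page=True)   # full_page is the key option
    browser.close()
```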
I did test by comparing transcripts across sessions to refine the workflow. As I'm running into new things I'm continuing to do that.
There's also the problem of the LLM being trained to use foo tool 1.0 when foo tool is now on version 2.0.
The nice thing is that scripts in a skill are not included in the context, and they're also deterministic.
Excellent question... no, beyond basic kindergarten stuff Playwright (with AI) falls apart quickly. Have some OAuth? Good luck configuring Playwright for your exact setup. Need to synthesize all the information available from logs and visuals to debug something? Good luck...
Then nVidia's moat begins to shrink because they need to offer their GPUs at a somewhat reduced price to try to keep their majority share.
Edit with one more thought: in many ways this mirrors building/adopting dev tooling to help your (human) junior engineers, and that still feels like the right metaphor for working with coding agents. It's extremely context-dependent and murky to evaluate whether a new tool is effective -- you usually just have to try it out.
MCPs themselves may provide access to tools that are deterministic or not, but the LLM using them generally isn't deterministic, so when they're used as part of the request-response cycle, any determinism the MCP-provided tool had is not a feature of the overall system.
SKILLS.md relies on a deterministic code execution environment, but has the same issue. I'm not seeing a broad difference in kind here when used in the context of an LLM response generation cycle, and that’s really the only context where both are usable (MCP could be used for non-LLM integration, but that doesn't seem relevant.)
Recently, I have found myself getting more interested in shell commands than MCPs. There is no need to set anything up. Debugging is far easier. And I'm free to use whichever model I like for a specific function. For example, for Playwright I use GPT-5, just because I have free credits. I can save my Claude Code quota for more important tasks.
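A sketch of what that looks like in practice: a small self-contained script any model (or a human) can invoke as a plain shell command. The file name and arguments here are made up for illustration.

```
#!/usr/bin/env python3
# check_page.py (hypothetical): print console errors for a URL.
# Usage: python check_page.py http://localhost:8003
import sys
from playwright.sync_api import sync_playwright

url = sys.argv[1]
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("console", lambda m: m.type == "error" and print(f"ERROR: {m.text}"))
    page.goto(url)
    page.wait_for_timeout(1000)  # crude wait for late errors
    browser.close()
```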
I was looking into creating one and skimmed the available ones and didn't see it.
EDIT:
Just looked again. In the docs they have this section:
```
Available Skills

Pre-built Agent Skills
The following pre-built Agent Skills are available for immediate use:

PowerPoint (pptx): Create presentations, edit slides, analyze presentation content
Excel (xlsx): Create spreadsheets, analyze data, generate reports with charts
Word (docx): Create documents, edit content, format text
PDF (pdf): Generate formatted PDF documents and reports

These Skills are available on the Claude API and claude.ai. See the quickstart tutorial to start using them in the API.
```
Is there another list of available skills?
This is the skill creation one: https://github.com/anthropics/skills/blob/main/skill-creator...
You can turn on additional skills in the Claude UI from this page: https://claude.ai/settings/capabilities
The agentic system that uses MCP (e.g., an LLM) is fundamentally non-deterministic. The LLM's decision of which tool to call, when to call it, and what to do with the response is stochastic.
When it works, it's totally magic, but I find it gets hung up on things like not finding the active Playwright window or not being able to identify elements on the screen.
If you're willing to have Claude write code to test a thing, you could do a teeny bit more work and make that Playwright script a permanent part of your codebase. Then the script can run in your CI on every build, and you can keep enhancing it as your product changes so it keeps proving that area of your product works as desired.
Have it run inside a harness that spins up its own server & stack (DB etc.) and boom - you now have an end-to-end test suite!
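A minimal sketch of that harness with pytest (the server command and assertion are placeholders; a real stack would also start its DB and seed data in the fixture):

```
# test_e2e.py: spin up the app, then assert against it with Playwright.
import subprocess
import time

import pytest
from playwright.sync_api import sync_playwright

@pytest.fixture(scope="session")
def server():
    # Replace with your real startup command (app server, DB, seed data, ...).
    proc = subprocess.Popen(["python", "-m", "http.server", "8003"])
    time.sleep(1)  # crude wait; a real harness would poll for readiness
    yield "http://localhost:8003"
    proc.terminate()

def test_homepage_loads(server):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(server)
        assert "Directory listing" in page.content()  # placeholder assertion
        browser.close()
```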
The issue is that for many things Playwright is really verbose; by better tailoring outputs and making them more fine-grained, you get less context bloat and the LLM can work with the context more effectively. I'm making it open source. For example, see the sketch below.
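My guess at the direction (not the actual project): instead of dumping Playwright's raw output, collect only the signals the model needs and print a compact summary. The URL is a hypothetical dev server.

```
# Sketch: condense a page visit into a few terse lines instead of raw Playwright output.
from playwright.sync_api import sync_playwright

def summarize(url: str) -> None:
    errors, failed = [], []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("console", lambda m: m.type == "error" and errors.append(m.text))
        page.on("response", lambda r: r.status >= 400 and failed.append(f"{r.status} {r.url}"))
        page.goto(url)
        page.wait_for_timeout(500)
        browser.close()
    # A few fine-grained lines cost far less context than a full accessibility tree.
    print(f"console errors: {len(errors)}", *errors[:5], sep="\n  ")
    print(f"failed requests: {len(failed)}", *failed[:5], sep="\n  ")

summarize("http://localhost:8003")  # hypothetical dev server
```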