←back to thread

Claude for Chrome

(www.anthropic.com)
795 points davidbarker | 7 comments | | HN request time: 0.725s | source | bottom
Show context
parsabg ◴[] No.45031888[source]
I built a very similar extension [1] a couple of months ago that supports a wide range of models, including Claude, and enables them to take control of a user's browser using tools for mouse and keyboard actions, observation, etc. It's a fun little project to look at to understand how this type of thing works.

It's clear to me that the tech just isn't there yet. The information density of a web page with standard representations (DOM, screenshot, etc) is an order of magnitude lower than that of, say, a document or piece of code, which is where LLMs shine. So we either need much better web page representations, or much more capable models, for this to work robustly. Having LLMs book flights by interacting with the DOM is sort of like having them code a web app using assembly. Dia, Comet, Browser Use, Gemini, etc are all attacking this and have big incentives to crack it, so we should expect decent progress here.

A funny observation was that some models have been clearly fine tuned for web browsing tasks, as they have memorized specific selectors (e.g. "the selector for the search input in google search is `.gLFyf`").

[1] https://github.com/parsaghaffari/browserbee

replies(11): >>45032377 #>>45032556 #>>45032983 #>>45033328 #>>45033344 #>>45033797 #>>45033828 #>>45035580 #>>45036238 #>>45037152 #>>45040560 #
1. felarof ◴[] No.45033797[source]
Just dumping the raw DOM into the LLM context is brutal on token usage. We've seen pages that eat up 60-70k tokens when you include the full DOM plus screenshots, which basically maxes out your context window before you even start doing anything useful.

We've been working on this exact problem at https://github.com/browseros-ai/BrowserOS. Instead of throwing the entire DOM at the model, we hook into Chromium's rendering engine to extract a cleaner representation of what's actually on the page. Our browser agents work with this cleaned-up data, which makes the whole interaction much more efficient.

replies(5): >>45034412 #>>45034593 #>>45036054 #>>45036065 #>>45038003 #
2. commanderkeen08 ◴[] No.45034412[source]
Playwrights MCP went had a strong idea to default to the accessibility tree instead of DOM. Unfortunately, even that is pretty chonky.
3. apitman ◴[] No.45034593[source]
Maybe people will start making simpler/smaller websites in order to work better with AI tools. That would be nice.
replies(1): >>45035750 #
4. pishpash ◴[] No.45035750[source]
You just need to capture the rendering and represent that.
5. edg5000 ◴[] No.45036054[source]
It could work simmilar to Claude Code right? Where it won't ingest the entire codebase, rather search for certain strings or start looking at a directed location and follow references from there. Indeed it seems infeasible to ingest the whole thing.
6. kodefreeze ◴[] No.45036065[source]
This is really interesting. We've been working on a smaller set of this problem space. We've also found in some cases you need to somehow pass to the model the sequence of events that happen (like a video of a transition).

For instance, we were running a test case on a e commerce website and they have a random popup that used to come up after initial Dom was rendered but before action could be taken. This would confuse the LLM for the next action it needed to take because it didn't know the pop-up came up.

7. ◴[] No.45038003[source]