Claude for Chrome

(www.anthropic.com)

795 points davidbarker | 1 comments | 26 Aug 25 19:01 UTC

parsabg [26 Aug 25 20:24 UTC] No.45031888

I built a very similar extension [1] a couple of months ago that supports a wide range of models, including Claude, and enables them to take control of a user's browser using tools for mouse and keyboard actions, observation, etc. It's a fun little project to look at to understand how this type of thing works.

It's clear to me that the tech just isn't there yet. The information density of a web page with standard representations (DOM, screenshot, etc) is an order of magnitude lower than that of, say, a document or piece of code, which is where LLMs shine. So we either need much better web page representations, or much more capable models, for this to work robustly. Having LLMs book flights by interacting with the DOM is sort of like having them code a web app using assembly. Dia, Comet, Browser Use, Gemini, etc are all attacking this and have big incentives to crack it, so we should expect decent progress here.

A funny observation was that some models have clearly been fine-tuned for web browsing tasks, as they have memorized specific selectors (e.g. "the selector for the search input in google search is `.gLFyf`").

[1] https://github.com/parsaghaffari/browserbee

adam_arthur [26 Aug 25 22:55 UTC] No.45033344

>>45031888 #

The LLM should not be seeing the raw DOM in its context window, but a highly simplified and compact version of it.

In general, LLMs perform worse both when the context is larger and when it is less information-dense.

To achieve good performance, all input to the prompt must be made as compact and information-dense as possible.
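
To make the "compact and information-dense" point concrete, here is a minimal sketch using only Python's standard `html.parser`. It is not how any of the tools in this thread work; the set of tags and attributes kept, and the names `CompactDom` and `compact_dom`, are assumptions made up for illustration. It drops layout markup and emits one short numbered line per interactive element.

```python
from html.parser import HTMLParser

# Tags treated as worth showing to the model (an assumption for this sketch).
INTERACTIVE = {"a", "button", "input", "select", "textarea"}
# Attributes kept because they hint at an element's purpose.
KEEP_ATTRS = ("type", "name", "placeholder", "aria-label", "href")
# Void elements have no closing tag, so they are emitted immediately.
VOID = {"input"}


class CompactDom(HTMLParser):
    """Collects one short line per interactive element and ignores the rest."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._open = None  # (tag, kept attrs, text parts) of the element being read

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE:
            return
        kept = " ".join(f'{k}="{v}"' for k, v in attrs if k in KEEP_ATTRS and v)
        if tag in VOID:
            self._emit(tag, kept, "")
        else:
            self._open = (tag, kept, [])

    def handle_data(self, data):
        # Text is only recorded while inside an interactive element.
        if self._open is not None:
            self._open[2].append(data.strip())

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            _, kept, parts = self._open
            self._emit(tag, kept, " ".join(p for p in parts if p))
            self._open = None

    def _emit(self, tag, kept, text):
        index = len(self.lines) + 1
        body = " ".join(x for x in (tag, kept) if x)
        line = f"[{index}] <{body}>"
        self.lines.append(f"{line} {text}" if text else line)


def compact_dom(html):
    """Return a compact, numbered listing of the interactive elements in html."""
    parser = CompactDom()
    parser.feed(html)
    return "\n".join(parser.lines)
```

Fed a page fragment with wrapper `div`s, paragraphs and class names, this returns only lines such as `[2] <button> Go`, which is far shorter than the raw markup. A real implementation would also need to handle nesting, visibility and dynamic content, which this sketch ignores.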

I built a similar tool as well, but for automating generation of E2E browser tests.

Further, you can have sub-LLMs help with compacting aspects of the context prior to handing it off to the main LLM. (Note: it's important that the design itself makes it impossible for the model to hallucinate HTML selectors.)
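
One possible way to get that "cannot be hallucinated by design" property, sketched here as an assumption rather than a description of any specific tool: never show the model real selectors. Show it short numeric handles instead, and resolve them on the tool side against selectors that were actually captured from the page. The `ElementRegistry` name and its methods are hypothetical.

```python
class ElementRegistry:
    """Maps short numeric handles to selectors captured from the live page."""

    def __init__(self, selectors):
        # Handles start at 1 and follow page order.
        self._by_handle = dict(enumerate(selectors, start=1))

    def resolve(self, handle):
        """Return the real selector for a handle, or raise ValueError."""
        try:
            key = int(str(handle).strip())
        except ValueError:
            # The model answered with something that is not a handle at all,
            # for example a selector it invented.
            raise ValueError(f"not a handle: {handle!r}") from None
        if key not in self._by_handle:
            raise ValueError(f"unknown handle: {key}")
        return self._by_handle[key]


registry = ElementRegistry(["input[name='q']", "button[type='submit']"])
selector = registry.resolve("2")  # the model only ever says "2"
```

The model can still pick the wrong element, but it cannot produce a selector that does not exist on the page, because anything outside the registry is rejected before it reaches the browser.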

Modern LLMs are absolutely capable of interpreting web pages proficiently if implemented well.

That being said, things like this Claude product seem fundamentally poorly designed, both in security and in general approach, and I don't agree at all that prompt engineering is remotely the right way to remediate this.

There are so many companies pushing out junk products where the AI is just handling the wrong part of the loop and pulls in far too much context to perform well.

1. felarof [26 Aug 25 23:57 UTC] No.45033809

>>45033344 #

> The LLM should not be seeing the raw DOM in its context window, but a highly simplified and compact version of it.

Precisely! There is already something called the accessibility tree, which the Chromium rendering engine constructs and which is a semantically meaningful version of the DOM.

This is what we use at BrowserOS.com
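
As a rough illustration of why an accessibility tree is a compact input, here is a minimal sketch that flattens one into indented text. The node shape (a dict with `role`, `name` and `children`) is a simplified assumption for this example, not Chromium's actual format and not a description of BrowserOS's implementation; `SKIP_ROLES` and `flatten_ax_tree` are hypothetical names.

```python
# Roles that carry no meaning on their own (an assumption for this sketch).
SKIP_ROLES = {"generic", "none", "presentation"}


def flatten_ax_tree(node, depth=0):
    """Return indented 'role "name"' lines, skipping meaningless wrappers."""
    role = node.get("role", "")
    name = node.get("name", "")
    lines = []
    child_depth = depth
    if role and role not in SKIP_ROLES:
        label = f'{role} "{name}"' if name else role
        lines.append("  " * depth + label)
        child_depth = depth + 1
    # Children of a skipped wrapper are promoted to the wrapper's depth.
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, child_depth))
    return lines


tree = {
    "role": "document",
    "name": "Search",
    "children": [
        {
            "role": "generic",
            "children": [
                {"role": "textbox", "name": "Search query"},
                {"role": "button", "name": "Go"},
            ],
        }
    ],
}
print("\n".join(flatten_ax_tree(tree)))
```

The wrapper node disappears and what remains is three lines: the document, a textbox and a button, each with its role and accessible name.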
