←back to thread

Claude for Chrome

(www.anthropic.com)
795 points davidbarker | 3 comments | | HN request time: 0s | source
Show context
parsabg ◴[] No.45031888[source]
I built a very similar extension [1] a couple of months ago that supports a wide range of models, including Claude, and enables them to take control of a user's browser using tools for mouse and keyboard actions, observation, etc. It's a fun little project to look at to understand how this type of thing works.

It's clear to me that the tech just isn't there yet. The information density of a web page with standard representations (DOM, screenshot, etc) is an order of magnitude lower than that of, say, a document or piece of code, which is where LLMs shine. So we either need much better web page representations, or much more capable models, for this to work robustly. Having LLMs book flights by interacting with the DOM is sort of like having them code a web app using assembly. Dia, Comet, Browser Use, Gemini, etc are all attacking this and have big incentives to crack it, so we should expect decent progress here.

A funny observation was that some models have been clearly fine tuned for web browsing tasks, as they have memorized specific selectors (e.g. "the selector for the search input in google search is `.gLFyf`").

[1] https://github.com/parsaghaffari/browserbee

replies(11): >>45032377 #>>45032556 #>>45032983 #>>45033328 #>>45033344 #>>45033797 #>>45033828 #>>45035580 #>>45036238 #>>45037152 #>>45040560 #
asdff ◴[] No.45033328[source]
It is kind of funny how the systems are set up where there often is dense and queryable information out there already for a lot of these tasks, but these are ignored in favor of the difficult challenge of brute forcing the human consumer facing ui instead of some existing api that is designed to be machine readable already. E.g. booking flights. Travel agents use software that queries all the airlines ticket inventory to return flight information to you the consumer. The issue of booking a flight is theoretically solved already by virtue of these APIs that already exist to do just that. But for AI agents this is now a stumbling block because it would presumably take a little bit of time to craft out a rule to cover this edge case and return far more accurate information and results. Consumers with no alternative don't know what they are missing so there is no incentive to improve this.
replies(6): >>45033728 #>>45034068 #>>45034115 #>>45034274 #>>45034337 #>>45034796 #
1. makeitdouble ◴[] No.45034796[source]
This was the Rabbit R1's connundrum. Uber/DoorDash/Spotify have APIs for external integration, but they require business deals and negociations.

So how to evade talking to the service's business people ? Provide a chain of Rube Goldberg machines to somewhat use these services as if it was the user. It can then be touted as flexibility, and blame the state of technology when it inevitably breaks, if it even worked in the first place.

replies(1): >>45046745 #
2. digitaltrees ◴[] No.45046745[source]
This is definitely true but there are more reasons that explain why so many teams choose the seemingly irrational path. First, so many APIs are designed differently, so even if you decide the business negotiation is worth it you have development work ahead. Second, tons of vendors don’t even have an API. So the thought of building a tool once is appealing
replies(1): >>45049389 #
3. makeitdouble ◴[] No.45049389[source]
Those are of course valid points. The counterpart being that a vendor might not have an API because they actively don't want to (Twitter/X for instance...), and when they have one, clients trying to circumvent their system to basically scrape the user UX won't be welcomed either.

So most of the time that path of "build a tool once" will be adversarial towards the service, which will be incentivized to actively kill your ad-hoc integration if they can without too much collateral damage.