
Claude for Chrome

(www.anthropic.com)
795 points by davidbarker | 1 comment
parsabg No.45031888
I built a very similar extension [1] a couple of months ago that supports a wide range of models, including Claude, and enables them to take control of a user's browser using tools for mouse and keyboard actions, observation, etc. It's a fun little project to look at to understand how this type of thing works.
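For anyone curious how "tools for mouse and keyboard actions" works concretely, here's a minimal sketch (not BrowserBee's actual code) of the pattern: the model is given JSON tool schemas, and the extension translates the tool calls it emits into Chrome DevTools Protocol input events. It assumes an MV3 extension with the "debugger" permission that has already called `chrome.debugger.attach({ tabId }, "1.3")`.

```typescript
// Tool definitions the model sees (the same shape Claude-style tool calling expects),
// plus a dispatcher that turns the model's tool calls into real browser input.

type ToolCall =
  | { name: "click"; input: { x: number; y: number } }
  | { name: "type_text"; input: { text: string } }
  | { name: "screenshot"; input: Record<string, never> };

export const toolSchemas = [
  {
    name: "click",
    description: "Click at viewport coordinates",
    input_schema: { type: "object", properties: { x: { type: "number" }, y: { type: "number" } }, required: ["x", "y"] },
  },
  {
    name: "type_text",
    description: "Type text into the currently focused element",
    input_schema: { type: "object", properties: { text: { type: "string" } }, required: ["text"] },
  },
  {
    name: "screenshot",
    description: "Capture the visible tab as a PNG data URL",
    input_schema: { type: "object", properties: {} },
  },
];

// Execute one tool call against a tab and return an observation for the model.
export async function runTool(tabId: number, call: ToolCall): Promise<string> {
  const target = { tabId };
  switch (call.name) {
    case "click":
      await chrome.debugger.sendCommand(target, "Input.dispatchMouseEvent",
        { type: "mousePressed", x: call.input.x, y: call.input.y, button: "left", clickCount: 1 });
      await chrome.debugger.sendCommand(target, "Input.dispatchMouseEvent",
        { type: "mouseReleased", x: call.input.x, y: call.input.y, button: "left", clickCount: 1 });
      return "clicked";
    case "type_text":
      await chrome.debugger.sendCommand(target, "Input.insertText", { text: call.input.text });
      return "typed";
    case "screenshot":
      return chrome.tabs.captureVisibleTab({ format: "png" });
    default:
      throw new Error("unknown tool");
  }
}
```

The agent loop is then just: send the current observation plus these schemas to the model, run whatever tool call comes back through `runTool`, append the result to the conversation, and repeat until the model says it's done.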

It's clear to me that the tech just isn't there yet. The information density of a web page with standard representations (DOM, screenshot, etc.) is an order of magnitude lower than that of, say, a document or piece of code, which is where LLMs shine. So we either need much better web page representations, or much more capable models, for this to work robustly. Having LLMs book flights by interacting with the DOM is sort of like having them code a web app in assembly. Dia, Comet, Browser Use, Gemini, etc. are all attacking this and have big incentives to crack it, so we should expect decent progress here.
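One direction for "better web page representations" is to compress the page down to its interactive surface before the model ever sees it. A rough sketch of the idea (the heuristics here are illustrative, not from any particular extension):

```typescript
// Build a compact, numbered inventory of interactive elements so the model can
// say "click 3" instead of reasoning over megabytes of raw DOM or a screenshot.

export function compactPage(doc: Document): { summary: string; elements: Element[] } {
  const candidates = Array.from(
    doc.querySelectorAll<HTMLElement>("a, button, input, select, textarea, [role], [onclick]")
  ).filter((el) => el.offsetParent !== null); // rough visibility filter

  const elements: Element[] = [];
  const lines: string[] = [];
  for (const el of candidates) {
    const role = el.getAttribute("role") ?? el.tagName.toLowerCase();
    // Cheap stand-in for a real accessible-name computation.
    const name = (el.getAttribute("aria-label") ?? el.getAttribute("placeholder") ?? el.textContent ?? "")
      .trim()
      .slice(0, 60);
    if (!name && !["input", "textarea", "select"].includes(role)) continue;
    lines.push(`[${elements.length}] <${role}> ${name}`); // e.g. "[3] <button> Search flights"
    elements.push(el);
  }
  return { summary: lines.join("\n"), elements };
}
```

The model then refers to elements by index, and the extension resolves the index back to a real node, which sidesteps both the token cost of raw HTML and brittle selectors.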

A funny observation was that some models have clearly been fine-tuned for web-browsing tasks, as they have memorized specific selectors (e.g. "the selector for the search input in Google Search is `.gLFyf`").
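That memorization is also exactly why hard-coded selectors are fragile: `.gLFyf` is an obfuscated, auto-generated class name that can be renamed at any time. If a memorized selector is used at all, it probably wants a semantic fallback, something like this (hypothetical helper, not from the extension above):

```typescript
// Try the memorized selector first, then fall back to attributes that survive
// class-name churn (name, role, aria-label).
function findSearchInput(doc: Document): HTMLElement | null {
  const memorized = doc.querySelector<HTMLElement>(".gLFyf");
  if (memorized) return memorized;
  return doc.querySelector<HTMLElement>(
    'textarea[name="q"], input[name="q"], [role="combobox"][aria-label*="search" i]'
  );
}
```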

[1] https://github.com/parsaghaffari/browserbee

bboygravity No.45032377
I'm trying to build an automatic form filler (not just web forms, any form), and I believe the secret lies in chaining a whole bunch of LLM, OCR, form-understanding, and other APIs together to get there.

A single LLM or agent is not going to cut it at the current state of the art. Just looking at the DOM/client-side source doesn't work, because you're basically asking the LLM to act like a browser and redo the rendering that the browser already does better (good luck with newer forms written in Angular that bypass the plain DOM). IMO the way to go is to have the toolchain look at forms/websites the same way humans do (purely visually, AFTER rendering is done) and take it from there.
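In outline, that chain might look something like the sketch below; `detectFields`, `matchFields`, and `clickAndType` are hypothetical stand-ins for whichever OCR, form-understanding, and LLM APIs end up glued together:

```typescript
// Visual-first form filling: render, screenshot, detect labeled field boxes,
// let an LLM decide which user value goes where, then act with synthetic input.

interface FieldBox { label: string; x: number; y: number; }

declare function detectFields(screenshotPng: string): Promise<FieldBox[]>; // OCR / layout model
declare function matchFields(
  fields: FieldBox[],
  data: Record<string, string>
): Promise<Array<{ box: FieldBox; value: string }>>; // LLM mapping step
declare function clickAndType(x: number, y: number, text: string): Promise<void>; // e.g. CDP input events

export async function fillForm(screenshotPng: string, userData: Record<string, string>): Promise<void> {
  const fields = await detectFields(screenshotPng);  // what fields are visible, and where?
  const plan = await matchFields(fields, userData);  // which of my values goes into which field?
  for (const step of plan) {
    await clickAndType(step.box.x, step.box.y, step.value); // act on rendered pixels, not the DOM
  }
}
```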

Source: I tried feeding page source into LLMs and asking them to fill out forms (as a Firefox add-on), but web devs are just too creative in the millions of ways they can ask for a simple freaking address (for example).
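Purely as an illustration of the problem: a DOM-side heuristic ends up accumulating an ever-growing pile of hints like this before you give up and go visual (hypothetical snippet):

```typescript
// Every site spells "street address" differently; a matcher over name/id/
// placeholder/autocomplete/label text only ever approximates the long tail.
const ADDRESS_HINTS = ["address", "street", "addr1", "address-line1", "strasse", "shipping_address"];

export function looksLikeAddressField(el: HTMLInputElement): boolean {
  const hints = [
    el.name,
    el.id,
    el.placeholder,
    el.getAttribute("autocomplete") ?? "",
    el.labels?.[0]?.textContent ?? "",
  ].join(" ").toLowerCase();
  return ADDRESS_HINTS.some((h) => hints.includes(h));
}
```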

Super tricky anyway, but there's no API more annoying than manually filling out forms, so hopefully it's worth the effort.