
Claude for Chrome

(www.anthropic.com)
795 points by davidbarker | 51 comments
1. parsabg ◴[] No.45031888[source]
I built a very similar extension [1] a couple of months ago that supports a wide range of models, including Claude, and enables them to take control of a user's browser using tools for mouse and keyboard actions, observation, etc. It's a fun little project to look at to understand how this type of thing works.

It's clear to me that the tech just isn't there yet. The information density of a web page with standard representations (DOM, screenshot, etc) is an order of magnitude lower than that of, say, a document or piece of code, which is where LLMs shine. So we either need much better web page representations, or much more capable models, for this to work robustly. Having LLMs book flights by interacting with the DOM is sort of like having them code a web app using assembly. Dia, Comet, Browser Use, Gemini, etc are all attacking this and have big incentives to crack it, so we should expect decent progress here.

A funny observation was that some models have clearly been fine-tuned for web browsing tasks, as they have memorized specific selectors (e.g. "the selector for the search input in Google Search is `.gLFyf`").
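
For anyone curious what "take control using tools" looks like under the hood, the core of these extensions is a small set of tool primitives the agent loop exposes to the model. A rough sketch (names and shapes are illustrative, not BrowserBee's actual API):

    // Hypothetical tool surface; illustrative, not any specific extension's API.
    interface BrowserTool {
      name: string;
      description: string;
      run(args: Record<string, string>): Promise<string>;
    }

    const tools: BrowserTool[] = [
      {
        name: "click",
        description: "Click the element matching a CSS selector",
        run: async ({ selector }) => {
          (document.querySelector(selector) as HTMLElement | null)?.click();
          return `clicked ${selector}`;
        },
      },
      {
        name: "type",
        description: "Type text into an input and fire an input event",
        run: async ({ selector, text }) => {
          const el = document.querySelector(selector) as HTMLInputElement;
          el.value = text;
          el.dispatchEvent(new Event("input", { bubbles: true }));
          return `typed into ${selector}`;
        },
      },
      {
        name: "observe",
        description: "Return a trimmed text snapshot of the page",
        run: async () => document.body.innerText.slice(0, 5000),
      },
    ];

The loop then just alternates: send the `observe` output to the model, execute whichever tool call comes back, repeat until the task is done or the model gets lost.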

[1] https://github.com/parsaghaffari/browserbee

replies(11): >>45032377 #>>45032556 #>>45032983 #>>45033328 #>>45033344 #>>45033797 #>>45033828 #>>45035580 #>>45036238 #>>45037152 #>>45040560 #
2. bboygravity ◴[] No.45032377[source]
I'm trying to build an automatic form filler (not just web forms, any form) and I believe the secret lies in chaining a whole bunch of LLM, OCR, form-understanding, and other APIs together to get there.

Just one LLM or agent is not going to cut it at the current state of the art. Just looking at the DOM/client-side source doesn't work, because you're basically asking the LLM to act like a browser and redo the rendering that the browser already does better (good luck with newer forms written in Angular bypassing the DOM). IMO the way to go is to have the toolchain look at the forms/websites the same way humans do (purely visually, AFTER the rendering is done) and take it from there.
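
To make the chaining idea concrete, here's a sketch of what that toolchain might look like, with the OCR and LLM stages stubbed out as declarations (every name below is a hypothetical stand-in for whatever services you wire together):

    // Sketch of the chained approach: render -> OCR -> LLM mapping -> fill.
    // Every declared function is a hypothetical stand-in for a real service.
    interface OcrBox { text: string; x: number; y: number; w: number; h: number }
    interface FieldGuess { box: OcrBox; value: string }

    declare function screenshot(): Promise<Uint8Array>;       // rendered pixels, not DOM
    declare function ocr(png: Uint8Array): Promise<OcrBox[]>; // any OCR API
    declare function mapFields(                               // the LLM call
      boxes: OcrBox[],
      profile: Record<string, string>,
    ): Promise<FieldGuess[]>;
    declare function clickAndType(box: OcrBox, value: string): Promise<void>;

    async function fillForm(profile: Record<string, string>): Promise<void> {
      const png = await screenshot();                  // what a human actually sees
      const boxes = await ocr(png);                    // labels + coordinates
      const guesses = await mapFields(boxes, profile); // LLM pairs labels with values
      for (const g of guesses) {
        await clickAndType(g.box, g.value);            // drive the UI, not the DOM
      }
    }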

Source: I tried to feed web source into LLMs and ask them to fill out forms (as a Firefox add-on), but webdevs are just too creative in the millions of ways they can ask for a simple freaking address (for example).

Super tricky anyway, but there's no more annoying API than manually filling out forms, so worth the effort hopefully.

3. threatofrain ◴[] No.45032556[source]
> Having LLMs book flights by interacting with the DOM is sort of like having them code a web app using assembly.

The DOM is merely the inexpensive option. Obviously the answer can't lie solely in the DOM; it has to involve the visual representation layer, because that's what is finally presented to the user.

Also, the DOM is already the subject of cat-and-mouse games; this will just add new scale and urgency to the problem. Now people will be putting fake content into the DOM and hiding content in the visual layer.

replies(2): >>45032675 #>>45033818 #
4. jonplackett ◴[] No.45032675[source]
It also surely leaves more room for prompt injection that the user can’t see.
5. miguelspizza ◴[] No.45032983[source]
> It's clear to me that the tech just isn't there yet.

Totally agree. This was the thesis behind MCP-B (now WebMCP https://github.com/MiguelsPizza/WebMCP)

HN Post: https://news.ycombinator.com/item?id=44515403

DOM and visual parsing are dead ends for browser automation. Not saying models are bad; they are great. The web is just not designed for them at all. It's designed for humans, and humans, dare I say, are pretty impressive creatures.

Providing an API contract between extensions and websites via MCP allows an AI to interact with a website as a first-class citizen. It just requires buy-in from website owners.

It's being proposed as a web standard: https://github.com/webmachinelearning/webmcp
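
To give a flavor of the idea: instead of the agent guessing at selectors, the site itself registers callable tools. A hedged sketch of what that contract could look like (the `modelContext.registerTool` shape here is my illustration, not necessarily the API the proposal will land on):

    // Illustrative only: registerTool's shape is my guess, not the final
    // API in the webmachinelearning/webmcp proposal.
    const modelContext = (navigator as any).modelContext;

    modelContext.registerTool({
      name: "search_flights",
      description: "Search this airline's flight inventory",
      inputSchema: {
        type: "object",
        properties: {
          from: { type: "string" },
          to:   { type: "string" },
          date: { type: "string" },
        },
        required: ["from", "to", "date"],
      },
      async execute({ from, to, date }: { from: string; to: string; date: string }) {
        // The site calls its own backend; the agent never scrapes the DOM.
        const res = await fetch(`/api/flights?from=${from}&to=${to}&date=${date}`);
        return JSON.stringify(await res.json());
      },
    });

The point is the inversion: the site publishes the contract, so the agent gets structured data and a stable interface instead of a pile of divs.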

replies(2): >>45033055 #>>45033221 #
6. shermantanktop ◴[] No.45033055[source]
> humans, dare I say, are pretty impressive creatures

Damn straight. Humanism in the age of tech obsession seems to be contrarian. But when it takes billions of dollars to match a 5-year-old’s common sense, maybe we should be impressed by the 5-year-old. They are amazing.

7. chatmasta ◴[] No.45033221[source]
I suspect this kind of framework will be adopted by websites with income streams that are not dependent on human attention (i.e. advertising revenue, mostly). They have no reason to resist LLM browser agents. But if they’re in the business of selling ads to human eyeballs, expect resistance.

Maybe the AI companies will find a way to resell the user’s attention to the website, e.g. “you let us browse your site with an LLM, and we’ll show your ad to the user.”

replies(2): >>45033387 #>>45033539 #
8. asdff ◴[] No.45033328[source]
It is kind of funny how these systems are set up: there is often dense, queryable information out there already for a lot of these tasks, but it gets ignored in favor of the difficult challenge of brute-forcing the human, consumer-facing UI instead of an existing API that is already designed to be machine-readable. E.g. booking flights: travel agents use software that queries all the airlines' ticket inventory to return flight information to you, the consumer. The problem of booking a flight is theoretically solved already by virtue of these APIs. But for AI agents this is now a stumbling block, because it would presumably take a little bit of time to craft a rule to cover this edge case and return far more accurate information and results. Consumers with no alternative don't know what they are missing, so there is no incentive to improve this.
replies(6): >>45033728 #>>45034068 #>>45034115 #>>45034274 #>>45034337 #>>45034796 #
9. adam_arthur ◴[] No.45033344[source]
The LLM should not be seeing the raw DOM in its context window, but a highly simplified and compact version of it.

In general LLMs perform worse both when the context is larger and also when the context is less information dense.

To achieve good performance, all input to the prompt must be made as compact and information dense as possible.

I built a similar tool as well, but for automating generation of E2E browser tests.

Further, you can have sub-LLMs help with compacting aspects of the context prior to handing it off to the main LLM. (Note: it's important that, by design, HTML selectors cannot be hallucinated.)
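
A minimal sketch of that design, assuming you build the element registry yourself: the model only ever sees opaque numeric IDs, so any ID it emits either resolves to a real element or fails loudly; there is no selector string for it to invent:

    // Compact the page to its interactive elements, keyed by opaque IDs.
    // The model sees lines like "[3] button 'Add to cart'" and answers with
    // an ID, so it cannot fabricate a selector that doesn't exist.
    const registry = new Map<number, Element>();

    function compactDom(): string {
      registry.clear();
      const lines: string[] = [];
      let id = 0;
      document
        .querySelectorAll("a, button, input, select, textarea")
        .forEach((el) => {
          registry.set(id, el);
          const label = (el.getAttribute("aria-label") ?? el.textContent ?? "")
            .trim()
            .slice(0, 60);
          lines.push(`[${id}] ${el.tagName.toLowerCase()} '${label}'`);
          id++;
        });
      return lines.join("\n"); // hundreds of tokens instead of a raw DOM dump
    }

    function resolve(idFromModel: number): Element {
      const el = registry.get(idFromModel);
      if (!el) throw new Error(`model referenced nonexistent element ${idFromModel}`);
      return el;
    }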

Modern LLMs are absolutely capable of interpreting web pages proficiently if implemented well.

That being said, things like this Claude product seem to be fundamentally poorly designed from both a security and general approach perspective and I don't agree at all that prompt engineering is remotely the right way to remediate this.

There are so many companies pushing out junk products where the AI is just handling the wrong part of the loop and pulls in far too much context to perform well.

replies(3): >>45033640 #>>45033809 #>>45034673 #
10. onesociety2022 ◴[] No.45033387{3}[source]
Even the websites whose primary source of revenue is not ad impressions might be resistant to let the agents be the primary interface through which users interact with their service.

Instacart currently seems to be very happy to let ChatGPT Operator use its website to place an order (https://www.instacart.com/company/updates/ordering-groceries...) [1]. But what happens when the primary interface for shopping with Instacart is no longer their website or their mobile app? OpenAI could demand a huge take rate for orders placed via ChatGPT agents, and if they don't agree to it, ChatGPT can strike a deal with a rival company and push traffic to that service instead. I think Amazon is never going to agree to let other agents use its website for shopping for the same reason (they will restrict it to just Alexa).

[1] - the funny part is the Instacart CEO quit shortly after this and joined OpenAI as CEO of Applications :)

replies(1): >>45033512 #
11. miguelspizza ◴[] No.45033512{4}[source]
The side-panel browser agent is a good middle ground here. The user is still there, looking at the website via their own browser session; the AI just has access to the specific functionality the website wants to expose to it. The human can take over or stop the AI if things go south.
12. miguelspizza ◴[] No.45033539{3}[source]
The primary client for WebMCP-enabled websites is a Chrome extension like Claude for Chrome, so the human is still in the loop, looking at the screen. MCP also supports things like elicitation, so the website can stop the model and request human input/attention.
13. antves ◴[] No.45033640[source]
This is exactly it! We built a browser agent and got awesome results by designing the context in a simplified/compact version + using small/efficient LLMs - it's smooth.sh if you'd like to try it.
14. ambicapter ◴[] No.45033728[source]
Those APIs aren't generally available to the public, are they?
replies(1): >>45033906 #
15. felarof ◴[] No.45033797[source]
Just dumping the raw DOM into the LLM context is brutal on token usage. We've seen pages that eat up 60-70k tokens when you include the full DOM plus screenshots, which basically maxes out your context window before you even start doing anything useful.

We've been working on this exact problem at https://github.com/browseros-ai/BrowserOS. Instead of throwing the entire DOM at the model, we hook into Chromium's rendering engine to extract a cleaner representation of what's actually on the page. Our browser agents work with this cleaned-up data, which makes the whole interaction much more efficient.

replies(5): >>45034412 #>>45034593 #>>45036054 #>>45036065 #>>45038003 #
16. felarof ◴[] No.45033809[source]
> The LLM should not be seeing the raw DOM in its context window, but a highly simplified and compact version of it.

Precisely! There's already something called the accessibility tree, which the Chromium rendering engine constructs; it's a semantically meaningful version of the DOM.

This is what we use at BrowserOS.com
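
For anyone who wants to eyeball that tree, Puppeteer can dump Chromium's accessibility snapshot directly; a quick inspection sketch (just for poking around, not BrowserOS's actual pipeline):

    import puppeteer from "puppeteer";

    // Print Chromium's accessibility tree: role + name per node, which is
    // far more compact than the serialized DOM for most pages.
    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto("https://example.com");
      const tree = await page.accessibility.snapshot(); // semantic view of the page
      const walk = (node: any, depth = 0): void => {
        console.log(`${"  ".repeat(depth)}${node.role}: ${node.name ?? ""}`);
        for (const child of node.children ?? []) walk(child, depth + 1);
      };
      if (tree) walk(tree);
      await browser.close();
    })();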

17. mikepurvis ◴[] No.45033818[source]
I had the same thought that really an LLM should interact with a browser viewport and just leverage normal accessibility features like tabbing between form fields and links, etc.

Basically the LLM sees the viewport as a thumbnail image and goes “That looks like the central text, read that,” and then some underlying skill implementation selects and returns the textual content from the viewport.

18. dotproto ◴[] No.45033828[source]
Just took a quick glance at your extension and observed that it's currently using the "debugger" permission. What features necessitated using this API rather than leveraging content scripts and less invasive WebExtensions APIs?
19. asdff ◴[] No.45033906{3}[source]
Not always, but Anthropic is not exactly the public either.
20. dudeWithAMood ◴[] No.45034068[source]
Dude, you do not understand how bad those "APIs" are for booking flights. Customers of Travelport often have screen-reading software that reads/writes to a green screen. There's also teletype, but most of the GDS providers use old IBM TPF mainframes.

I spent the first two years of my career in the space; we joked that anything invented after Michael Jackson's song Thriller wasn't present.

replies(2): >>45034316 #>>45047476 #
21. zukzuk ◴[] No.45034115[source]
This is a massive problem in healthcare, at least here in Canada. Most of the common EMRs doctors and other practitioners use either don’t have APIs, or if APIs exist they are closely guarded by the EMR vendors. And EMRs are just one of the many software tools clinics have to juggle.

I’d argue that lack of interoperability is one of the biggest problems in the healthcare system here, and getting access to data through the UI intended for humans might just end up being the only feasible solution.

replies(2): >>45036548 #>>45047469 #
22. darepublic ◴[] No.45034274[source]
It's because of legacy systems and people who basically have a degenerate attitude toward user interface/user experience. They see job security in a friction-heavy process. Hence the "brute forcing"... easier that than appealing to human nature.
23. cicloid ◴[] No.45034316{3}[source]
Somewhere in the world there is someone crying while using QIK…
24. shswkna ◴[] No.45034337[source]
To add to this, it is even funnier how travel agents undergo training in order to be able to interface with and operate the "machine readable" APIs for booking flight tickets.

What a paradoxical situation now emerges, where human travel agents still need to train for the machine interface, while AI agents are now being trained to take over the human jobs by getting them to use the consumer interfaces (aka booking websites) available to us.

replies(1): >>45036747 #
25. commanderkeen08 ◴[] No.45034412[source]
Playwright's MCP had a strong idea: default to the accessibility tree instead of the DOM. Unfortunately, even that is pretty chonky.
26. apitman ◴[] No.45034593[source]
Maybe people will start making simpler/smaller websites in order to work better with AI tools. That would be nice.
replies(1): >>45035750 #
27. tempestn ◴[] No.45034673[source]
Is it just me, or do both of my sibling comments pitching competing AI projects read like they're written by (the same underlying) AI?
replies(1): >>45035426 #
28. makeitdouble ◴[] No.45034796[source]
This was the Rabbit R1's conundrum. Uber/DoorDash/Spotify have APIs for external integration, but they require business deals and negotiations.

So how do you avoid talking to the service's business people? Provide a chain of Rube Goldberg machines that use these services as if they were the user. It can then be touted as flexibility, with the state of technology to blame when it inevitably breaks, if it even worked in the first place.

replies(1): >>45046745 #
29. bergie3000 ◴[] No.45035426{3}[source]
You're exactly right! I see the problem now.
replies(1): >>45037678 #
30. Exoristos ◴[] No.45035580[source]
Do we regret, yet, letting the Semantic Web wither on the vine?
replies(3): >>45035738 #>>45035761 #>>45037155 #
31. worthless-trash ◴[] No.45035738[source]
/s No, because if it doesn't help people consume, it's NOT important.
32. pishpash ◴[] No.45035750{3}[source]
You just need to capture the rendering and represent that.
33. pishpash ◴[] No.45035761[source]
You might get it when bots write pages.
34. edg5000 ◴[] No.45036054[source]
It could work similar to Claude Code, right? Where it won't ingest the entire codebase, but rather searches for certain strings or starts looking at a directed location and follows references from there. Indeed, it seems infeasible to ingest the whole thing.
35. kodefreeze ◴[] No.45036065[source]
This is really interesting. We've been working on a smaller subset of this problem space. We've also found that in some cases you need to somehow pass the model the sequence of events that happened (like a video of a transition).

For instance, we were running a test case on an e-commerce website that had a random popup which came up after the initial DOM was rendered but before action could be taken. This would confuse the LLM about the next action it needed to take, because it didn't know the popup had come up.

36. hinoki ◴[] No.45036238[source]
How do screen readers work? I’ve used all the aria- attributes to make automation/scraping hopefully more robust, but don’t have experience beyond that. Could accessibility attributes also help condense the content into something more manageable?
37. j45 ◴[] No.45036548{3}[source]
I’m not sure how unique or new a problem this is, first to me individually and then generally.

Automation technologies for things like UI automation existed long before LLMs and work quite fine.

Having intentionally imprecise and non-deterministic software try to behave in a deterministic manner, like all the software we’re used to, is something else.

replies(2): >>45037975 #>>45046968 #
38. originalvichy ◴[] No.45036747{3}[source]
This is exactly the conversation I had with a colleague of mine. They were excited about how LLMs can help people interact with data and visualize it nicely, but I just had to ask - with as little snark as possible - if this wasn't what a monitor and a UI were already doing? It seems like these LLMs are being used as the cliche "hammer that solves all the problems" where problems didn't even exist. Just because we are excited about how an LLM can chew through formatted API data (which is hard for humans to read) doesn't mean that we didn't already solve this with UIs displaying this data.

I don't know why people want to turn the internet into a turn-based text game. The UI is usually great.

replies(1): >>45038316 #
39. aminkhorrami ◴[] No.45037152[source]
Super cool
40. mike_hearn ◴[] No.45037155[source]
It didn't really wither on the vine; it just moved to JSON REST APIs, with React as the layer that maps the model to the view. What's missing is API discovery, which MCP provides.

The problem with the concept is not really the tech. The problem is the incentives. Companies don't have much incentive to offer APIs, in most cases. It just risks adding a middleman who will try and cut them out. Not many businesses want to be reduced to being just an API provider, it's a dead end business and thus a dead end career/lifestyle for the founders or executives. The telcos went through this in the early 2000s where their CEOs were all railing against a future of becoming "dumb pipes". They weren't able to stop it in the end, despite trying hard. But in many other cases companies did successfully avoid that fate.

MCP+API might be different or it might not. It eliminates some of the downsides of classical API work, like needing to guarantee stability and commit to a feature set. But it still poses the risk of losing control of your own brand and user experience. The obvious move is for OpenAI to come along and demand a rev share if too many customers are interacting with your service via ChatGPT, just like Google effectively demands a rev share for sending traffic to your website, because so many customers interact with the internet via web search.

41. sitkack ◴[] No.45037678{4}[source]
It's not just an ad; it is a fundamental paradigm shift.
42. zukzuk ◴[] No.45037975{4}[source]
The people that use these UIs are already imprecise and non-deterministic, yet that hasn’t stopped anyone from hiring them.

The potential advantage of using non-deterministic AI for this is that 1) “programming” it to do what needs to be done is a lot easier, and 2) it tends to handle exceptions more gracefully.

You’re right that the approach is nothing new, but it hasn’t taken off, arguably at least in part because it’s been too cumbersome to be practical. I have some hope that LLMs will help change this.

43. ◴[] No.45038003[source]
44. chamomeal ◴[] No.45038316{4}[source]
I’ve been thinking about this a lot too, in terms of signal/noise. LLMs can extract signal from noise (“summarize this fluff-filled 2 page corporate email”) but they can also create a lot of noise around signal (“write me a 2 page email that announces our RTO policy”).

If you’re using LLMs to extract signal, then the information should have been denser/more queryable in the first place. Maybe the UI could have been better, or your boss could have had better communication skills.

If you’re using them to CREATE noise, you need to stop doing that lol.

Most of the uses of LLMs that I see are mostly extracting signal or making noise. The exception to these use cases is making decisions that you don’t care about, and don’t want to make on your own.

I think this is why they’re so useful for programming. When you write a program, you have to specify every single thing about the program, at the level of abstraction of your language/framework. You have to make any decision that can’t be automated. Which ends up being a LOT of decisions. How to break up functions, what you name your variables, do you map/filter or reduce that list, which side of the API do you format the data on, etc. In any given project you might make 100 decisions, but only care about 5 of them. But because it’s a program, you still HAVE to decide on every single thing and write it down.

A lot of this has been automated (garbage collectors remove a whole class of decision making), but some of it can never be. Like maybe you want a landing page that looks vaguely like a skate brand. If you don’t specifically have colors/spacing/fonts all decided on, an LLM can make those decisions for you.

replies(1): >>45051747 #
45. akrymski ◴[] No.45040560[source]
I think this will fail for the same reason RSS failed - the business case just isn't there.
46. digitaltrees ◴[] No.45046745{3}[source]
This is definitely true, but there are more reasons why so many teams choose the seemingly irrational path. First, so many APIs are designed differently, so even if you decide the business negotiation is worth it, you have development work ahead. Second, tons of vendors don't even have an API. So the thought of building one tool once is appealing.
replies(1): >>45049389 #
47. digitaltrees ◴[] No.45046968{4}[source]
The cost to develop and maintain UI automation is prohibitive for most companies
48. asdff ◴[] No.45047469{3}[source]
It raises the question, though: if these vendors guard their APIs so closely to shake people down for an enterprise license, why would they suddenly be permissive toward an LLM subverting that payment flow? Chances are, the fact that the LLM can interact with these systems is a blip: once they see appreciable adoption, the systems will be locked down to prevent the LLM from essentially pirating the service for you.
49. asdff ◴[] No.45047476{3}[source]
And yet, they exist, and software has been built on top of them already.
50. makeitdouble ◴[] No.45049389{4}[source]
Those are of course valid points. The counterpart being that a vendor might not have an API because they actively don't want to (Twitter/X for instance...), and when they have one, clients trying to circumvent their system to basically scrape the user UX won't be welcomed either.

So most of the time that path of "build a tool once" will be adversarial towards the service, which will be incentivized to actively kill your ad-hoc integration if they can without too much collateral damage.

51. originalvichy ◴[] No.45051747{5}[source]
That's a nice way of explaining it. I also feel like some sort of LLM purist for being critical of features that serve only to pollute emails and comms with robotic text not written by an actual person. As societies, we will have to come up with a new metric for TL;DR or "this was a perfectly cohesive and concise text," since LLMs have obscured the line.