
Claude for Chrome

(www.anthropic.com)
795 points | davidbarker | 4 comments
parsabg No.45031888
I built a very similar extension [1] a couple of months ago that supports a wide range of models, including Claude, and enables them to take control of a user's browser using tools for mouse and keyboard actions, observation, etc. It's a fun little project to look at to understand how this type of thing works.

It's clear to me that the tech just isn't there yet. The information density of a web page under standard representations (DOM, screenshot, etc.) is an order of magnitude lower than that of, say, a document or a piece of code, which is where LLMs shine. So we either need much better web page representations, or much more capable models, for this to work robustly. Having LLMs book flights by interacting with the DOM is a bit like having them code a web app in assembly. Dia, Comet, Browser Use, Gemini, etc. are all attacking this and have big incentives to crack it, so we should expect decent progress here.
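The observe/act loop such an extension runs can be sketched in miniature. Everything below is illustrative: the FakeBrowser, the scripted "model," and the tool names are stand-ins for a real browser bridge and a real LLM call, not BrowserBee's actual API.

```python
# Minimal sketch of the observe/act loop a browser-control agent runs.
# FakeBrowser stands in for a real browser bridge (e.g. extension APIs);
# scripted_model stands in for an LLM choosing the next tool call.

class FakeBrowser:
    def __init__(self):
        self.fields = {"#search": ""}
        self.submitted = None

    def observe(self):
        # A real agent would return a DOM snapshot and/or screenshot here.
        return {"fields": dict(self.fields), "submitted": self.submitted}

    def type_text(self, selector, text):
        self.fields[selector] = text

    def click(self, selector):
        if selector == "#go":
            self.submitted = self.fields["#search"]

def scripted_model(observation):
    # Stand-in for the LLM: map the current observation to a tool call.
    if observation["fields"]["#search"] == "":
        return ("type_text", "#search", "flights SFO to JFK")
    if observation["submitted"] is None:
        return ("click", "#go")
    return ("done",)

def run_agent(browser, model, max_steps=10):
    # The core loop: observe the page, ask the model for an action, execute.
    for _ in range(max_steps):
        action = model(browser.observe())
        if action[0] == "done":
            break
        getattr(browser, action[0])(*action[1:])
    return browser.observe()

browser = FakeBrowser()
final = run_agent(browser, scripted_model)
print(final["submitted"])  # → flights SFO to JFK
```

A real agent swaps `scripted_model` for an LLM call that receives the observation and returns a tool invocation; the loop shape stays the same, which is also why the fidelity of `observe()` ends up being the bottleneck.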

A funny observation was that some models have clearly been fine-tuned for web browsing tasks, as they have memorized specific selectors (e.g. "the selector for the search input in Google Search is `.gLFyf`").

[1] https://github.com/parsaghaffari/browserbee

asdff No.45033328
It is kind of funny how these systems are set up: dense, queryable information often already exists for these tasks, yet it gets ignored in favor of the difficult challenge of brute-forcing the human consumer-facing UI instead of an existing API designed to be machine readable in the first place. Take booking flights. Travel agents use software that queries the airlines' ticket inventory and returns flight information to you, the consumer. Booking a flight is, in theory, a solved problem by virtue of those APIs. But for AI agents it's now a stumbling block, even though it would presumably take only a little time to special-case it and return far more accurate information and results. Consumers with no alternative don't know what they're missing, so there is no incentive to improve this.
shswkna No.45034337
To add to this, it is even funnier that travel agents undergo training in order to interface with and operate the "machine readable" APIs for booking flight tickets.

A paradoxical situation now emerges: human travel agents still have to train on the machine interface, while AI agents are being trained to take over their jobs by using the consumer interfaces (aka booking websites) available to the rest of us.

originalvichy No.45036747
This is exactly the conversation I had with a colleague of mine. They were excited about how LLMs can help people interact with data and visualize it nicely, but I just had to ask, with as little snark as possible, whether that isn't what a monitor and a UI already do. It seems like LLMs are being used as the cliché "hammer that solves all the problems," even where no problem existed. Just because we're excited that an LLM can chew through formatted API data (which is hard for humans to read) doesn't mean we hadn't already solved this with UIs that display the data.

I don't know why people want to turn the internet into a turn-based text game. The UI is usually great.

chamomeal No.45038316
I’ve been thinking about this a lot too, in terms of signal/noise. LLMs can extract signal from noise (“summarize this fluff-filled 2 page corporate email”) but they can also create a lot of noise around signal (“write me a 2 page email that announces our RTO policy”).

If you’re using LLMs to extract signal, then the information should have been denser/more queryable in the first place. Maybe the UI could have been better, or your boss could have had better communication skills.

If you’re using them to CREATE noise, you need to stop doing that lol.

Most of the LLM uses I see are either extracting signal or making noise. The exception is making decisions that you don't care about and don't want to make on your own.

I think this is why they’re so useful for programming. When you write a program, you have to specify every single thing about the program, at the level of abstraction of your language/framework. You have to make any decision that can’t be automated. Which ends up being a LOT of decisions. How to break up functions, what you name your variables, do you map/filter or reduce that list, which side of the API do you format the data on, etc. In any given project you might make 100 decisions, but only care about 5 of them. But because it’s a program, you still HAVE to decide on every single thing and write it down.

A lot of this has been automated (garbage collectors remove a whole class of decision making), but some of it can never be. Like maybe you want a landing page that looks vaguely like a skate brand. If you don’t specifically have colors/spacing/fonts all decided on, an LLM can make those decisions for you.

originalvichy No.45051747
That's a nice way of explaining it. I also feel like something of an LLM purist for being critical of features that serve only to pollute emails and comms with robotic text not written by an actual person. As societies we will have to come up with a new metric for TL;DR, or for "this was a perfectly cohesive and concise text," since LLMs have blurred that line.