Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals

(blog.skyvern.com)

49 points suchintan | 4 comments | 17 Jan 25 15:23 UTC


lyime ◴[17 Jan 25 20:28 UTC] No.42742916[source]▶

This is an impressive tool. I especially like the observability around the workflow and the steps it takes to achieve the outcome. We are potentially interested in exploring this if we can get the cost down at scale.

replies(1): >>42743230 #

1. suchintan ◴[17 Jan 25 21:07 UTC] No.42743230[source]▶

>>42742916 #

I'd love to chat to see how we can help! Here's my email: suchintan@skyvern.com

We're working on 2 major improvements that will get cost down at scale:

1. We're building a code generation layer under the hood that will start to memorize actions Skyvern has taken on a website, so repeated runs will be nearly free.

2. We're exploring some graph re-ranking techniques to eliminate useless elements from the HTML DOM when analyzing the page. For example, if you're looking at the product page and want to add a product to cart, the likelihood you'll need to interact with the Reviews page will be 0. No need to send that context along to the LLM.
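To make the DOM-pruning idea concrete, here is a minimal sketch in Python. It is not Skyvern's implementation and it is not graph re-ranking: it swaps in a plain keyword-overlap score as a stand-in, just to show the shape of "score each element against the goal, drop the ones that score too low". The `Element` fields, the `prune` function, and the `0.2` threshold are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Element:
    """Simplified interactive element pulled from a page (hypothetical shape)."""
    element_id: str
    tag: str
    text: str
    section: str  # page region the element sits in, e.g. "buy-box" or "reviews"


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(goal: str, element: Element) -> float:
    """Fraction of goal tokens that also appear in the element's text or section."""
    goal_tokens = tokenize(goal)
    if not goal_tokens:
        return 0.0
    element_tokens = tokenize(f"{element.text} {element.section}")
    return len(goal_tokens & element_tokens) / len(goal_tokens)


def prune(goal: str, elements: list[Element], threshold: float = 0.2) -> list[Element]:
    """Keep only elements whose relevance to the goal meets the threshold."""
    return [e for e in elements if relevance(goal, e) >= threshold]


elements = [
    Element("e1", "button", "Add to cart", "buy-box"),
    Element("e2", "a", "See all reviews", "reviews"),
    Element("e3", "button", "Helpful", "reviews"),
]

# Only the surviving elements would be serialized into the LLM prompt.
kept = prune("add product to cart", elements)
print([e.element_id for e in kept])
```

In this toy example the two elements in the reviews section share no tokens with the goal, so they are dropped before anything is sent to the model. A real system would presumably need a much better relevance signal than token overlap.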

replies(1): >>42744064 #

2. dataviz1000 ◴[17 Jan 25 22:46 UTC] No.42744064[source]▶

>>42743230 (TP) #

> We're exploring some graph re-ranking techniques to eliminate useless elements from the HTML DOM when analyzing the page.

Computer vision is useful and very quick; however, in my experience, parsing the stacking context is much more useful. The problem is creating a stacking context when a news site embeds a YouTube or Bluesky post. It requires injecting a script into each embed using Playwright. (Not mine, but prior art [0].)

In my free time, I've been quietly solving a problem I encountered while creating browser agents, one that didn't have a solution 2 years ago. Most webpages consist of several independent global execution contexts, and I'm developing a coherent way to get them all to speak with each other. [1]

> "Go to Amazon.com and add an iPhone 16, a screen protector, and a case to cart"

Are you familiar with Google Dialogflow? [2] It is a service that returns an object with an intent and parameters, which makes it easy to map to automation actions. I asked ChatGPT to help with an example of how Dialogflow might handle your request. [3]
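As a rough illustration of the intent-plus-parameters idea, here is a small Python sketch of how such an object could be dispatched to automation steps. The dictionary below is a simplified, hand-written stand-in, not an actual Dialogflow response, and the intent name `add_to_cart`, the parameter names, and the `plan_actions` helper are all assumptions made up for this example.

```python
from __future__ import annotations

from typing import Callable


def add_to_cart(params: dict) -> list[str]:
    """Turn the parameters of a hypothetical 'add_to_cart' intent into step descriptions."""
    site = params.get("site", "unknown site")
    return [
        f"search {site} for '{item}' and add it to cart"
        for item in params.get("items", [])
    ]


# Registry mapping intent names to handlers (names are hypothetical).
HANDLERS: dict[str, Callable[[dict], list[str]]] = {
    "add_to_cart": add_to_cart,
}


def plan_actions(result: dict) -> list[str]:
    """Look up the handler for the detected intent and expand it into steps."""
    handler = HANDLERS.get(result["intent"])
    if handler is None:
        raise ValueError(f"no handler for intent {result['intent']!r}")
    return handler(result.get("parameters", {}))


# Hand-written stand-in for what an intent-detection service might return
# for the "Go to Amazon.com and add ..." request quoted above.
parsed = {
    "intent": "add_to_cart",
    "parameters": {
        "site": "Amazon.com",
        "items": ["iPhone 16", "screen protector", "case"],
    },
}

steps = plan_actions(parsed)
for step in steps:
    print(step)
```

The point of the registry is that the natural-language parsing and the browser automation stay decoupled: each intent gets one handler, and an unrecognized intent fails loudly instead of being guessed at.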

[0] https://github.com/andreadev-it/stacking-contexts-inspector

[1] https://news.ycombinator.com/item?id=42576240

[2] https://cloud.google.com/dialogflow/es/docs/intents-overview

[3] https://chatgpt.com/share/678ae18d-5370-8004-97d4-f9949887b0...

replies(1): >>42745613 #

3. MarcelOlsz ◴[18 Jan 25 03:39 UTC] No.42745613[source]▶

>>42744064 #

How can I reach out to you?

replies(1): >>42746044 #

4. dataviz1000 ◴[18 Jan 25 05:12 UTC] No.42746044{3}[source]▶

>>42745613 #

I'm at [HN username]@gmail.com
