We're working on 2 major improvements that will get cost down at scale: 1. We're building a code generation layer under the hood that will start to memorize actions Skyvern has taken on a website, so repeated runs will be nearly free 2. We're exploring some graph re-ranking techniques to eliminate useless elements from the HTML DOM when analyzing the page. For example, if you're looking at the product page and want to add a product to cart, the likelihood you'll need to interact with the Reviews page will be 0. No need to send that context along to the LLM
Computer vision is useful and very quick, however, it has been my experience parsing stacking context is much more useful. The problem is creating a stacking context when a news site embeds a youtube or blusky post. It requires injecting script into each using playwright. (Not mine, but, prior art [0]).
I've been quietly solving a problem I encountered creating browser agents that didn't have a solution 2 years ago in my free time. Most webpages are several independent global execution contexts and I'm developing a coherent way to get them all to speak with each other. [1]
> "Go to Amazon.com and add an iPhone 16, a screen protector, and a case to cart"
Are you familiar with Google Dialogflow? [2] It is a service which returns an object with intent and parameters which make it is to map to automation actions. I asked GhatGPT to help with an example of how Dialogflow might handle your request. [3]
[0] https://github.com/andreadev-it/stacking-contexts-inspector
[1] https://news.ycombinator.com/item?id=42576240
[2] https://cloud.google.com/dialogflow/es/docs/intents-overview
[3] https://chatgpt.com/share/678ae18d-5370-8004-97d4-f9949887b0...