
144 points | anerli | 1 comment

Hey HN, Anders and Tom here. We had a post about our AI test automation framework 2 months ago that got a decent amount of traction (https://news.ycombinator.com/item?id=43796003).

We got some great feedback from the community, with the most positive response being about our vision-first approach used in our browser agent. However, many wanted to use the underlying agent outside the testing domain. So today, we're releasing our fully featured AI browser automation framework.

You can use it to automate tasks on the web, integrate between apps without APIs, extract data, test your web apps, or as a building block for your own browser agents.

Traditionally, browser automation could only be done via the DOM, even though that’s not how humans use browsers. Most browser agents are still stuck in this paradigm. With a vision-first approach, we avoid relying on flaky DOM navigation and perform better on complex interactions found in a broad variety of sites, for example:

- Drag and drop interactions

- Data visualizations, charts, and tables

- Legacy apps with nested iframes

- Canvas and WebGL-heavy sites (like design tools or photo editing)

- Remote desktops streamed into the browser

To interact accurately with the browser, we use visually grounded models to execute precise actions based on pixel coordinates. The model used by Magnitude must be smart enough to plan out actions but also able to execute them. Not many models are both smart *and* visually grounded. We highly recommend Claude Sonnet 4 for the best performance, but if you prefer open source, we also support Qwen-2.5-VL 72B.
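To illustrate what acting on pixel coordinates can involve, here is a minimal sketch. It assumes the model is shown a resized screenshot and returns a point in that image's coordinate space, which then has to be mapped back to the viewport. The names `Size`, `Point`, and `toViewportPoint` are hypothetical and are not Magnitude's API; the post does not describe how Magnitude does this internally.

```typescript
// Hypothetical helper for illustration only, not Magnitude's implementation.
interface Size { width: number; height: number }
interface Point { x: number; y: number }

// Map a point predicted on a (possibly resized) screenshot back to
// pixel coordinates in the browser viewport.
function toViewportPoint(predicted: Point, screenshot: Size, viewport: Size): Point {
  if (screenshot.width <= 0 || screenshot.height <= 0) {
    throw new Error('screenshot dimensions must be positive');
  }
  const scaleX = viewport.width / screenshot.width;
  const scaleY = viewport.height / screenshot.height;
  // Clamp so a slightly out-of-bounds prediction still lands inside the page.
  const clamp = (value: number, max: number) => Math.min(Math.max(value, 0), max - 1);
  return {
    x: clamp(Math.round(predicted.x * scaleX), viewport.width),
    y: clamp(Math.round(predicted.y * scaleY), viewport.height),
  };
}

const point = toViewportPoint(
  { x: 512, y: 384 },            // model output on the screenshot
  { width: 1024, height: 768 },  // screenshot size given to the model
  { width: 1280, height: 960 },  // actual viewport size
);
console.log(point); // { x: 640, y: 480 }
```

The clamping step is a design choice: a prediction that lands a few pixels outside the image is treated as a click on the nearest edge rather than an error.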

Most browser agents never make it to production. This is because of (1) the flaky DOM navigation mentioned above, and (2) the lack of control most browser agents offer. The dominant paradigm is that you give the agent a high-level task + tools and hope for the best. This quickly falls apart for production automations that need to be reliable and specific. With Magnitude, you have fine-grained control over the agent with our `act()` and `extract()` syntax, and can mix it with your own code as needed. You also have full control of the prompts at both the action and agent level.

```ts
import { z } from 'zod';

// Assumes `agent` is an already-started Magnitude browser agent.

// Magnitude can handle high-level tasks
await agent.act('Create an issue', {
  // Optionally pass data that the agent will use where appropriate
  data: {
    title: 'Use Magnitude',
    description: 'Run "npx create-magnitude-app" and follow the instructions',
  },
});

// It can also handle low-level actions
await agent.act('Drag "Use Magnitude" to the top of the in progress column');

// Intelligently extract data based on the DOM content matching a provided zod schema
const tasks = await agent.extract(
  'List in progress issues',
  z.array(z.object({
    title: z.string(),
    description: z.string(),
    // Agent can extract existing data or new insights
    difficulty: z.number().describe('Rate the difficulty between 1-5'),
  })),
);
```
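As one way to picture mixing agent steps with your own code, here is a minimal sketch of a retry wrapper that runs a step and then verifies the result with ordinary code. Only the `act()` name comes from the post; the `ActAgent` interface, `actWithCheck`, and the stand-in `fakeAgent` are hypothetical, so that the example runs without a browser.

```typescript
// Hypothetical minimal agent shape; only the `act` name is taken from the post.
interface ActAgent {
  act(task: string): Promise<void>;
}

// Run one agent step, then verify the outcome with ordinary code.
// Retries when the step throws or the check does not pass.
async function actWithCheck(
  agent: ActAgent,
  task: string,
  check: () => Promise<boolean>,
  maxAttempts = 3,
): Promise<number> {
  let lastFailure = 'check did not pass';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await agent.act(task);
      if (await check()) return attempt; // number of attempts used
      lastFailure = 'check did not pass';
    } catch (err) {
      lastFailure = String(err);
    }
  }
  throw new Error(`"${task}" failed after ${maxAttempts} attempts: ${lastFailure}`);
}

// Stand-in agent for demonstration: the issue only "appears" on the second call.
let calls = 0;
let issueCreated = false;
const fakeAgent: ActAgent = {
  async act() {
    calls += 1;
    if (calls >= 2) issueCreated = true;
  },
};

actWithCheck(fakeAgent, 'Create an issue', async () => issueCreated)
  .then((attempts) => console.log(`succeeded after ${attempts} attempt(s)`));
```

With a real agent, the `check` callback could be an `extract()` call or a direct query against your own backend.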

We have a setup script that makes it trivial to get started with an example: just run "npx create-magnitude-app". We’d love to hear what you think!

Repo: https://github.com/magnitudedev/magnitude

dataviz1000 ◴[] No.44393178[source]
Hey guys, I got a question.

I've been working on a Chrome extension with a side panel. Think of it like the side-panel copilot in VSCode, Cursor, or Windsurf. Currently it automates workflows, but those are hard-coded. I've started working on a more generalized automation using langchain. Looking at your code is helpful because in only a few hundred lines of code I can recreate a huge portion of Playwright's capabilities in a Chrome extension side panel, so I should be able to port it to the Chrome extension. That is, I'm creating tools like mouse click, type, mouse move, open tab, navigate, wait for element, etc.
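For illustration, a minimal sketch of what such tools might look like, assuming the extension drives the page through the `chrome.debugger` API and the Chrome DevTools Protocol (the comment does not say which mechanism is used). The `ToolCall` shapes and `toCdpCommands` are hypothetical names; the CDP methods `Input.dispatchMouseEvent`, `Input.insertText`, and `Page.navigate` are real.

```typescript
// Hypothetical tool-call shapes an LLM might emit; names are illustrative.
type ToolCall =
  | { tool: 'click'; x: number; y: number }
  | { tool: 'mouseMove'; x: number; y: number }
  | { tool: 'type'; text: string }
  | { tool: 'navigate'; url: string };

interface CdpCommand {
  method: string;
  params: Record<string, unknown>;
}

// Translate one tool call into Chrome DevTools Protocol commands.
function toCdpCommands(call: ToolCall): CdpCommand[] {
  switch (call.tool) {
    case 'click': {
      // A click is a press followed by a release at the same point.
      const base = { x: call.x, y: call.y, button: 'left', clickCount: 1 };
      return [
        { method: 'Input.dispatchMouseEvent', params: { type: 'mousePressed', ...base } },
        { method: 'Input.dispatchMouseEvent', params: { type: 'mouseReleased', ...base } },
      ];
    }
    case 'mouseMove':
      return [{ method: 'Input.dispatchMouseEvent', params: { type: 'mouseMoved', x: call.x, y: call.y } }];
    case 'type':
      return [{ method: 'Input.insertText', params: { text: call.text } }];
    case 'navigate':
      return [{ method: 'Page.navigate', params: { url: call.url } }];
  }
}

// In an extension with the "debugger" permission, each command could be sent
// with chrome.debugger.sendCommand({ tabId }, method, params) after attaching.
const commands = toCdpCommands({ tool: 'click', x: 10, y: 20 });
console.log(commands.map((c) => c.params.type)); // [ 'mousePressed', 'mouseReleased' ]
```

Keeping the translation a pure function makes it easy to unit test outside the browser; only the thin layer that sends the commands needs the extension runtime.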

Looking at your code, I'm thinking about pulling anything that isn't coupled to node while mapping all the Playwright capabilities to the equivalent in a Chrome extension. It's busy work.

If I do that, why would I prefer using .baml over the equivalent langchain? What's the difference? Am I comparing apples to oranges? I'm not worried about using langgraph because I should be able to get most of the functionality with xstate v5 [0] plus serialized portable JSON state graphs, so I can store custom graphs on a remote server that can be queried by API.

That is my question. I don't see langchain in the dependencies, which is cool, but why .baml? Also, what am I missing going down this thought path?

[0] https://chatgpt.com/share/685dfc60-106c-8004-bbd0-1ba3a33aba...

replies(1): >>44393241 #
anerli ◴[] No.44393241[source]
Hey, curious about your use cases for a chrome extension, care to share more?

To answer your question - BAML is a DSL that helps to define prompts, organize context, and get better performance on structured output from the LLM. In theory you should be able to map similar logic over to other clients.

replies(2): >>44393287 #>>44393869 #
1. pzo ◴[] No.44393287[source]
A Chrome extension has the advantage of user-friendly distribution, so that non-tech-savvy users can also do automation. I'm also looking for automation for mobile devices (app webview or mobile Safari), and because of platform limitations it doesn't seem this can be extended to mobile devices anytime soon.