169 points anerli | 2 comments

Hey HN, Anders and Tom here - we’ve been building an end-to-end testing framework powered by visual LLM agents to replace traditional web testing.

We know there's a lot of noise about different browser agents. If you've tried any of them, you know they're slow, expensive, and inconsistent. That's why we built an agent specifically for running test cases and optimized it just for that:

- Pure vision instead of an error-prone "set-of-marks" system (the colorful boxes you see in browser-use, for example)

- Use a tiny VLM (Moondream) instead of OpenAI/Anthropic computer use for dramatically faster and cheaper execution

- Use two agents: one for planning and adapting test cases and one for executing them quickly and consistently.

The idea is that the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, the executor can kick back out to the planner agent, which re-adjusts the test.
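To make that control flow concrete, here is a minimal sketch of the "run from the saved plan, fall back to the planner on failure" loop. It is an illustration under assumptions, not Magnitude's actual code: `run_test`, the `plan` and `execute` callables, and the dict used as a plan cache are all hypothetical stand-ins for the planner LLM, the executor VLM, and the saved plan.

```python
from typing import Callable, Dict, List


def run_test(
    description: str,
    cache: Dict[str, List[str]],          # hypothetical saved plans, keyed by test description
    plan: Callable[[str], List[str]],     # stand-in for the (slow, expensive) planner agent
    execute: Callable[[str], bool],       # stand-in for the (fast, cheap) executor; True on success
) -> List[str]:
    """Run a test case, reusing a saved plan when one exists.

    Returns a log of the phases that ran, for illustration only.
    """
    log: List[str] = []

    # First run: no saved plan yet, so the planner has to build one.
    if description not in cache:
        cache[description] = plan(description)
        log.append("planned")

    # Try the saved plan with the executor only; on failure, re-plan once and retry.
    for attempt in range(2):
        if all(execute(action) for action in cache[description]):
            log.append("executed")
            return log
        if attempt == 0:
            cache[description] = plan(description)
            log.append("replanned")

    log.append("failed")
    return log
```

In this sketch a passing re-run never touches the planner, which is where the speed and cost savings would come from; the planner is only called again when the executor reports a failure.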

It’s completely open source. Would love to have more people try it out and tell us how we can make it great.

Repo: https://github.com/magnitudedev/magnitude

badmonster ◴[] No.43796404[source]
How does Magnitude differentiate between the planner and executor LLM roles, and how customizable are these components for specific test flows?
replies(1): >>43796586 #
anerli ◴[] No.43796586[source]
So the prompts that are sent to the planner vs executor are completely distinct. We allow complete customization of the planner LLM with all major providers (Anthropic, OpenAI, Google AI Studio, Google Vertex AI, AWS Bedrock, OpenAI compatible). The executor LLM, on the other hand, has to fit very specific criteria, so we only support the Moondream model right now. For a model to act as the executor, it needs to be able to specify exact pixel coordinates (only a few models support this, for example OpenAI/Anthropic computer use, Molmo, Moondream, and some others). We like Moondream because it's super tiny and fast (2B). This means that as long as we still have a "smart" planner LLM, we can have very fast/cheap execution and precise UI interaction.
replies(1): >>43796881 #
1. badmonster ◴[] No.43796881[source]
Does Moondream handle multi-step UI tasks reliably (like opening a menu, waiting for render, then clicking), or do you have to scaffold that logic separately in the planner?
replies(1): >>43797150 #
2. anerli ◴[] No.43797150[source]
The planner can plan out multiple web actions at once, which Moondream can then execute in sequence on its own. So Moondream is never deciding how to execute more than one web action in a single prompt.

What this really means for developers writing the tests is you don't really have to worry about it. A "step" in Magnitude can map to any number of web actions dynamically based on the description, and the agents will figure out how to do it repeatably.
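Roughly, the split looks like the sketch below. This is only an illustration under assumptions, not our real internals: `WebAction`, `run_step`, `locate`, and `perform` are hypothetical names. The point it shows is that one step expands into a list of planned actions, and the executor is asked about exactly one action per prompt, returning pixel coordinates for it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class WebAction:
    """One planned web action (hypothetical shape)."""
    kind: str       # e.g. "click" or "type"
    target: str     # natural-language description of the element to act on
    text: str = ""  # text to enter for "type" actions


def run_step(
    actions: List[WebAction],
    locate: Callable[[str], Tuple[int, int]],         # stand-in for the executor: target -> (x, y)
    perform: Callable[[WebAction, int, int], None],   # stand-in for the browser driver
) -> List[Tuple[str, int, int]]:
    """Execute the actions planned for a single step, in order.

    Each locate() call receives only one action's target, so the executor
    never has to reason about more than one web action at a time.
    """
    performed: List[Tuple[str, int, int]] = []
    for action in actions:
        x, y = locate(action.target)   # one executor prompt per action
        perform(action, x, y)
        performed.append((action.kind, x, y))
    return performed
```

A step like "log in" might expand to a click, a type, and another click; the test author only writes the step description.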