
223 points by edunteman | 2 comments

Hi HN! Erik here from Pig.dev, and today I'd like to share a new project we've just open sourced:

Muscle Mem is an SDK that records your agent's tool-calling patterns as it solves tasks, and will deterministically replay those learned trajectories whenever the task is encountered again, falling back to agent mode if edge cases are detected. Like a JIT compiler, for behaviors.
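
To make the pattern concrete, here is an illustrative sketch of the cache-then-replay loop (pseudocode in spirit; the class and method names below are made up, not the actual Muscle Mem API):

    from typing import Callable

    class TrajectoryCache:
        """Toy illustration: task -> recorded tool calls, replayed when the environment still matches."""

        def __init__(self):
            self.trajectories: dict[str, list[tuple[Callable, tuple]]] = {}

        def run(self, task: str, agent: Callable, env_check: Callable[[], bool]) -> str:
            steps = self.trajectories.get(task)
            if steps and env_check():
                # Cache hit: deterministically replay the recorded tool calls, no LLM in the loop.
                for tool, args in steps:
                    tool(*args)
                return "replayed"
            # Cache miss or drifted environment: fall back to the agent and record its tool calls.
            recorded: list[tuple[Callable, tuple]] = []
            agent(task, record=lambda tool, *args: recorded.append((tool, args)))
            self.trajectories[task] = recorded
            return "agent"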

At Pig, we built computer-use agents for automating legacy Windows applications (healthcare, lending, manufacturing, etc).

A recurring theme we ran into was that businesses already had RPA (pure-software scripts), and it worked for them in most cases. The pull toward agents as an RPA alternative wasn't to get the infinitely flexible "AI employees" that tech Twitter/X might have you believe in, but simply because their RPA breaks on occasional edge cases and agents can gracefully handle those cases.

Using a pure-agent approach proved to be highly wasteful. Windows' accessibility APIs are poor, so you're generally stuck using pure-vision agents, which can run around $40/hr in token costs and take 5x longer than a human to perform a workflow. At that point, you're better off hiring a human.

The goal of Muscle Mem is to get LLMs out of the hot path of repetitive automations, intelligently swapping between script-based execution for repeat cases and agent-based execution for discovery and self-healing.

While inspired by computer-use environments, Muscle Mem is designed to generalize to any automation performing discrete tasks in dynamic environments. It took a great deal of thought to figure out an API that generalizes, which I cover more deeply in this blog: https://erikdunteman.com/blog/muscle-mem/

Check out the repo, consider giving it a star, or dive deeper into the above blog. I look forward to your feedback!

hackgician:
accessibility (a11y) trees are super helpful for LLMs; we use them extensively in stagehand! the context is nice for browsers, since you have existing frameworks like selenium/playwright/puppeteer for actually acting on nodes in the a11y tree.
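
For the browser case, a rough sketch with Playwright's Python sync API (note that accessibility.snapshot() is deprecated in newer Playwright versions, though still available):

    # dump the a11y tree of a page, then act on a node by role/name
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        tree = page.accessibility.snapshot()  # nested dict of {role, name, children, ...}
        print(tree)
        page.get_by_role("link", name="More information").click()  # act on the same node
        browser.close()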

what does that analog look like in more traditional computer use?

ctoth:
There are a variety of accessibility frameworks: MSAA (old, Windows-only), IA2, JAB, and UIA (newer). NVDA from NV Access has an abstraction over these APIs that standardizes gathering roles and other information across the matrix of a11y providers, though note the GPL license, depending on how you want to use it.
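
For example, a rough sketch of reading the UIA tree from Python via pywinauto (one of several wrappers over UIA/MSAA; assumes a Notepad window is already open):

    from pywinauto import Desktop

    # attach to an existing window through the UIA backend
    win = Desktop(backend="uia").window(title_re=".*Notepad")
    win.print_control_identifiers(depth=2)  # dump roles/names of child controls
    win.child_window(title="File", control_type="MenuItem").click_input()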
edunteman (OP):
Our experience working with a11y APIs like the ones above is that data is frequently missing, and the APIs can be shockingly slow to read from. The highest-performing agents in WindowsArena use a mixture of a11y and YOLO-like grounding models such as OmniParser, with a11y seemingly shifting out of vogue in favor of computer vision, due to the incomplete context it provides.

Talking with users who just write their own RPA, the most-loved API for doing so was consistently https://github.com/asweigart/pyautogui. The platform does offer a11y APIs, but they're messy enough that many of the teams I talked to used the pyautogui.locateOnScreen('button.png') fuzzy image-matching feature instead.
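
For reference, that style of automation looks roughly like this (button.png is a screenshot crop of the target; confidence-based matching requires opencv-python):

    import pyautogui

    try:
        # depending on the pyautogui version, a miss either returns None or raises
        loc = pyautogui.locateCenterOnScreen("button.png", confidence=0.9)
    except pyautogui.ImageNotFoundException:
        loc = None

    if loc:
        pyautogui.click(loc.x, loc.y)  # loc is a Point(x, y)
    else:
        print("button not found")  # the kind of edge case where an agent fallback kicks in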