
223 points by edunteman | 1 comment

Hi HN! Erik here from Pig.dev, and today I'd like to share a new project we've just open-sourced:

Muscle Mem is an SDK that records your agent's tool-calling patterns as it solves tasks, and will deterministically replay those learned trajectories whenever the task is encountered again, falling back to agent mode if edge cases are detected. Like a JIT compiler, for behaviors.
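To make that concrete, here's a minimal sketch of the record/replay/fallback loop in Python. All names here (Engine, checks_pass, call_tool, etc.) are hypothetical, illustrating the pattern rather than Muscle Mem's actual API:

    # Hypothetical sketch of the record/replay pattern -- not Muscle Mem's real API.
    class Engine:
        def __init__(self, agent, cache):
            self.agent = agent  # LLM-driven fallback for cache misses
            self.cache = cache  # learned trajectories, keyed by task

        def run(self, task, env):
            trajectory = self.cache.get(task)
            if trajectory is not None and trajectory.checks_pass(env):
                # Cache hit: replay the recorded tool calls deterministically.
                for tool, args in trajectory.steps:
                    env.call_tool(tool, **args)
            else:
                # Cache miss or a failed environment check: fall back to the
                # agent, and record its steps so next time is a replay.
                steps = self.agent.solve(task, env)
                self.cache.put(task, steps)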

At Pig, we built computer-use agents for automating legacy Windows applications (healthcare, lending, manufacturing, etc).

A recurring theme we ran into was that businesses already had RPA (pure-software scripts), and it worked for them in most cases. The pull toward agents as an RPA alternative was not to have infinitely flexible "AI Employees," as tech Twitter/X might have you think, but simply because their RPA breaks on occasional edge cases, and agents can gracefully handle those cases.

Using a pure-agent approach proved to be highly wasteful. Windows' accessibility APIs are poor, so you're generally stuck using pure-vision agents, which can run around $40/hr in token costs and take 5x longer than a human to perform a workflow. At that point, you're better off hiring a human.

The goal of Muscle Mem is to get LLMs out of the hot path of repetitive automations, intelligently swapping between script-based execution for repeat cases and agent-based execution for discovery and self-healing.

While inspired by computer-use environments, Muscle Mem is designed to generalize to any automation that performs discrete tasks in dynamic environments. It took a great deal of thought to figure out an API that generalizes, which I cover more deeply in this blog post: https://erikdunteman.com/blog/muscle-mem/

Check out the repo, consider giving it a star, or dive deeper into the above blog. I look forward to your feedback!

joshstrange:
This is a neat idea and is similar to something I've been turning over in my head. LLMs are very powerful for taking a bunch of disparate tools/information/etc. and generating good results, but speed is a big issue, as is reproducibility.

I keep imagining an Agent that writes a bunch of custom tools when it needs them and "saves" them for later use, creating pipelines in code/config that it can reuse instead of solving from zero each time.

Essentially, I want to use LLMs for what they're good at (edge cases, fuzzy instructions/data) and have them turn around and write reusable tools, so that the next time the system doesn't have to run the full LLM: a tiny LLM router up front can determine whether a tool already exists for the job. I'm not talking about MCP (though that is cool); this would use MCP tools, but it could make new ones from the existing ones.
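A minimal sketch of that routing step (all names are hypothetical stand-ins, not any real library):

    # Hypothetical router: try a saved tool first, fall back to the frontier model.
    def handle(request, saved_tools, small_llm, frontier_agent, distill):
        # A small, cheap model matches the request against known tool names.
        name = small_llm.classify(request, candidates=list(saved_tools))
        if name in saved_tools:
            return saved_tools[name](request)  # reuse an existing pipeline
        # No match: run the full agent, then distill its transcript into a
        # reusable tool for next time.
        result, transcript = frontier_agent.solve(request)
        saved_tools[request] = distill(transcript)
        return result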

Here is an example.

Imagine I have an Agent with MCP tools to read/write to my email, calendar, ticketing system, and slack. I can ask the LLM to slack me every morning with an overview of my events for the day and anything outstanding I need to address. Maybe the first pass uses a frontier model to determine which tools to use, and it accomplishes this task. Once I'm happy with the output, the Agent feeds the conversation/tool calls into another LLM to distill them into a Python/Node/Bash/whatever script. That script would call the MCP tools to do the same thing and use small LLMs to glue the results together, and then it creates a cron (or similar) entry to have that run every morning.

I feel like this would remove a ton of the uncertainty around which tools an LLM uses, without requiring humans to write custom flows with a limited set of tools for each task.

So the first pass would be:

    User: Please check my email, calendar, and slack for what I need to focus on today.
    
    LLM: Tool Call: Read Unread Email
    
    LLM: Tool Call: Read last 7 days of emails the user replied to
    
    LLM: Tool Call: Read this week's events from calendar
    
    LLM: Tool Call: Read unread slack messages

    LLM: Tool Call: Read tickets in this sprint

    LLM: Tool Call: Read unread comments on tickets assigned to me
    
    LLM: Tool Call: Read slack conversations from yesterday
    
    LLM: Please use the following data to determine what the user needs to focus on today: <Inject context from tool calls>
    
    LLM: It looks like you have 3 meetings today at.....


Then a fresh LLM reviews that and writes a script to do all the tool calls and jump straight to the final "Please use the following data" prompt, which can be reused (cron'd or just called when it makes sense).
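For illustration, the distilled script might look something like this, mirroring the tool calls in the transcript above (the mcp client and small_llm interfaces are hypothetical):

    # Hypothetical distilled script: fixed tool calls, one cheap LLM call as glue.
    def morning_overview(mcp, small_llm):
        context = {
            "unread_email": mcp.call("email.read_unread"),
            "replied_email": mcp.call("email.read_replied", days=7),
            "calendar": mcp.call("calendar.read_week"),
            "slack_unread": mcp.call("slack.read_unread"),
            "sprint_tickets": mcp.call("tickets.read_sprint"),
            "ticket_comments": mcp.call("tickets.read_unread_comments"),
            "slack_yesterday": mcp.call("slack.read_conversations", days=1),
        }
        prompt = ("Please use the following data to determine what the user "
                  f"needs to focus on today: {context}")
        return small_llm.complete(prompt)

A cron entry then runs morning_overview every morning, with no frontier model in the loop.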

I might be way off-base, and I don't work in this space (I just play around the edges), but this feels like a way to let agents "learn" and grow. I've just found that in practice you don't get good results from throwing all your tools at one big LLM with your prompt; you're better off limiting the tools and even creating compound tools for certain jobs you do over and over. Lots of little tool calls add up and take a long time, so a way for the agent to dynamically create tools by combining other tools seems like a huge win.

edunteman:
This is very similar to what Voyager did: https://arxiv.org/abs/2305.16291

Their implementation uses actual code, JS scripts in their case, as the stored trajectories, which has the neat feature of parameterization built in, so trajectories are more reusable.
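Sketched in Python for contrast (both snippets are hypothetical illustrations, not either project's real format), the difference is roughly a parameterized script versus literal recorded steps:

    # Voyager-style: the stored trajectory is generated code, with parameters
    # exposed by the LLM, so one skill generalizes across similar tasks.
    def send_invoice(env, customer_id, amount):
        env.call_tool("crm.open_customer", id=customer_id)
        env.call_tool("billing.create_invoice", amount=amount)
        env.call_tool("email.send", template="invoice")

    # Muscle Mem-style: the stored trajectory is literal recorded steps,
    # replayed verbatim -- deterministic, but less reusable across variations.
    trajectory = [
        ("crm.open_customer", {"id": "cust_42"}),
        ("billing.create_invoice", {"amount": 99.0}),
        ("email.send", {"template": "invoice"}),
    ]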

I experimented with this for a bit for Muscle Mem, but a trajectory performed by just-in-time generated scripts felt too magical and Wild West. An explicit goal of Muscle Mem is to be a deterministic system, more like a DB, on which you as a user can layer as much nondeterminism as you feel comfortable with.