You can start a Mastra project with `npm create mastra` and create workflow graphs that can suspend/resume, build a RAG pipeline and write evals, give agents memory, create multi-agent workflows, and view it all in a local playground.
Previously, we built Gatsby, the open-source React web framework. Later, we worked on an AI-powered CRM but it felt like we were having to roll all the AI bits (agentic workflows, evals, RAG) ourselves. We also noticed our friends building AI applications suffering from long iteration cycles: they were getting stuck debugging prompts, figuring out why their agents called (or didn’t call) tools, and writing lots of custom memory retrieval logic.
At some point we just looked at each other and were like, why aren't we trying to make this part easier, and decided to work on Mastra.
Demo video: https://www.youtube.com/watch?v=8o_Ejbcw5s8
One thing we heard from folks is that seeing the input/output of every step, of every run of every workflow, is very useful. So we took XState and built a workflow graph primitive on top with OTel tracing. We wrote the APIs to make control flow explicit: `.step()` for branching, `.then()` for chaining, and `.after()` for merging. We also added `.suspend()`/`.resume()` for human-in-the-loop.
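To make that concrete, here's a rough sketch of the control flow. This is a minimal sketch: the import paths, Step options, and the execute/suspend signatures here are assumptions based on the docs and may not match the current API exactly.

import { Workflow, Step } from "@mastra/core/workflows";
import { z } from "zod";

// Illustrative steps -- shapes are assumptions, check the docs for exact signatures.
const draft = new Step({
  id: "draft",
  execute: async ({ context }) => ({ text: `Draft about ${context.triggerData.topic}` }),
});

const humanReview = new Step({
  id: "humanReview",
  execute: async ({ suspend }) => {
    await suspend(); // pauses the run here until someone calls .resume() on it
  },
});

const publish = new Step({
  id: "publish",
  execute: async () => ({ published: true }),
});

export const publishWorkflow = new Workflow({
  name: "publish",
  triggerSchema: z.object({ topic: z.string() }),
})
  .step(draft)        // start a branch
  .then(humanReview)  // chain within that branch
  .then(publish)
  // .after(someStep) would start another branch that merges back in
  .commit();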
We abstracted the main RAG verbs like `.chunk()`, `.embed()`, `.upsert()`, `.query()`, and `.rerank()` across document types and vector DBs. We shipped an eval runner with evals like completeness and relevance, plus the ability to write your own.
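In practice a pipeline built from those verbs reads roughly like this. This is a sketch: `MDocument` and the chunking options are assumptions from the docs, and the embedding call uses the AI SDK directly.

import { MDocument } from "@mastra/rag";  // name assumed from the docs
import { embedMany } from "ai";
import { openai } from "@ai-sdk/openai";

const doc = MDocument.fromText("Mastra is a TypeScript AI framework. ...");
const chunks = await doc.chunk({ strategy: "recursive", size: 512 });  // .chunk() -- options are assumptions

const { embeddings } = await embedMany({                               // .embed()
  model: openai.embedding("text-embedding-3-small"),
  values: chunks.map((chunk) => chunk.text),
});

// .upsert() writes the { embedding, text } pairs into whichever vector store
// adapter you use (pgvector, Pinecone, ...); .query() embeds the question and
// pulls the topK nearest chunks; .rerank() reorders those results before they
// go into the prompt.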
Then we read the MemGPT paper and implemented agent memory on top of AI SDK with a `lastMessages` key, `topK` retrieval, and a `messageRange` for surrounding context (think `grep -C`).
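As a sketch of what that configuration looks like (the key names come from above; the exact constructor shape and option nesting are assumptions):

import { Memory } from "@mastra/memory";
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";

// Keep the last 20 messages verbatim; for older history, retrieve the 3 most
// similar messages plus 2 messages of surrounding context on each side
// (the `grep -C` analogy). Option nesting is an assumption.
const memory = new Memory({
  options: {
    lastMessages: 20,
    semanticRecall: { topK: 3, messageRange: 2 },
  },
});

export const supportAgent = new Agent({
  name: "Support Agent",
  instructions: "You are a helpful support assistant.",
  model: openai("gpt-4o-mini"),
  memory,
});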
But we still weren’t sure whether our agents were behaving as expected, so we built a local dev playground that lets you curl agents/workflows, chat with agents, view evals and traces across runs, and iterate on prompts with an assistant. The playground uses a local storage layer powered by libsql (thanks Turso team!) and runs on localhost with `npm run dev` (no Docker).
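For reference, hitting that local server from code looks something like this. The port, route, and request body shape below are all assumptions; check the dev server output and the docs for the real ones.

// The playground serves agents over plain HTTP on localhost once `npm run dev`
// is running. Port 4111 and this route/body shape are assumptions.
const res = await fetch("http://localhost:4111/api/agents/myAgent/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: ["What can you do?"] }),
});
console.log(await res.json());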
Mastra agents originally ran inside a Next.js app. But we noticed that AI teams’ development was increasingly decoupled from the rest of their organization, so we built Mastra so that you can also run it as a standalone endpoint or service.
Some things people have been building so far: one user automates support for an iOS app he owns with tens of thousands of paying users. Another bundled Mastra inside an Electron app that ingests aerospace PDFs and outputs CAD diagrams. Another is building WhatsApp bots that let you chat with objects like your house.
We did (for now) adopt an Elastic v2 license. The agent space is pretty new, and we wanted to let users do whatever they want with Mastra but prevent, e.g., AWS from grabbing it.
If you want to get started:

- On npm: `npm create mastra@latest`
- GitHub repo: https://github.com/mastra-ai/mastra
- Demo video: https://www.youtube.com/watch?v=8o_Ejbcw5s8
- Our website homepage: https://mastra.ai (includes some nice diagrams and code samples on agents, RAG, and links to examples)
- And our docs: https://mastra.ai/docs
Excited to share Mastra with everyone here – let us know what you think!
We think about evals a bit like perf monitoring -- it's good to have RUM but also good to have some synthetic stuff in your CI. So if you do find them valuable, it's useful to do both.
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";

export const myAgent = new Agent({
  name: "My Agent",
  instructions: "You are a helpful assistant.",
  model: openai("gpt-4o-mini"),
});
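Tying that back to the eval comment above: a minimal "synthetic" check for an agent like this, runnable in CI with a plain test assertion rather than Mastra's eval runner, might look like the following (the test runner, import path, and prompt are just examples):

import { describe, it, expect } from "vitest";
import { myAgent } from "./my-agent"; // the agent defined above; path is illustrative

describe("synthetic eval", () => {
  it("responds to a canned prompt", async () => {
    const result = await myAgent.generate("Summarize: the sky is blue.");
    // Cheap smoke check; real synthetic evals would score relevance,
    // completeness, etc. against a small golden set.
    expect(result.text.length).toBeGreaterThan(0);
  });
});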
Do you think that a lot of these components like observability and evals will eventually be consumed by either providers (like OpenAI) or an orchestration framework like Mastra (when using multiple providers, though even if you're using just one provider for many tasks I can see it belonging to the orchestration framework)?
We hear a lot from people who are outgrowing the voice agent platforms and moving to something like pipecat (in Python), and we'd love to be the JS option.
- https://mastra.ai/docs/agents/01-agent-memory
- https://blog.langchain.dev/langmem-sdk-launch/
- https://help.getzep.com/concepts#adding-memory
not sure where all this is leading yet but glad people are exploring.
https://js.langchain.com/docs/introduction/
https://www.vellum.ai/products/workflows-sdk
https://github.com/transitive-bullshit/agentic
which is not to say any of them got it right or wrong, but it is by no means "missing". the big question w all of them is do they deliver enough value to last. kudos to those who at least try, of course
imho getting some sort of hierarchical memory is conceptually fairly straightforward; the tricky part is having the storage and vector DB pieces well integrated so that the APIs are clean
Also the team is top-notch — Sam was my co-founder at Gatsby and I worked closely with Shane and Abhi and I have a ton of confidence in their product & engineering abilities.
Can anyone please give me a use case that couldn’t be solved with a single API call to a modern LLM (capable of multi-step planning/reasoning) and a proper prompt?
Or is this really just about building the prompt, and giving the LLM closer guidance by splitting into multiple calls?
I’m specifically not asking about function calling.
Additionally, in self-hosted environments, using an agent-based approach can be more cost-effective. Simpler or less computationally intensive tasks can be offloaded to smaller models, which not only reduces costs but also improves response times.
That being said, this approach is most effective when dealing with structured workflows that can be logically decomposed. In more open-ended tasks, such as "build me an app," the results can be inconsistent unless the task is well-scoped or has extensive precedent (e.g., generating a simple Pong clone). In such cases, additional oversight and iterative refinement are often necessary.
"Aider now has experimental support for using two models to complete each coding task:
An Architect model is asked to describe how to solve the coding problem.
An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.
Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars)."
In particular, recent discord chat suggests that o3m is the most effective architect and Claude Sonnet is the most effective code editor.
I found `mcp-proxy-server` [0] which seemed like it would do what I want but I ran into multiple problems. I added some minor debug logging to it and the ball sort of rolled downhill from there. Now it's more my code than what was there originally but I have tool proxying working for multiple clients (respecting sessionIds, etc) and I think I've solved most all the issues I've run into and added features like optional tool prefixing so there isn't overlap between MCP servers.
Given what I know now, I don't think N-to-1 is quite as useful as I thought. Or rather, it really depends on your "client". If you can toggle on/off tools in your client then it's not a big problem, but sometimes you don't want "all" the tools, and if your client only allows toggling per MCP server then you will have an issue.
I love the ideas of workflows and how you have defined agents. I think my current issue is almost too many tools, and the LLM sometimes gets confused over which ones to use. I'm especially thrilled with the HTTP endpoints you expose for the agents. My main MCP server (my custom tools I wrote, vs the third-party ones) exposes an HTTP GUI for calling the tools (faster iteration vs trying it through LLMs) and I've been using that and 3rd-party chat clients (LibreChat and OpenWebUI) as my "LLM testing" platform (because I wasn't aware of a better option), but neither of those tools lets you "re-expose" the agents via an API.
All in all I'm coming to the conclusion that 90% of MCP servers out there are really cool for seeing what's possible, but it's probably best to write your own tools/MCP since most all MCP servers are just thin wrappers around an API. Also, it's so easy to create an MCP server that they are popping up all over the place, often of low quality (don't fully implement the API, take shortcuts for the author's use case, etc). Using LLMs to write the "glue" code from API->Tool is fairly minor and I think is worth "owning". To sum that all up: I think my usage of 3rd party MCP servers is going to trend towards 0 as I "assimilate" MCP servers into my own codebase for more control, but I really like MCP as a way to vend tools to various different LLM clients/tools.
https://docs.mcp.run/tutorials/mcpx-mastra-ts
you don't even need to use SSE, as mcp.run brings the tools directly to your agent, in-process, as secure wasm modules.
mcp.run does have SSE support for all its servlet tools in the registry though too.
Here is an example-- I highlight physical books as I read them with a red pen. Sometimes my highlights are underlines, sometimes I bracket relevant text. I also write some comments in the margins.
I want to photograph relevant pages and get the highlights and my comments into plain text. If I send an image of a highlighted/commented page to ChatGPT and ask to get everything into plain text, it doesn't work. It's just not smart enough to do it in one prompt. So, you have to do it in steps. First you ask for the comments. Then for underlined highlights. Then for bracketed highlights. Then you merge the output. Empirically, this produces much better results. (This is a really simple example; but imagine you add summarization or something, then the steps feed into each other)
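Here's a rough sketch of that decomposition with the Vercel AI SDK; the model, prompts, and file path are placeholders.

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { readFileSync } from "node:fs";

const page = readFileSync("page-042.jpg"); // photo of the annotated page (example path)

// One focused call per annotation type, then merge -- rather than asking
// for everything in a single prompt.
async function extract(instruction: string) {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    messages: [{
      role: "user",
      content: [
        { type: "text", text: instruction },
        { type: "image", image: page },
      ],
    }],
  });
  return text;
}

const comments = await extract("Transcribe only the handwritten margin comments.");
const underlines = await extract("Transcribe only the underlined passages.");
const brackets = await extract("Transcribe only the bracketed passages.");

const merged = [comments, underlines, brackets].join("\n\n");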
As these things get complicated, you start bumping into repeated problems (like understanding what's happening between each step, tweaking prompts, etc.) Having a library with some nice tooling can help with those. It's not especially magical and nothing you couldn't do yourself. But you also could write Datadog or Splunk yourself. It's just convenient not to.
The internet decided to call these types of programs agents, which confuses engineers like you (and me) who tend to think concretely. But if you get past that word, and maybe write an example app or something, I promise these things will make sense.
You got it. But that is the interesting part! To make AI useful beyond basic content generation in a chat context, you need interaction with the outside world. And you may need iterative workflows that can spawn more work based on the output of those interactions. The focus on Agents as personas is a tangent to the core use case. We could just call this stuff "AI Workflow Orchestration" or something ... and it would remain pretty useful!
MCP is super cool and I've loved playing with it, but playing with it is all I'm doing. I'm working on some tools to use in my $dayJob and also just using it as an excuse to learn about LLMs and play with new tech. Most of my work is writing tools that connect out to our distributed fleet of servers to collect data, run commands, etc. My goal is to build a SlackOps-type bot that can provide extra context about errors we get in Slack (pull the latest commits/PRs around that code, link to the current deployed version, provide all the logs for the request that threw an error, check system stats, etc). And while I have tools written to do all of that, I'm still working on bringing it all together in something more than a bot I can invoke from Slack and make MCP calls.
All that to say, I'm not a professional user of MCP/Mastra and my opinion is probably not one you want shaping your framework.
However, the Gatsby CMS had a couple of things that were really interesting about it - especially runtime type safety through GraphQL and doing headless WordPress.
Anthropic does a good job of breaking down some common architecture around using these components [1] (good outline of this if you prefer video [2]).
"Agent" is definitely an overloaded term - the best framing of this I've seen aligns more closely with the Anthropic definition. Specifically, an "agent" is a GenAI system that dynamically identifies the tasks ("steps" from the parent comment) without having to be instructed that those are the steps. There are obvious parallels to the reasoning capabilities that we've seen released in the latest cut of the foundation models.
So for example, the "Agent" would first build a plan for how to address the query, dynamically farm out the steps in that plan to other LLM calls, and then evaluate execution for correctness/success.
[1] https://www.anthropic.com/research/building-effective-agents [2] https://www.youtube.com/watch?v=pGdZ2SnrKFU
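For a concrete feel of that plan → delegate → evaluate loop described above, here's a generic sketch (not Mastra- or Anthropic-specific; the models, prompts, and variable names are illustrative):

import { generateObject, generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const userQuery = "Plan a weekend trip to Lisbon on a budget"; // example input

// 1. Ask the model to decompose the query into steps on its own.
const { object: plan } = await generateObject({
  model: openai("gpt-4o"),
  schema: z.object({ steps: z.array(z.string()) }),
  prompt: `Break this request into a short ordered list of steps: ${userQuery}`,
});

// 2. Farm each step out to its own LLM call (could be different models or tools).
const results: string[] = [];
for (const step of plan.steps) {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: `Previous results:\n${results.join("\n")}\n\nNow do: ${step}`,
  });
  results.push(text);
}

// 3. Evaluate the step outputs and assemble the final answer.
const { text: answer } = await generateText({
  model: openai("gpt-4o"),
  prompt: `Check these step results for correctness and produce a final answer:\n${results.join("\n")}`,
});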
This is like 5% of the work. The developer needs to fill in the other 95%, which involves a lot more things that are strictly outside the scope of the framework.
And here’s a fun exercise: ask Claude via Cursor or Perplexity with R1 to create a basic agentic framework for you in your language of choice on top of Instructor.
> Really languages that can model state in sane ways and have a good concurrency story like Elixir make much more sense.
Can you expand on this? Curious why JS state modelling falls short here and what's wrong with the concurrency model in JS for agents.

Agree, that's why I've been building this: https://github.com/agentjido/jido
I use their AI SDK, but never touch vercel servers. It's just a unified interface.
Project looks great, will follow & learn.
Just like in real life, there are generalists and experts. Depending on your task you might prefer an expert over a generalist; think e.g. brain surgery versus "summarize this text".
I would argue that python is the overrated language when it comes to building agents. Just because it's the language of choice for training models doesn't mean it should be for building apps against them.
The dx typescript brings to these types of applications is nice.
testWorkflow
  .step(llm)
    .then(decider)
      .then(agentOne)
        .then(workflow)
  .after(decider)
    .then(agentTwo)
      .then(workflow)
  .commit();
At first glance, this looks like a very awkward way to represent the graph from the picture. And this is just a simple "workflow" (the structure of the graph does not depend on the results of the execution), not an agent.

How would you simplify this?
2. Related to the previous question: since this is Node-based, is it possible to support WebSockets?
It's telling that the example relies on arbitrary indentation (which a linter will get rid of) to have some hope of comprehending it
Possibly this was all motivated by a desire to avoid nested structures above all?
But for a branching graph a nested structure is more natural. It'd also probably be nicer if the methods were on the task nodes instead of on the workflow; then you could avoid the 'step'/'then' distinction and have something like:
e.g.
testWorkflow(
  llm
    .then(decider)
    .then(
      agentOne.then(workflow),
      agentTwo.then(workflow),
    )
)
Here’s how it would look in my system:
new Flow<string>()
  .step("llm", llmStepHandler)
  .step("decider", ["llm"], deciderStepHandler)
  .step("agentOne", ["decider"], agentOneStepHandler)
  .step("agentTwo", ["decider"], agentTwoStepHandler)
  .step("workflow", ["agentOne", "agentTwo"], workflowStepHandler);
Mine is a DAG, so more constrained than the cyclic graph Mastra supports (if I understand correctly).

I built my own TypeScript AI platform https://typedai.dev with an extensive feature list where I've kept iterating on what I find the most ergonomic way to develop, using standard constructs as much as possible. I've coded enough Java streams, RxJS chains, and JavaScript callbacks and Promise chains to know what kind of code I like to read and debug.
I was having a peek at xstate, but after I came across https://docs.dbos.dev/ here recently I'm pretty sure that's the path I'll go down for durable execution, to keep building everything with a simple programming model.
I don't think from first principles there's any broad framework that makes sense to be honest. I'll reach for a specific vector DB, or logging library, but beyond that you'll never convince me your "query-builder" API is going to make me build a better thing when I have the full power of TypeScript already.
Especially when these products start throwing in proprietary features and add-ons with fancy names on top.
> The dx typescript brings to these types of applications is nice.
Ironically, it only gets halfway there. What I've found is that teams that want TS probably should just move up to C#; they are close enough [0]. The main thing is that once you start to get serious with your backend API, data integrity matters. TS types disappear at runtime and it's just JS. So you need a Zod or Valibot to validate the incoming data. Then your API starts getting bigger and you want to generate OpenAPI for your frontend. Now your fast and easy Node/Express app is looking a lot like Spring or .NET...without the headway and perf...the irony.
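To make the runtime-validation point concrete, a minimal Zod-in-Express sketch (the route and schema here are just examples):

import express from "express";
import { z } from "zod";

const app = express();
app.use(express.json());

// TS types vanish at runtime, so the request body still has to be checked.
const CreateUser = z.object({
  email: z.string().email(),
  age: z.number().int().min(0),
});

app.post("/users", (req, res) => {
  const parsed = CreateUser.safeParse(req.body);
  if (!parsed.success) {
    return res.status(400).json({ errors: parsed.error.flatten() });
  }
  // parsed.data is now typed as { email: string; age: number }
  res.status(201).json(parsed.data);
});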
Personally I am not fond of the decorator approach and decided to not use it in pgflow (my soon-to-be-released workflow orchestration engine on top of Postgres).
1. I wanted it to be simple to reason about and explicit (being more verbose as a trade-off)
2. There are some issues with supporting decorators (Svelte https://github.com/sveltejs/svelte/issues/11502, and a lot of others).
3. I decided to only support directed acyclic graphs (no loops!) in order to promote simplicity. Will be supporting conditional recursive sub-workflows to provide a way to repeat some steps and be able to branch.
Cheers!
Calling that "lock in" is a stretch, but you're free to write everything from scratch if that's the way you roll.
essentially you're building a DAG so it could be worth checking some other APIs which do a similar thing for inspiration
e.g. it looks like in Airflow you could write it as:
chain(llm, decider, [agentOne, agentTwo], workflow)
https://airflow.apache.org/docs/apache-airflow/stable/core-c...