    433 points by calcsam | 17 comments

    Hi HN, we’re Sam, Shane, and Abhi, and we’re building Mastra (https://mastra.ai), an open-source JavaScript SDK for building agents on top of Vercel’s AI SDK.

    You can start a Mastra project with `npm create mastra` and create workflow graphs that can suspend/resume, build a RAG pipeline and write evals, give agents memory, create multi-agent workflows, and view it all in a local playground.
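
    To make that concrete, here's roughly what defining an agent looks like (a simplified sketch; the exact import paths and option names may differ a bit from what the template generates, so treat the docs as the source of truth):

        // agent.ts -- illustrative sketch; check the generated project for the exact shape
        import { Agent } from "@mastra/core/agent";
        import { openai } from "@ai-sdk/openai"; // models come from the AI SDK's providers

        export const supportAgent = new Agent({
          name: "support-agent",
          instructions: "You answer support questions for our iOS app.",
          model: openai("gpt-4o-mini"),
        });

        // elsewhere: const reply = await supportAgent.generate("How do I reset my password?");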

    Previously, we built Gatsby, the open-source React web framework. Later, we worked on an AI-powered CRM but it felt like we were having to roll all the AI bits (agentic workflows, evals, RAG) ourselves. We also noticed our friends building AI applications suffering from long iteration cycles: they were getting stuck debugging prompts, figuring out why their agents called (or didn’t call) tools, and writing lots of custom memory retrieval logic.

    At some point we just looked at each other and were like, why aren't we trying to make this part easier, and decided to work on Mastra.

    Demo video: https://www.youtube.com/watch?v=8o_Ejbcw5s8

    One thing we heard from folks is that seeing the input/output of every step, of every run of every workflow, is very useful. So we took XState and built a workflow graph primitive on top with OTel tracing. We wrote the APIs to make control flow explicit: `.step()` for branching, `.then()` for chaining, and `.after()` for merging. We also added `.suspend()`/`.resume()` for human-in-the-loop.
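
    A rough sketch of what that control flow looks like in code (the step and workflow option shapes here are simplified and illustrative, not copied from the docs):

        // Illustrative sketch -- step definitions and option shapes are simplified.
        import { Workflow, Step } from "@mastra/core";

        const fetchOrder = new Step({ id: "fetchOrder", execute: async () => ({ status: "shipped" }) });
        const notifyUser = new Step({ id: "notifyUser", execute: async () => ({ sent: true }) });
        const checkStock = new Step({ id: "checkStock", execute: async () => ({ inStock: true }) });
        const escalate = new Step({
          id: "escalate",
          execute: async ({ suspend }) => {
            await suspend(); // pause for human-in-the-loop; a later .resume() picks up from here
          },
        });

        const orderFlow = new Workflow({ name: "order-follow-up" })
          .step(fetchOrder)                  // one branch
          .then(notifyUser)                  // chained after fetchOrder
          .step(checkStock)                  // a second, parallel branch
          .after([notifyUser, checkStock])   // merge: wait for both branches
          .step(escalate)                    // may suspend for a human, then resume
          .commit();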

    We abstracted the main RAG verbs like `.chunk()`, `embed()`, `.upsert()`, `.query()`, and `rerank()` across document types and vector DBs. We shipped an eval runner with evals like completeness and relevance, plus the ability to write your own.
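
    Here's a sketch of what sits behind those verbs, using the AI SDK for the embedding step (the chunker below is a naive stand-in and the vector-store wiring is elided, so read it as shape rather than exact signatures):

        // Sketch only: a naive stand-in for .chunk(), a real embed() call, and the
        // vector-store verbs (.upsert()/.query()/rerank()) left as comments.
        import { embedMany } from "ai";
        import { openai } from "@ai-sdk/openai";

        const article = "…your docs, support articles, parsed PDFs…";

        // .chunk(): split the document into retrievable pieces (Mastra's chunker is format-aware)
        const chunk = (text: string, size = 800): string[] => {
          const out: string[] = [];
          for (let i = 0; i < text.length; i += size) out.push(text.slice(i, i + size));
          return out;
        };
        const chunks = chunk(article);

        // embed(): one vector per chunk, via the AI SDK
        const { embeddings } = await embedMany({
          model: openai.embedding("text-embedding-3-small"),
          values: chunks,
        });

        // .upsert(): write { chunk, embedding } pairs into whichever vector DB you use
        // .query(): embed the user's question and fetch the top-k nearest chunks
        // rerank(): reorder those hits before they go into the agent's prompt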

    Then we read the MemGPT paper and implemented agent memory on top of AI SDK with a `lastMessages` key, `topK` retrieval, and a `messageRange` for surrounding context (think `grep -C`).
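
    Config-wise those knobs look roughly like this (the three keys are the ones above; exactly how they nest in the real config may differ):

        // Illustrative shape only -- the key names are the ones described above.
        const memoryOptions = {
          lastMessages: 20, // always keep the last 20 messages in the prompt verbatim
          topK: 4,          // retrieve the 4 most semantically similar older messages
          messageRange: 2,  // ...plus 2 surrounding messages for each hit, like `grep -C 2`
        };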

    But we still weren’t sure whether our agents were behaving as expected, so we built a local dev playground that lets you curl agents/workflows, chat with agents, view evals and traces across runs, and iterate on prompts with an assistant. The playground uses a local storage layer powered by libsql (thanks Turso team!) and runs on localhost with `npm run dev` (no Docker).
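
    Once the dev server is up, agents are plain HTTP endpoints, so you can hit them from anywhere; something along these lines (the port and path below are placeholders; the dev server prints the real URL):

        // Hypothetical endpoint -- copy the actual URL from the `npm run dev` output.
        const res = await fetch("http://localhost:4111/api/agents/support-agent/generate", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ messages: [{ role: "user", content: "Where is my order?" }] }),
        });
        console.log(await res.json());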

    Mastra agents originally ran inside a Next.js app. But we noticed that AI teams’ development was increasingly decoupled from the rest of their organization, so we made sure Mastra can also run as a standalone endpoint or service.

    Some things people have been building so far: one user automates support for an iOS app he owns with tens of thousands of paying users. Another bundled Mastra inside an Electron app that ingests aerospace PDFs and outputs CAD diagrams. Another is building WhatsApp bots that let you chat with objects like your house.

    We did (for now) adopt the Elastic License v2. The agent space is pretty new, and we wanted to let users do whatever they want with Mastra but prevent, e.g., AWS from grabbing it.

    If you want to get started:

    - On npm: npm create mastra@latest
    - GitHub repo: https://github.com/mastra-ai/mastra
    - Demo video: https://www.youtube.com/watch?v=8o_Ejbcw5s8
    - Our website homepage: https://mastra.ai (includes some nice diagrams and code samples on agents, RAG, and links to examples)
    - And our docs: https://mastra.ai/docs

    Excited to share Mastra with everyone here – let us know what you think!

    1. brap No.43106216
    I don’t really understand agents. I just don’t get why we need to pretend we have multiple personalities, especially when they’re all using the same model.

    Can anyone please give me a usecase, that couldn’t be solved with a single API call to a modern LLM (capable of multi-step planning/reasoning) and a proper prompt?

    Or is this really just about building the prompt, and giving the LLM closer guidance by splitting into multiple calls?

    I’m specifically not asking about function calling.

    replies(9): >>43106401 #>>43106499 #>>43106505 #>>43106535 #>>43106552 #>>43106679 #>>43106770 #>>43107749 #>>43111518 #
    2. 2pointsomone No.43106401
    I don't work in prompt engineering, but my partner does, and she tells me there are numerous needs for agents: cases where you want something that goes and seeks things out on the live web, comes back so you can make sense of the found data with the LLM and pre-written prompts (using that data as variables), and then possibly goes back to the web if the task remains unsolved.
    replies(1): >>43106452 #
    3. dimgl No.43106452
    Can't that be solved with regular workflow tools and prompts? Is that what an agent is, essentially?

    Or is an agent a collection of prompts with a limited set of available tools?

    replies(1): >>43117996 #
    4. blainm No.43106499
    One of the key limitations of even state-of-the-art LLMs is that their coherence and usefulness tend to degrade as the context window grows. In complex workflows, such as customer support automation or code review pipelines, breaking the process into smaller, well-defined tasks lets the model operate with more relevant and focused context at each step, improving reliability.

    Additionally, in self-hosted environments, using an agent-based approach can be more cost-effective. Simpler or less computationally intensive tasks can be offloaded to smaller models, which not only reduces costs but also improves response times.
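
    A sketch of that offloading idea in AI SDK terms (the model names are just stand-ins; in a self-hosted setup they'd point at your own small and large models):

        // Sketch: a cheap model for the easy, well-scoped step; the bigger model only where needed.
        import { generateText } from "ai";
        import { openai } from "@ai-sdk/openai";

        const ticket = "I was charged twice this month.";

        const { text: category } = await generateText({
          model: openai("gpt-4o-mini"), // small model: classification is cheap and fast
          prompt: `Classify this support ticket as "billing", "bug", or "other": ${ticket}`,
        });

        const { text: reply } = await generateText({
          model: openai("gpt-4o"), // larger model: drafting the actual response
          prompt: `Draft a reply to this ${category} ticket:\n${ticket}`,
        });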

    That being said, this approach is most effective when dealing with structured workflows that can be logically decomposed. In more open-ended tasks, such as "build me an app," the results can be inconsistent unless the task is well-scoped or has extensive precedent (e.g., generating a simple Pong clone). In such cases, additional oversight and iterative refinement are often necessary.

    5. weego No.43106505
    I don't get it either. Watching implementations on YouTube etc., it primarily feels like a load of verbiage trying to carve out a sub-industry, but the meat on the bone just seems to be defining discrete units of AI actions that can be chained into workflows that interact with non-AI services.
    replies(1): >>43106822 #
    6. bravura No.43106535
    https://aider.chat/2024/09/26/architect.html

    "Aider now has experimental support for using two models to complete each coding task:

    An Architect model is asked to describe how to solve the coding problem.

    An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.

    Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars)."

    In particular, recent Discord chat suggests that o3m is the most effective architect and Claude Sonnet is the most effective code editor.

    replies(1): >>43119041 #
    7. andrewmutz No.43106552
    Modularity. We could put all code in a single function (it is possible), but we prefer to organize it differently to make it easier to develop and reason about. Agents are similar.
    8. coffeemug No.43106679
    If you ignore the word "agent" and autocomplete it in your mind to "step", things will make more sense.

    Here is an example-- I highlight physical books as I read them with a red pen. Sometimes my highlights are underlines, sometimes I bracket relevant text. I also write some comments in the margins.

    I want to photograph relevant pages and get the highlights and my comments into plain text. If I send an image of a highlighted/commented page to ChatGPT and ask to get everything into plain text, it doesn't work. It's just not smart enough to do it in one prompt. So, you have to do it in steps. First you ask for the comments. Then for underlined highlights. Then for bracketed highlights. Then you merge the output. Empirically, this produces much better results. (This is a really simple example; but imagine you add summarization or something, then the steps feed into each other)
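
    In AI SDK terms, the passes could look something like this (the model, file, and prompts are all placeholders):

        // Sketch of the "one focused pass per kind of markup" idea -- everything here is a placeholder.
        import { readFileSync } from "node:fs";
        import { generateText } from "ai";
        import { openai } from "@ai-sdk/openai";

        const page = readFileSync("page-42.jpg");

        const transcribe = async (instruction: string) => {
          const { text } = await generateText({
            model: openai("gpt-4o"),
            messages: [{
              role: "user",
              content: [
                { type: "text", text: instruction },
                { type: "image", image: page },
              ],
            }],
          });
          return text;
        };

        // Separate, focused passes instead of one "do everything" prompt:
        const comments = await transcribe("Transcribe only the handwritten margin comments.");
        const underlines = await transcribe("Transcribe only the underlined passages.");
        const brackets = await transcribe("Transcribe only the passages marked with red brackets.");

        const merged = [comments, underlines, brackets].join("\n\n");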

    As these things get complicated, you start bumping into repeated problems (like understanding what's happening between each step, tweaking prompts, etc.). Having a library with some nice tooling can help with those. It's not especially magical and nothing you couldn't do yourself. But you could also write Datadog or Splunk yourself. It's just convenient not to.

    The internet decided to call these types of programs agents, which confuses engineers like you (and me) who tend to think concretely. But if you get past that word, and maybe write an example app or something, I promise these things will make sense.

    replies(1): >>43108095 #
    9. jacobr1 No.43106770
    One way to think about it is job orchestration. You end up with some kind of DAG of work to execute. If all the work you are doing is based on context from the initiation of the workflow, then theoretically you could do everything in a single prompt. But it gets more interesting when there is some kind of real-world interaction, potentially several: a web search, executing code, calling an API. Then you take action based on the result of that, which in turn might trigger another decision to take some other action, iteratively, and potentially branching.
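
    The AI SDK's tool-calling loop is one concrete way to express that iteration (the web-search tool here is a made-up stub):

        // Sketch of an iterative "act on the result" loop -- the search tool is a stub.
        import { generateText, tool } from "ai";
        import { openai } from "@ai-sdk/openai";
        import { z } from "zod";

        const { text } = await generateText({
          model: openai("gpt-4o"),
          maxSteps: 5, // let the model call tools, inspect results, and decide the next action
          tools: {
            webSearch: tool({
              description: "Search the web and return result snippets.",
              parameters: z.object({ query: z.string() }),
              execute: async ({ query }) => `stub results for: ${query}`, // a real impl calls a search API
            }),
          },
          prompt: "Find out who maintains the libsql project and summarize their latest release.",
        });
        console.log(text);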
    10. jacobr1 No.43106822
    > defining discrete units of AI actions that can be chained into workflows that interact with non-AI services.

    You got it. But that is the interesting part! To make AI useful beyond basic content generation in a chat context, you need interaction with the outside world. And you may need iterative workflows that can spawn more work based on the output of those interactions. The focus on Agents as personas is a tangent to the core use case. We could just call this stuff "AI Workflow Orchestration" or something ... and it would remain pretty useful!

    replies(1): >>43107864 #
    11. nsonha No.43107749
    Without checking out this particular framework: the word is sometimes overloaded with that meaning (LLM personality), but in software engineering "agent" generally means something with its own inner loop and branching logic (agent as in autonomy). It's a necessary abstraction when you compose multiple workflows together under the same LLM interface: things like which flow to run next, edge-case handling for each of them, etc.
    12. karn97 No.43107864
    I won't trust an agent with anything by itself in its current state, though.
    13. fryz No.43108095
    To add some color to this:

    Anthropic does a good job of breaking down some common architectures around these components [1] (there's a good outline of this in video form if you prefer [2]).

    "Agent" is definitely an overloaded term - the best framing I've seen aligns most closely with the Anthropic definition. Specifically, an "agent" is a GenAI system that dynamically identifies the tasks ("steps" from the parent comment) without having to be instructed that those are the steps. There are obvious parallels to the reasoning capabilities we've seen released in the latest cut of the foundation models.

    So for example, the "Agent" would first build a plan for how to address the query, dynamically farm out the steps in that plan to other LLM calls, and then evaluate execution for correctness/success.

    [1] https://www.anthropic.com/research/building-effective-agents [2] https://www.youtube.com/watch?v=pGdZ2SnrKFU

    replies(1): >>43108671 #
    14. eric-burel No.43108671
    This sums up as ranging from multiple LLM calls to build a smart feature, to letting the LLM decide what to do next. I think you can go very far with the former, but the latter is more autonomous in unconstrained environments (like chatting with a human, etc.).
    15. ToJans No.43111518
    AI seems to forget more things as the context window grows. Agents keep scope local and focused, so you can get better/faster results, or use models trained on specific tasks.

    Just like in real life, there are generalists and experts. Depending on your task you might prefer an expert over a generalist; think, e.g., brain surgery versus "summarize this text".

    16. 2pointsomone No.43117996
    I think the agent part is deciding how to navigate the web on its own and, once it is convinced it has found what it wanted (without you having told it specifically and deterministically), coming back to work with your prompts. You can't really hand-code that logic into a workflow.
    17. hassleblad23 No.43119041
    Now the next step is to have a Senior Editor and Editor pair :)