Most active commenters
  • TZubiri(17)
  • verdverm(9)
  • darkteflon(3)
  • OutOfHere(3)

129 points ericciarla | 69 comments
1. madrox ◴[] No.40712650[source]
I have a saying: "any sufficiently advanced agent is indistinguishable from a DSL"

If I'm really leaning into multi-tool use for anything resembling a mutation, then I'd like to see an execution plan first. In my experience, asking an AI to code up a script that calls some functions with the same signature as tools and then executing that script actually ends up being more accurate than asking it to internalize its algorithm. Plus, I can audit it before I run it. This is effectively the same as asking it to "think step by step."
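Roughly the shape I mean, as a sketch (the tool functions and generate_script() here are hypothetical placeholders for your own tools and whatever model API you call):

  # Sketch of the "generate a script, audit it, then run it" pattern.
  # The tools and generate_script() are made-up placeholders.
  import inspect

  def get_weather(city: str) -> str:
      """Read-only tool."""
      return f"Sunny in {city}"

  def send_email(to: str, body: str) -> None:
      """Mutating tool -- the reason we want to see the plan first."""
      print(f"Would send to {to!r}: {body!r}")

  TOOLS = {f.__name__: f for f in (get_weather, send_email)}

  def generate_script(task: str) -> str:
      """Placeholder for the model call: send the task plus the tool
      signatures, get back a plain Python script that only uses those tools."""
      signatures = "\n".join(f"{name}{inspect.signature(fn)}" for name, fn in TOOLS.items())
      raise NotImplementedError(f"call your LLM with the task and:\n{signatures}")

  def run(task: str) -> None:
      script = generate_script(task)
      print("=== proposed plan ===\n" + script)        # human audit happens here
      if input("execute? [y/N] ").strip().lower() == "y":
          exec(script, {"__builtins__": {}, **TOOLS})  # crude sandbox: only the tools are in scope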

I like the idea of Command R+ but multitool feels like barking up the wrong tree. Maybe my use cases are too myopic.

replies(7): >>40713594 #>>40713743 #>>40713985 #>>40714302 #>>40717871 #>>40718481 #>>40721499 #
2. laborcontract ◴[] No.40713024[source]
I really like the stuff coming from Cohere.

I know they're not considered the leader in the foundational model space, but their developer documentation is great, their api is really nice to use, and they have a set of products that really differentiate themselves from OpenAI and Anthropic and others. I'm rooting for the success of this company.

That said, we as an industry need to be moving away from langchain, not more deeply embedding ourselves in that monstrosity. It’s just way too much of its own thing now and you can totally start to see how the VC funding is shaping their incentives. They put everyone who uses it in a position of massive technical debt, create more abstractions like langgraph to lock people into their tools, and then create paid tools on top of it to solve the problems that they created (langsmith).

replies(3): >>40713141 #>>40713385 #>>40714751 #
3. walterbell ◴[] No.40713141[source]

  massive technical debt
  create more abstractions
  create paid tools.. to solve the problems that they created
Ouroboros worked so well for k8s!
replies(1): >>40714336 #
4. esafak ◴[] No.40713385[source]
Do you use another library instead?
replies(2): >>40714350 #>>40714608 #
5. darkteflon ◴[] No.40713594[source]
You mean manually pre-baking a DAG from the user query, then “spawning” other LLMs to resolve each node and pass their input up the graph? This is the approach we take too. It seems to be a sufficiently performant approach that is - intuitively - generically useful regardless of ontology / domain, but would love to hear others’ experiences.

It would be nice to know if this is sort of how OpenAI’s native “file_search” retriever works - that’s certainly the suggestion in some of the documentation but it hasn’t, to my knowledge, been confirmed.

replies(1): >>40713783 #
6. TZubiri ◴[] No.40713726[source]
I do this for a living, Ask Me Anything.

Before they were called tools, they were called function calls in ChatGPT.

Before that we had response_format = "json_object"

And even before that we were prompting with function signatures and asking it to output parameters.

replies(3): >>40715632 #>>40722512 #>>40723032 #
7. TZubiri ◴[] No.40713743[source]
I think you are imagining a scenario where you are using the LLM manually. Tools are designed to serve as a backend for other GPT-like products.

You don't have the capacity to "audit" stuff.

Furthermore, tool execution occurs not in the LLM but in the code that calls the LLM through the API. So whatever code executes the tool also orders the calling sequence graph. You don't need to audit it; you are calling it.

replies(1): >>40713878 #
8. TZubiri ◴[] No.40713783{3}[source]
No. The DAG should be "manually pre-baked" (defined at compile/design time).

At runtime you only parse the "user question" (the user prompt) into a start node and an end node, which is equivalent to a function call.

So the question

"What league does Messi play in?"

Is parsed by the llm as

League("Messi")

So if your DAG only contains the functions team(player) and league(team), you can still solve the question.

But the LLM isn't tasked with resolving the DAG; that's code. Let the LLM chill and do what it's good at, don't make it code a for loop for you.
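A toy sketch of what the code side looks like (functions and data are made up; the LLM's only output is the pair ("league", "Messi")):

  # The pre-baked graph: each function is an edge from one type to another.
  FUNCTIONS = {
      # name: (input type, output type, implementation)
      "team":   ("player", "team",   lambda p: {"Messi": "Inter Miami"}[p]),
      "league": ("team",   "league", lambda t: {"Inter Miami": "MLS"}[t]),
  }

  def solve(target_type: str, start_type: str, value):
      """Walk the graph from start_type to target_type, applying each edge."""
      current_type, current = start_type, value
      while current_type != target_type:
          for in_t, out_t, fn in FUNCTIONS.values():
              if in_t == current_type:
                  current_type, current = out_t, fn(current)
                  break
          else:
              raise ValueError(f"no path from {current_type} to {target_type}")
      return current

  # The LLM parsed "What league does Messi play in?" as league("Messi"),
  # even though there is no direct player -> league function:
  print(solve("league", "player", "Messi"))   # -> "MLS"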

replies(2): >>40713798 #>>40714040 #
9. ◴[] No.40713798{4}[source]
10. verdverm ◴[] No.40713878{3}[source]
People want to audit the args, mainly because of the potential for destructive operations like DELETE FROM and rm -rf /

How do you know a malicious actor won't try to do these things? How do you protect against it?

replies(2): >>40713887 #>>40713896 #
11. TZubiri ◴[] No.40713887{4}[source]
"the args"

You need to be more specific. In a system, everything but the output is an argument to something else. Even then, the system output is an input to the user.

So yeah, depending on what argument you are talking about you can audit it in a different way and it has different potential for abuse.

replies(1): >>40714060 #
12. viraptor ◴[] No.40713896{4}[source]
Whitelisting and permissions. You can't issue a delete if anything not starting with SELECT is rejected. And you can't have edge cases that work around that via functions if the database user the agent runs as has no permissions other than SELECT.
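A rough sketch of the whitelist side in Python (the real guarantee still comes from a DB user that only has SELECT rights):

  # Belt-and-braces: reject anything that isn't a single SELECT before it
  # ever reaches the database.
  import sqlite3

  def run_readonly(conn: sqlite3.Connection, query: str):
      stripped = query.strip().rstrip(";")
      if not stripped.upper().startswith("SELECT"):
          raise PermissionError("only SELECT statements are allowed")
      if ";" in stripped:
          raise PermissionError("multiple statements are not allowed")
      return conn.execute(stripped).fetchall()  # sqlite3 itself also refuses multi-statement execute()

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE foo (id INTEGER)")
  print(run_readonly(conn, "SELECT * FROM foo"))   # ok
  # run_readonly(conn, "DELETE FROM foo")          # raises PermissionError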
replies(1): >>40714055 #
13. fzeindl ◴[] No.40713985[source]
> ... code up a script that calls some functions with the same signature as tools and then executing that script actually ends up being more accurate than asking it to internalize its algorithm.

This is called defunctionalization and useful without LLMs as well.

replies(1): >>40715127 #
14. darkteflon ◴[] No.40714040{4}[source]
That’s very interesting. Does designing the DAG in advance imply that you have to make a new one for each particular subset of end-user questions you might receive? Or is your problem space such that you can design it once and have it be useful for everything you’re interested in?

My choice of words was poor: by “pre-baking”, I just meant: generated dynamically at runtime from the user’s query, _before_ you then set about answering that query. The nature of our problem space is such that we wouldn’t be able to design the DAG in advance of runtime and have it be useful everywhere.

The answering process itself is then handled by deterministically (in code) resolving the dependencies of the DAG in the correct order, where each node might then involve a discrete LLM call (with function) depending on the purpose. Once resolved, a node’s output is passed to the next tier of the DAG with framing context.
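For what it's worth, the deterministic resolution step can be a few lines of stdlib Python (graphlib); answer_node() below is just a stand-in for the per-node LLM call:

  # Rough sketch of resolving an already-generated DAG in dependency order.
  from graphlib import TopologicalSorter

  def answer_node(node_id: str, question: str, context: dict) -> str:
      # Placeholder for an LLM call that gets the node's sub-question plus
      # the outputs of its dependencies as framing context.
      return f"<answer to {question!r} given {sorted(context)}>"

  def resolve(dag: dict[str, set[str]], questions: dict[str, str]) -> dict[str, str]:
      """dag maps node -> set of dependency nodes; resolve dependencies first."""
      results: dict[str, str] = {}
      for node in TopologicalSorter(dag).static_order():
          context = {dep: results[dep] for dep in dag.get(node, set())}
          results[node] = answer_node(node, questions[node], context)
      return results

  dag = {"root": {"a", "b"}, "a": set(), "b": {"a"}}
  questions = {"a": "sub-question A", "b": "sub-question B", "root": "user question"}
  print(resolve(dag, questions)["root"])

graphlib's prepare()/get_ready()/done() interface also makes it easy to run independent nodes concurrently instead of in a strict sequence.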

replies(1): >>40714255 #
15. verdverm ◴[] No.40714055{5}[source]
"please get all the entries from the table foo and then remove them all"

SELECT * from foo; DELETE FROM foo ...

...because you know people will deploy a general SQL function or agent

replies(3): >>40714088 #>>40714205 #>>40715081 #
16. verdverm ◴[] No.40714060{5}[source]
The args to a function like SQL or TERMINAL
replies(1): >>40714218 #
17. viraptor ◴[] No.40714088{6}[source]
1. Lots of libraries prevent you from submitting multiple queries. It's a good idea to do that in general.

2. If only the second part of my message covered this...

replies(1): >>40714168 #
18. verdverm ◴[] No.40714168{7}[source]
1 & 2 require that you audit the agents and have uniform permissions, or additional plumbing to look up user permissions and pass those along.

Have you looked at the agents prepackaged in popular frameworks? They aren't doing permission propagation or using additional libraries as guardrails.

What are most people going to do? This is why people are hesitant and ask about auditability

Considering 2 further, I only described deletion. A read-only database is of limited value. If you have write permissions, you could alternatively change values maliciously, even if you disable deletions. And it might not even be malicious; it could be the result of an LLM error or hallucination.

19. TZubiri ◴[] No.40714205{6}[source]
That's not how it works. The user questions are expressed in business-domain language.

"give me the names of all the employees and then remove them all"

is parsed, maybe as: " employees(), delete(employees())".

It's up to the programmer to define the available functions, if employees() is available, then the first result will be provided, if not it won't.

If the function delete with a list of employees as parameter is defined, then that will be executed.

I personally work with existing implementations: traditional software that predates LLMs, typically offered through an API. There's a division of labour, a typical encapsulation at the human organization layer.

Even if you were to directly connect the database to the LLM and let GPT generate SQL queries (which is legitimate), the solution is user/role based permission, a solution as old as UNIX.

Just don't give the LLM or the LLM user-agent write permissions.
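Sketch of the dispatch side, with made-up function names: if a function isn't registered, no parse can ever reach it.

  # The LLM only maps the question onto a (function, args) pair; only
  # functions the programmer registered can run.
  ALLOWED = {
      "employees": lambda: ["Ada", "Grace"],   # read-only, exposed
      # "delete_employees" is simply not registered
  }

  def dispatch(call: dict):
      name, args = call["name"], call.get("args", [])
      fn = ALLOWED.get(name)
      if fn is None:
          return f"refused: '{name}' is not an available function"
      return fn(*args)

  # Parsed output of "give me the names of all the employees and then remove them all":
  print(dispatch({"name": "employees"}))          # works
  print(dispatch({"name": "delete_employees"}))   # refused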

replies(1): >>40714219 #
20. TZubiri ◴[] No.40714218{6}[source]
I personally don't connect LLMs to SQL, but to APIs.

But I'm pretty sure you would just give an SQL user to the LLM and enjoy the SQL server's built-in permissions and auditing features.

replies(1): >>40714254 #
21. verdverm ◴[] No.40714219{7}[source]
> That'S not how it works

They are actually quite flexible and you can do anything you want. You supply the LLM with the function names and possible args. I can easily define "sql(query: string)" as a flexible SQL function the LLM can use

re: permissions, as soon as you have write permissions, you have dangerous potential. LLMs are not reliable enough, nor are humans, which is why we use code review.

replies(1): >>40722056 #
22. verdverm ◴[] No.40714254{7}[source]
What if that user has write permissions and the LLM generates a bad UPDATE, i.e. forgets to put the WHERE clause in... even for a SELECT, how do you know the right constraints were in place and you are getting the correct data?

Read-only use cases miss a whole category. All this is to get back to the point that people want to audit the LLM before running the function because of the unreliability; there is hesitance with good reason.

replies(3): >>40714408 #>>40714689 #>>40723056 #
23. TZubiri ◴[] No.40714255{5}[source]
You don't make a DAG for each question category. This is classic OOP, the OG Kay version: you design subject-experts (objects) with autonomy and independence; they are just helpful in general. Each function/method, regardless of the Object/Expert, is an edge in the graph. A user question is simply a pair of vertices, call them I and O, and the execution/solution is a path between the two points, namely the input and the output.

The functions are traditional software (Code, API, SQL) the job of the LLM is only to:

1- Map each type of question into a subsystem/codepath. The functional parsing solution is the most advanced, but a simple version involves asking the LLM to classify a question into an enum.

2- To parse the parameters as a list of key/value tuples.

The end. Don't ask the LLM to cook your food, clean your clothes or suck your dick. LLM is revolutionary at language, let it do language tasks.

We are not consumers of a helpful AI assistant, we are designers of it.
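In sketch form (classify()/extract() stand in for JSON-constrained LLM calls, team()/league() for existing code, APIs, or SQL):

  from enum import Enum

  class Intent(Enum):
      PLAYER_TEAM = "player_team"
      PLAYER_LEAGUE = "player_league"

  def classify(question: str) -> Intent:
      raise NotImplementedError("LLM call: choose one Intent value")

  def extract(question: str) -> dict[str, str]:
      raise NotImplementedError('LLM call: return e.g. {"player": "Messi"}')

  def team(player: str) -> str:        # traditional code / API / SQL
      raise NotImplementedError

  def league(team_name: str) -> str:   # traditional code / API / SQL
      raise NotImplementedError

  HANDLERS = {
      Intent.PLAYER_TEAM:   lambda p: team(p["player"]),
      Intent.PLAYER_LEAGUE: lambda p: league(team(p["player"])),
  }

  def answer(question: str) -> str:
      return HANDLERS[classify(question)](extract(question))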

replies(2): >>40714793 #>>40717093 #
24. TZubiri ◴[] No.40714302[source]
"Agent ~=Domain Syntax Language"

But Agent!=Language

25. TZubiri ◴[] No.40714336{3}[source]
On the one hand yes, that can happen.

On the other hand, it may be a legitimate monetization strategy for Open Source libraries.

Additionally, Langchain does have a role in R&D; you can use it for experimental projects. Simply discount the self-preserving aspects of it and try to learn from its ideas, test them in non-critical projects. If it works, you can then easily replicate it with an internal tool or just plain code.

Also, it's an open-source library: how much vendor lock-in can you have if you control the code and the server? The actual dependency is on the LLM provider, and if you use something like Meta's Llama you can self-host it as well.

26. laborcontract ◴[] No.40714350{3}[source]
From the tooling perspective I’ve built my own.

Recently, I've been toying around with litellm which, so far, strikes me as the right level of abstraction. I like building my own stuff, but writing API wrappers just sucks.

I’ve also been toying around with Instructor for structured output as well. It’s incredibly convenient, but I haven’t used it for any production stuff because I don’t feel comfortable with the prompting aspect yet.

replies(1): >>40714969 #
27. politelemon ◴[] No.40714355[source]
The sample notebook linked from the post is a 404

https://github.com/cohere-ai/notebooks/blob/main/notebooks/D...

replies(1): >>40714744 #
28. TZubiri ◴[] No.40714408{8}[source]
No, the human user doesn't have permissions; the LLM system has permissions. We create a user for the process; we've been doing this since Unix, take a look at what your HTTP server runs as. There's no deputization of permissions going on here, at least on my systems.

Even if there are user-level permissions, you then use a role-based approach (a SQL user per type of user, for example accountant, manager, etc.) and restrict its permissions accordingly. I don't think the idea of restricting permissions so that we avoid users fucking the database up is new.

Many organizations have DBAs whose role is to convert user queries into SQL queries; juniors usually have tighter permissions. Also, non-technical managers and analysts can have access to the database.

As I said, not a new problem, SQL servers have mature permission systems.

If that is not enough, just write an API wrapper. It's what Amazon does anyway; Bezos' memo explicitly states that teams should not expose databases, rather they should expose APIs, under punishment of firing.

replies(1): >>40714420 #
29. verdverm ◴[] No.40714420{9}[source]
and even with that permission system, mistakes still happen; we haven't even been able to eliminate SQL injection in real systems, so these things can and will happen

adding LLMs means we have an unaudited query producer. That is the point the OP is trying to make: they want to avoid that and audit the function call before it happens, because we know LLMs are not even at our level yet, and we make mistakes and use code review to reduce them

and again, even in a read-only system, we have removed the guardrails of a human-designed form with constraints and replaced them with an unaudited LLM that we can no longer be certain returns correct or consistent results. People are rightly cautious and hesitant, preferring a system they use as a peer and can audit or review

replies(1): >>40722064 #
30. tucnak ◴[] No.40714608{3}[source]
Dify.ai is great
31. _puk ◴[] No.40714689{8}[source]
> All this is to get back to the point that people want to audit the LLM before running the function because of the unreliability, there is hesitance with good reason.

Some people - I think it's quite clear from this thread that not everyone feels the need to.

I'm now thinking that having the LLM also output its whole prompt to something like a Datadog trace function would be quite useful for review / traceability.

replies(1): >>40715489 #
32. mostelato ◴[] No.40714744[source]
Here is the correct link: https://github.com/cohere-ai/notebooks/blob/main/notebooks/a...

And here are the docs: https://docs.cohere.com/docs/multi-step-tool-use

33. mostelato ◴[] No.40714751[source]
You don't need to use LangChain with Cohere; it's just nice for demos since LangChain comes with pre-built tools.

Check out the examples here: https://docs.cohere.com/docs/multi-step-tool-use

and this notebook https://github.com/cohere-ai/notebooks/blob/main/notebooks/a...

34. darkteflon ◴[] No.40714793{6}[source]
Very interesting perspective - thanks for your time!
35. vinhnx ◴[] No.40714969{4}[source]
+1 for LiteLLM as well, I've been experimenting with LiteLLM for integrating multiple LLM providers and building my own AI chatbot, which is evolving into a multimodal AI chatbot. It's been a positive experience so far. [0]

On the other hand, I found Langchain less impressive. It feels somewhat vague and not very beginner-friendly, catering more to intermediate users.

[0] https://github.com/vinhnx/VT.ai

36. TZubiri ◴[] No.40715081{6}[source]
btw that's not how tools work at all. Tools are function/API based. (Unless you expose a function run_sql(query), but that's on you.)
replies(1): >>40715543 #
37. HeatrayEnjoyer ◴[] No.40715127{3}[source]
Like what?
38. Oras ◴[] No.40715170[source]
Cohere is underrated. I recently tried the Cohere Command R model and found it follows prompts much better than GPT-4o and even Claude Opus.

That said, it’s a bit annoying to see langchain examples all over. Not everyone uses it, and many consider it bloated and hard to maintain.

Would be great just to have a simple example in Python showing the capabilities.

replies(1): >>40720219 #
39. verdverm ◴[] No.40715489{9}[source]
Most LLM observability tools do this

I'm currently using LangFuse and exploring OpenLit because it integrates with OTel, which you should be able to forward to Datadog, IIRC from their docs.

replies(1): >>40718670 #
40. verdverm ◴[] No.40715543{7}[source]
I brought it up because popular frameworks are offering this type of agent or function out of the box

There is no "way that tools work"

You pass OpenAPI-like schemas along with the prompt and you get back a JSON object. The rest is code, and you can do anything you want with it. The LLM is merely mapping from unstructured text onto a schema as best it can, and we know they are imperfect.
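For example, roughly the shape of the exchange in the OpenAI-style layout (other providers differ in the details); note the arguments come back as a string your code still has to parse and validate:

  import json

  tool = {
      "type": "function",
      "function": {
          "name": "sql",
          "description": "Run a read-only SQL query",
          "parameters": {
              "type": "object",
              "properties": {"query": {"type": "string"}},
              "required": ["query"],
          },
      },
  }

  # What a model might hand back -- a JSON object that may be wrong or malformed:
  raw_call = '{"name": "sql", "arguments": "{\\"query\\": \\"SELECT * FROM foo\\"}"}'
  call = json.loads(raw_call)
  args = json.loads(call["arguments"])
  assert isinstance(args.get("query"), str)   # your code owns every check after this point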

replies(1): >>40723047 #
41. cpursley ◴[] No.40715632[source]
Are you working on a product or doing consulting?
replies(1): >>40721312 #
42. Scipio_Afri ◴[] No.40717093{6}[source]
Thanks for your perspective! Can you explain why oral is a bad idea?
replies(1): >>40718162 #
43. RecycledEle ◴[] No.40717871[source]
> I have a saying: "any sufficiently advanced agent is indistinguishable from a DSL"

I don't think you mean Digital Subscriber Line, so may I ask: What is a DSL in this context?

replies(1): >>40721455 #
44. TZubiri ◴[] No.40718162{7}[source]
IO STDs
45. ai4ever ◴[] No.40718481[source]
Putting the LLM in the loop makes the tooling unreliable, so the use cases would be limited to those where accuracy is not important.

Whereas a DSL still aims for accurate and deterministic modeling of the specific use case.

46. kakaly0403 ◴[] No.40718670{10}[source]
Check out Langtrace. It’s also OTel-based and integrates with Datadog.

https://github.com/Scale3-Labs/langtrace

47. OutOfHere ◴[] No.40719915[source]
I have developed multiple multi-step LLM workflows, expressible as both conditional and parallel DAGs, using mostly plain Python, and I still don't understand why these langchain-type libraries feel the need to exist. Plain Python is quite sufficient for advanced LLM workflows if you know how to use it.

LLMs are innately unreliable, and they require a lot of hand-holding and prompt-tuning to get them to work well. Getting into the low-level details of the prompts is essential. I don't want any libraries to get in the way, because I have to be able to find and cleverly prevent the failure cases that happen just 1 in 500 times.

These libraries seem to mainly just advertise each other. If I am missing something, I don't know what it is.

replies(3): >>40720149 #>>40720202 #>>40720512 #
48. leobg ◴[] No.40720149[source]
Always felt the same way, but could never put it in words as eloquently as you just did. Python (or any other programming language) already is the best glue. With these frameworks, you just waste brain cycles on learning APIs that change and break every couple of months.
49. SeriousStorm ◴[] No.40720202[source]
Every time I look into building a workflow with langchain it seems unnecessarily complex. So I end up stopping.

Are you just running an LLM server (Ollama, llama.cpp, etc) and then making API calls to that server with plain Python or is it more than that?

replies(1): >>40720334 #
50. el-ai-ne ◴[] No.40720219[source]
Hey, I work at Cohere and I'm stoked to hear Command R is following prompt instructions well. Appreciate the feedback and here is a simple example in Python: https://docs.cohere.com/docs/multi-step-tool-use#using-the-c...

The following cookbooks contain slightly more advanced code examples, using just the cohere API for multi-step: https://docs.cohere.com/page/calendar-agent https://docs.cohere.com/page/pdf-extractor https://docs.cohere.com/page/agentic-multi-stage-rag

Cheers

51. OutOfHere ◴[] No.40720334{3}[source]
I suppose ollama and llama.cpp, or at least any corresponding Python SDKs, would be good for using self-hosted models, especially if they support parallel GPU use. If it's something custom, Pytorch would come into the picture. In production workflows, it can obviously be useful to run certain LLM prompts in parallel to hasten the job.

For now I have used only cloud APIs with their Python SDKs, including the prompt completion, TTS, and embedding endpoints. They allow me to run many jobs in parallel which is useful for complex workflows or if facing heavy user demand. For caching of responses, I have used a local disk caching library, although I guess one can alternatively use a standalone or embedded database. I have used threading via `concurrent.futures` for concurrent jobs, although asyncio too would work.

The one simple external Python library I found so far is `semantic-text-splitter` for splitting long texts using token counts, but this too I could have done by myself with a bit of effort. I think langchain has something for it too.
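As a sketch of the plain-Python fan-out I mean (complete() is a placeholder for whichever provider SDK you call):

  # Run several prompts concurrently and cache results on disk, stdlib only.
  import concurrent.futures, hashlib, json, pathlib

  CACHE = pathlib.Path("llm_cache"); CACHE.mkdir(exist_ok=True)

  def complete(prompt: str) -> str:
      raise NotImplementedError("call your provider's SDK here")

  def cached_complete(prompt: str) -> str:
      key = hashlib.sha256(prompt.encode()).hexdigest()
      path = CACHE / f"{key}.json"
      if path.exists():
          return json.loads(path.read_text())
      result = complete(prompt)
      path.write_text(json.dumps(result))
      return result

  def run_parallel(prompts: list[str], workers: int = 8) -> list[str]:
      with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
          return list(pool.map(cached_complete, prompts))   # order preserved

asyncio works just as well; the point is that none of this needs a framework.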

52. etse ◴[] No.40720512[source]
If you wanted to compare OpenAI models against Anthropic or Google, wouldn't the framework help a lot? Breaking APIs is more about bad framework development than frameworks in general.

I think frameworks tend to provide an escape hatch. LlamaIndex comes to mind. It seems to me that by not learning and using an existing framework, you're building your own, which is a calculated tradeoff.

replies(1): >>40721015 #
53. OutOfHere ◴[] No.40721015{3}[source]
That is a good use case and it's a good problem to have, certainly the kind I wanted to hear, but it's not a problem I have had yet.

Moreover, I absolutely expect to have to update my prompts if I have to support a different model, even if it's a different model by the same provider. For example, there is a difference in the behavior of gpt-4-turbo vs gpt-4o even though both are by OpenAI.

Specific LLMs have specific tendencies and preferences which one has to work with. What I'm saying is that the framework will help, but it's not as simple as switching the model class.

replies(1): >>40725205 #
54. danw1979 ◴[] No.40721261[source]
It feels like the industry is heading towards automating everything using non-deterministic black boxes stuck together with proprietary glue.
replies(1): >>40722464 #
55. TZubiri ◴[] No.40721312{3}[source]
Consulting: goldenwordai.com

Product (Unreleased): silverletterai.com

56. patleeman ◴[] No.40721455{3}[source]
Domain Specific Language
replies(1): >>40746376 #
57. patleeman ◴[] No.40721499[source]
Very curious if anybody else is working on a DSL that's meant for LLMs to output?

Has anyone seen anyone using this approach? Any resources available?

replies(1): >>40744275 #
58. TZubiri ◴[] No.40722056{8}[source]
Correct, by "it" I was referring specifically to the "Tools" functionality that is the subject of the linked article.

Tools DON'T generate SQL queries, they generate function calls. Of course you can hack it to output SQL queries, you can always hack something beyond its intended design purpose, but that's not what it's supposed to be doing.

Re: permissions, nothing new here, give your LLM agent the permissions of a Junior DBA.

59. TZubiri ◴[] No.40722064{10}[source]
Again, SQL query generating agents are not the subject of the original article.
60. lnrd ◴[] No.40722464[source]
It's something that we know will backfire very spectacularly, but might happen anyway. I could see this backfiring and causing an industry-wide reversal to other technologies. I have a gut feeling that this has already happened in other areas, like companies going back to bare metal, but it's not really the best example since cloud was and still is the best solution for most companies.

Does anyone with more experience than me have memories of similar things happening? Where a technology was hyped and adopted anywhere until something happened that caused an industry-wide reversal to more established ways of doing things?

61. openmajestic ◴[] No.40722512[source]
It was called tool use before ChatGPT existed.
62. bhl ◴[] No.40723032[source]
What’s the difference between tool calling and “agents”?

How would you handle map-reduce type of tool calls where you have a lot of parallel tools that you want to merge later on? What’s a good way to scale that without running into API limits?

63. TZubiri ◴[] No.40723047{8}[source]
https://en.wikipedia.org/wiki/Robustness_principle

"be conservative in what you send, be liberal in what you accept"

The LLM parses text into a list of parameters. You design your function such that it is safe regardless of what the parameters are.
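Sketch, with made-up names: the function enforces its own invariants no matter what the LLM hands it.

  def league_of_player(player) -> str:
      if not isinstance(player, str):
          raise ValueError("player must be a string")
      player = player.strip()[:100]          # clamp obviously bad input
      known = {"Messi": "MLS"}               # whatever your real lookup is
      if player not in known:
          return "unknown player"            # safe default rather than a crash
      return known[player]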

64. TZubiri ◴[] No.40723056{8}[source]
How about stored procedures for Write operations?
65. etse ◴[] No.40725205{4}[source]
I'm not quite understanding how different prompts for different models reduce the attractiveness of a framework. A framework could theoretically have an LLM evals package to run continuous experiments of all prompts across all models.

Also theoretically, an LLM framework could estimate costs, count tokens, offer a variety of chunking strategies, unify the more sophisticated APIs, like tools or agents–all of which could vary from provider to provider.

Admittedly, this view came just from doing early product explorations, but a framework was helpful for most of the above reasons (I didn't find an evals framework that I liked).

You mentioned not having this problem yet. What kind of problems have you been running across? I'm wondering if I'm missing some other context.

66. classified ◴[] No.40725210[source]
Sensational news: LLM can flip multiple bits at once with one request! This is so awesome. How could our CPUs ever work without LLMs built in? I bet IBM had a secret LLM whisperer in all of their mainframes. To this day.
67. madrox ◴[] No.40744275{3}[source]
Funny enough, there's a DSL LLMs already tend to be trained on: JavaScript. Just teach the LLM the function signatures it has to work with, then execute the script in a VM
68. RecycledEle ◴[] No.40746376{4}[source]
Thank you.