
129 points by ericciarla | 1 comment
OutOfHere ◴[] No.40719915[source]
I have developed multiple multi-step LLM workflows, expressible as both conditional and parallel DAGs, using mostly plain Python, and I still don't understand why these langchain-type libraries feel the need to exist. Plain Python is quite sufficient for advanced LLM workflows if you know how to use it.
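
A minimal sketch of what that looks like in plain Python (all the step functions and `call_llm` here are placeholders, not anything from a real project):

```python
# Minimal sketch of a conditional + parallel "DAG" in plain Python.
# call_llm is a stand-in for whatever SDK call you actually make.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    return "stub response"  # swap in OpenAI/Anthropic/Ollama/etc.

def classify(doc: str) -> str:
    return call_llm(f"Classify this document as 'legal' or 'other':\n{doc}")

def summarize(doc: str) -> str:
    return call_llm(f"Summarize:\n{doc}")

def extract_clauses(doc: str) -> str:
    return call_llm(f"List the key clauses:\n{doc}")

def process(doc: str) -> dict:
    label = classify(doc)                       # step 1: conditional branch point
    with ThreadPoolExecutor() as pool:          # steps 2a/2b: run in parallel
        summary = pool.submit(summarize, doc)
        clauses = pool.submit(extract_clauses, doc) if "legal" in label.lower() else None
    return {
        "label": label,
        "summary": summary.result(),
        "clauses": clauses.result() if clauses else None,
    }
```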

LLMs are innately unreliable, and they require a lot of hand-holding and prompt-tuning to work well. Getting into the low-level details of the prompts is essential. I don't want any library getting in the way, because I have to be able to find and cleverly prevent the failure cases that happen just 1 in 500 times.
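
Concretely, the hand-holding mostly means validating the output yourself and retrying with an amended prompt. A rough sketch (again with `call_llm` as a placeholder and a made-up validation rule):

```python
# Rough sketch: validate the raw output and retry with an amended prompt
# to catch the rare malformed responses.
import json

def call_llm(prompt: str) -> str:
    return '{"sentiment": "positive"}'  # placeholder for your actual SDK call

def extract_json(prompt: str, max_tries: int = 3) -> dict:
    last_error = None
    for attempt in range(max_tries):
        amended = prompt if attempt == 0 else (
            f"{prompt}\n\nYour previous reply was invalid ({last_error}). "
            "Respond with a single JSON object and nothing else."
        )
        raw = call_llm(amended)
        try:
            parsed = json.loads(raw)
            if "sentiment" not in parsed:       # domain-specific sanity check
                raise ValueError("missing 'sentiment' key")
            return parsed
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = exc
    raise RuntimeError(f"still failing after {max_tries} tries: {last_error}")
```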

These libraries seem to mainly just advertise each other. If I am missing something, I don't know what it is.

replies(3): >>40720149 #>>40720202 #>>40720512 #
SeriousStorm ◴[] No.40720202[source]
Every time I look into building a workflow with langchain, it seems unnecessarily complex, so I end up stopping.

Are you just running an LLM server (Ollama, llama.cpp, etc.) and making API calls to that server with plain Python, or is it more than that?

replies(1): >>40720334 #
1. OutOfHere ◴[] No.40720334[source]
I suppose ollama and llama.cpp, or at least their corresponding Python SDKs, would be good for using self-hosted models, especially if they support parallel GPU use. If it's something custom, PyTorch would come into the picture. In production workflows, it is obviously useful to be able to run certain LLM prompts in parallel to speed up the job.
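
I haven't run that setup myself, but with Ollama it should be little more than an HTTP call from plain Python (assuming its default local endpoint and JSON response shape; the model name is just whatever you have pulled):

```python
# Plain-Python call to a locally running Ollama server, no framework needed.
# Assumes Ollama's default endpoint and non-streaming response format.
import requests

def generate(prompt: str, model: str = "llama3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("Explain what a DAG is in one sentence."))
```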

For now I have used only cloud APIs with their Python SDKs, including the prompt-completion, TTS, and embedding endpoints. They let me run many jobs in parallel, which is useful for complex workflows or when facing heavy user demand. For caching responses, I have used a local disk-caching library, although one could alternatively use a standalone or embedded database. I have used threading via `concurrent.futures` for concurrent jobs, although asyncio would also work.
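
Put together, the cloud setup is roughly this (using the OpenAI SDK and the `diskcache` library purely as examples; any provider SDK and caching layer would do, and the model name is just an illustration):

```python
# Sketch: parallel cloud-API calls with a local disk cache of responses.
from concurrent.futures import ThreadPoolExecutor
from diskcache import Cache
from openai import OpenAI

client = OpenAI()              # reads OPENAI_API_KEY from the environment
cache = Cache("./llm_cache")   # persistent on-disk response cache

def complete(prompt: str, model: str = "gpt-4o-mini") -> str:
    key = (model, prompt)
    cached = cache.get(key)
    if cached is not None:
        return cached
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    cache[key] = text
    return text

prompts = [f"Summarize chapter {i} in one line." for i in range(1, 11)]
with ThreadPoolExecutor(max_workers=8) as pool:  # many jobs in parallel
    summaries = list(pool.map(complete, prompts))
```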

The one simple external Python library I have found useful so far is `semantic-text-splitter`, for splitting long texts by token count, though this too I could have done myself with a bit of effort. I think langchain has something for this as well.
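
The DIY version is roughly this, using tiktoken for the token counts (a naive sketch that splits on raw token boundaries; the real library is smarter about splitting at sentence and paragraph breaks):

```python
# Naive DIY splitter: chunk long text by token count with a small overlap.
import tiktoken

def split_by_tokens(text: str, max_tokens: int = 512, overlap: int = 32) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max_tokens - overlap
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), step)]
```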