I want a local ChatGPT fine tuned on my personal data running on my own device, not in the cloud. Ideally open source too, llama.cpp is looking like the best bet to achieve that!
Pretty sure this is what you're looking for: https://github.com/SamurAIGPT/privateGPT
Fine-tuning is good for teaching it how to act, but not great for reciting/recalling data.
An alternative method is to index content in a database and then insert contextual hints into the LLM's prompt that give it extra information and detail with which to answer on the fly.
That database can use semantic similarity (i.e. via a vector database), keyword search, or other ranking methods to decide what context to inject into the prompt.
PrivateGPT uses this method: it reads files, extracts their content, splits the documents into small-enough-to-fit-into-prompt bits, and then indexes them into a database. Then, at query time, it inserts context into the LLM prompt.
The repo uses LangChain as boilerplate, but it's pretty easy to do manually or with other frameworks.
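To show how simple the manual version can be, here's a minimal sketch of the retrieve-then-prompt loop, assuming sentence-transformers for embeddings (the chunks, model name, and prompt template are all made up for illustration):

    # Minimal retrieve-then-prompt sketch (illustrative, not privateGPT's actual code)
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Index: split documents into prompt-sized chunks and embed them
    chunks = [
        "Invoices are stored in ~/Documents/invoices, one PDF per month.",
        "The home server backs up nightly at 02:00 via restic.",
    ]
    chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)

    # 2. Query time: embed the question and rank chunks by cosine similarity
    question = "When do backups run?"
    q_vec = embedder.encode(question, convert_to_tensor=True)
    best_chunk = chunks[int(util.cos_sim(q_vec, chunk_vecs)[0].argmax())]

    # 3. Inject the retrieved context into the LLM prompt
    prompt = f"Use the context to answer.\nContext: {best_chunk}\nQuestion: {question}\nAnswer:"
    # ...then hand `prompt` to whatever local model you're running (llama.cpp, GPT4All, etc.)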
(PS if anyone wants this type of local LLM + document Q/A and agents, it's something I'm working on as a supported product integrated into macOS, and using ggml; see profile)
Core pieces: GPT4All (LLM interface/bindings), Chroma (vector store), HuggingFaceEmbeddings (for embeddings), and LangChain to tie everything together.
https://github.com/SamurAIGPT/privateGPT/blob/main/server/pr...
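Roughly, the wiring looks like this, sketched against the 2023-era LangChain API (model path and embedding model name are placeholders):

    # Sketch of the privateGPT-style pipeline; paths/names are placeholders
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.llms import GPT4All
    from langchain.chains import RetrievalQA

    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    # Ingest: chunks come from splitting your documents beforehand
    db = Chroma.from_texts(["chunk one ...", "chunk two ..."], embeddings, persist_directory="db")

    # Query: retrieve similar chunks and stuff them into the local model's prompt
    llm = GPT4All(model="./models/ggml-gpt4all-j.bin")
    qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())
    print(qa.run("What does chunk one say?"))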
Fine-tuning is like having the model take a class on a certain subject. By the end of the class, it's going to have a general understanding of how to do that thing, but it's probably going to struggle when trying to quote the textbooks verbatim.
A good use-case for fine-tuning is teaching it a response style or format. If you fine-tune a model to only respond in JSON, then you no longer need to include formatting instructions in your prompt to get a JSON output.
In other words, it's like having a spouse/partner: there are certain ways we communicate where we simply know where the other person is at or what they actually mean.
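As a concrete sketch of the JSON case above, the training pairs might look something like this (hypothetical task, prompt/completion style):

    # Hypothetical fine-tuning pairs teaching JSON-only responses
    examples = [
        {"prompt": "Extract the name and city from: Ada lives in London.",
         "completion": '{"name": "Ada", "city": "London"}'},
        {"prompt": "Extract the name and city from: Bob moved to Oslo.",
         "completion": '{"name": "Bob", "city": "Oslo"}'},
    ]
    # After enough pairs like these, "respond only in JSON" no longer
    # needs to be restated in every prompt.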
W = W0 + B A
Where W0 is the pretrained model's weight matrix, which is kept fixed, and B and A are matrices with a much lower rank than the original (say r = 4).
It has been shown (as mentioned in the LoRA paper) that training for specific tasks results in low-rank corrections, so that's what this is all about. I think LoRA training can be done locally.
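To make the sizes concrete, here's a minimal numpy sketch of the update (dimensions are made up; per the paper, B is initialized to zero so training starts from W0):

    # LoRA update W = W0 + B @ A with illustrative sizes
    import numpy as np

    d, k, r = 1024, 1024, 4        # full dims; r is the low rank
    W0 = np.random.randn(d, k)     # frozen pretrained weights
    B = np.zeros((d, r))           # trained; zero-init so W starts at W0
    A = np.random.randn(r, k)      # trained

    W = W0 + B @ A                 # effective weights at inference time

    # Only B and A are trained: d*r + r*k = 8,192 params
    # vs d*k = 1,048,576 for a full update (~128x fewer)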
Even if you're using OpenAI's models, gpt-3.5-turbo is going to be much better (cheaper, bigger context window, higher quality) than any of their models that can be fine-tuned.
But if you're able to fine-tune a local model, then a combination of fine-tuning and embedding is probably going to give you better results than embedding alone.