I want a local ChatGPT fine tuned on my personal data running on my own device, not in the cloud. Ideally open source too, llama.cpp is looking like the best bet to achieve that!
Pretty sure this is what you're looking for: https://github.com/SamurAIGPT/privateGPT
Fine-tuning is good for teaching it how to act, but not great for reciting/recalling data.
An alternative method is to index content in a database and then insert contextual hints into the LLM's prompt that give it extra information and detail with which to answer on the fly.
That database can use semantic similarity (i.e. via a vector database), keyword search, or other ranking methods to decide what context to inject into the prompt.
PrivateGPT uses this method: it reads files, extracts their content, splits the documents into small-enough-to-fit-into-prompt bits, and then indexes them into a database. Then, at query time, it inserts context into the LLM prompt.
The repo uses LangChain as boilerplate, but it's pretty easy to do manually or with other frameworks.
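To show how simple the manual version can be, here's a minimal sketch of the retrieve-then-prompt loop, assuming sentence-transformers for embeddings (the chunks, model name, and prompt template are all made up for illustration):

    # Minimal retrieve-then-prompt sketch (illustrative, not privateGPT's actual code)
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Index: split documents into prompt-sized chunks and embed them
    chunks = [
        "Invoices are stored in ~/Documents/invoices, one PDF per month.",
        "The home server backs up nightly at 02:00 via restic.",
    ]
    chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)

    # 2. Query time: embed the question and rank chunks by cosine similarity
    question = "When do backups run?"
    q_vec = embedder.encode(question, convert_to_tensor=True)
    best_chunk = chunks[int(util.cos_sim(q_vec, chunk_vecs)[0].argmax())]

    # 3. Inject the retrieved context into the LLM prompt
    prompt = f"Use the context to answer.\nContext: {best_chunk}\nQuestion: {question}\nAnswer:"
    # ...then hand `prompt` to whatever local model you're running (llama.cpp, GPT4All, etc.)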
(PS if anyone wants this type of local LLM + document Q/A and agents, it's something I'm working on as a supported product integrated into macOS, and using ggml; see profile)
Core pieces: GPT4All (LLM interface/bindings), Chroma (vector store), HuggingFaceEmbeddings (for embeddings), and LangChain to tie everything together.
https://github.com/SamurAIGPT/privateGPT/blob/main/server/pr...
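Roughly, the wiring looks like this, sketched against the 2023-era LangChain API (model path and embedding model name are placeholders):

    # Sketch of the privateGPT-style pipeline; paths/names are placeholders
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.llms import GPT4All
    from langchain.chains import RetrievalQA

    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    # Ingest: chunks come from splitting your documents beforehand
    db = Chroma.from_texts(["chunk one ...", "chunk two ..."], embeddings, persist_directory="db")

    # Query: retrieve similar chunks and stuff them into the local model's prompt
    llm = GPT4All(model="./models/ggml-gpt4all-j.bin")
    qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())
    print(qa.run("What does chunk one say?"))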
Fine-tuning is like having the model take a class on a certain subject. By the end of the class, it's going to have a general understanding of how to do that thing, but it's probably going to struggle when trying to quote the textbooks verbatim.
A good use-case for fine-tuning is teaching it a response style or format. If you fine-tune a model to only respond in JSON, then you no longer need to include formatting instructions in your prompt to get a JSON output.
In other words, it's like having a spouse/partner: there are certain ways we communicate where we simply know where the other person is at or what they actually mean.
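As a concrete sketch of the JSON case above, the training pairs might look something like this (hypothetical task, prompt/completion style):

    # Hypothetical fine-tuning pairs teaching JSON-only responses
    examples = [
        {"prompt": "Extract the name and city from: Ada lives in London.",
         "completion": '{"name": "Ada", "city": "London"}'},
        {"prompt": "Extract the name and city from: Bob moved to Oslo.",
         "completion": '{"name": "Bob", "city": "Oslo"}'},
    ]
    # After enough pairs like these, "respond only in JSON" no longer
    # needs to be restated in every prompt.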
W = W0 + B A
Where W0 is the pretrained model's weight matrix, which is kept fixed, and B and A are matrices with a much lower rank than the original (say r = 4).
It has been shown (as mentioned in the LoRA paper) that training for specific tasks results in low-rank corrections, so that's what this is all about. I think LoRA training can be done locally.
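To make the sizes concrete, here's a minimal numpy sketch of the update (dimensions are made up; per the paper, B is initialized to zero so training starts from W0):

    # LoRA update W = W0 + B @ A with illustrative sizes
    import numpy as np

    d, k, r = 1024, 1024, 4        # full dims; r is the low rank
    W0 = np.random.randn(d, k)     # frozen pretrained weights
    B = np.zeros((d, r))           # trained; zero-init so W starts at W0
    A = np.random.randn(r, k)      # trained

    W = W0 + B @ A                 # effective weights at inference time

    # Only B and A are trained: d*r + r*k = 8,192 params
    # vs d*k = 1,048,576 for a full update (~128x fewer)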
Even if you're using OpenAI's models, gpt-3.5-turbo is going to be much better (cheaper, bigger context window, higher quality) than any of their models that can be fine-tuned.
But if you're able to fine-tune a local model, then a combination of fine-tuning and embedding is probably going to give you better results than embedding alone.