Most active commenters
  • jart(8)
  • brucethemoose2(7)
  • LoganDark(6)
  • SparkyMcUnicorn(6)
  • okhuman(6)
  • (6)
  • smiley1437(5)
  • baobabKoodaa(4)
  • Tan-Aki(4)
  • jgrahamc(4)

899 points georgehill | 243 comments
1. Havoc ◴[] No.36215833[source]
How common is avx on edge platforms?
replies(2): >>36216269 #>>36217034 #
2. ◴[] No.36215875[source]
3. _20p0 ◴[] No.36215876[source]
This guy is damned good. I sponsored him on Github because his software is dope. I also like how when some controversy erupted on the project he just ejected the controversial people and moved on. Good stewardship. Great code.

I recall that when he first ported it, it worked on my M1 Max even though he hadn't yet tested it on Apple Silicon, since he didn't have the hardware.

Honestly, with this and whisper, I am a huge fan. Good luck to him and the new company.

replies(4): >>36216131 #>>36216191 #>>36216199 #>>36216264 #
4. TechBro8615 ◴[] No.36215882[source]
I believe ggml is the basis of llama.cpp (the OP says it's "used by llama.cpp")? I don't know much about either, but when I read the llama.cpp code to see how it was created so quickly, I got the sense that the original project was ggml, given the amount of pasted code I saw. It seemed like quite an impressive library.
replies(2): >>36215954 #>>36218722 #
5. rvz ◴[] No.36215936[source]
> Nat Friedman and Daniel Gross provided the pre-seed funding.

Why? Why should VCs get involved again?

They are just going to look for an exit and end up getting acquired by Apple Inc.

Not again.

replies(5): >>36215977 #>>36216061 #>>36216214 #>>36216267 #>>36239156 #
6. kgwgk ◴[] No.36215954[source]
https://news.ycombinator.com/item?id=33877893

“OpenAI recently released a model for automatic speech recognition called Whisper. I decided to reimplement the inference of the model from scratch using C/C++. To achieve this I implemented a minimalistic tensor library in C and ported the high-level architecture of the model in C++.”

That “minimalistic tensor library” was ggml.

7. _20p0 ◴[] No.36215977[source]
It's possible to do whatever you want without VCs. The code is open source so you can start where he's starting from and run a purely different enterprise if you desire.
8. danieljanes ◴[] No.36216025[source]
Does GGML support training on the edge? We're especially interested in training support for Android+iOS
replies(3): >>36216069 #>>36216179 #>>36225580 #
9. throw74775 ◴[] No.36216061[source]
Do you have pre-seed funding to give him?
replies(1): >>36216217 #
10. ◴[] No.36216068[source]
11. ◴[] No.36216069[source]
12. nivekney ◴[] No.36216106[source]
On a similar thread, how does it compare to Hippoml?

Context: https://news.ycombinator.com/item?id=36168666

replies(1): >>36216469 #
13. killthebuddha ◴[] No.36216131[source]
Another important detail about the ejections that I think is particularly classy is that the people he ejected are broadly considered to have world-class technical skills. In other words, he was very explicitly prioritizing collaborative potential > technical skill. Maybe a future BDFL[1]!

[1] https://en.wikipedia.org/wiki/Benevolent_dictator_for_life

replies(1): >>36219666 #
14. world2vec ◴[] No.36216161[source]
Might be a silly question but is GGML a similar/competing library to George Hotz's tinygrad [0]?

[0] https://github.com/geohot/tinygrad

replies(2): >>36216187 #>>36218539 #
15. svantana ◴[] No.36216179[source]
Yes - look at the file tests/test-opt.c. Unfortunately there's almost no documentation about its training/autodiff.
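
For anyone who wants to poke at it, here is a rough sketch of the pattern that tests/test-opt.c follows (illustrative only; exact function and enum names can differ between ggml versions, so verify against your checkout):

    // Minimal sketch: minimize f = sum((x - y)^2) with ggml's built-in optimizer.
    // Names follow the pattern in tests/test-opt.c; check them against your ggml version.
    #include "ggml.h"

    int main(void) {
        struct ggml_init_params ip = { /*mem_size*/ 128*1024*1024, /*mem_buffer*/ NULL, /*no_alloc*/ false };
        struct ggml_context * ctx = ggml_init(ip);

        struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 16); // trainable
        struct ggml_tensor * y = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 16); // target
        ggml_set_f32(x, 0.0f);
        ggml_set_f32(y, 1.0f);
        ggml_set_param(ctx, x); // mark x as a parameter so gradients are computed for it

        struct ggml_tensor * f = ggml_sum(ctx, ggml_sqr(ctx, ggml_sub(ctx, x, y)));

        // run the optimizer (Adam here; L-BFGS is also available)
        struct ggml_opt_params op = ggml_opt_default_params(GGML_OPT_ADAM);
        ggml_opt(ctx, op, f);

        ggml_free(ctx);
        return 0;
    }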
16. qeternity ◴[] No.36216187[source]
No, GGML is a CPU-optimized library and quantized weight format that is closely linked to his other project, llama.cpp.
replies(2): >>36216244 #>>36216266 #
17. evanwise ◴[] No.36216191[source]
What was the controversy?
replies(2): >>36216394 #>>36216585 #
18. samwillis ◴[] No.36216196[source]
ggml and llama.cpp are such a good platform for local LLMs, having some financial backing to support development is brilliant. We should be concentrating as much as possible on doing local inference (and training) based on private data.

I want a local ChatGPT fine tuned on my personal data running on my own device, not in the cloud. Ideally open source too, llama.cpp is looking like the best bet to achieve that!

replies(6): >>36216377 #>>36216465 #>>36216508 #>>36217604 #>>36217847 #>>36221973 #
19. nchudleigh ◴[] No.36216199[source]
he has been amazing to watch and has even helped me out with my app that uses his whisper.cpp project (https://superwhisper.com)

Excited to see how his venture goes!

20. sroussey ◴[] No.36216214[source]
Daniel Gross is a good guy, and yes, his company did get acquired by Apple a while back, but he loves to foster really dope stuff by amazing people, and ggml certainly fits the bill. And this looks like an angel investment, not a VC one, if that makes any difference to you.
replies(1): >>36224448 #
21. jgrahamc ◴[] No.36216217{3}[source]
I do.
replies(1): >>36238452 #
22. stri8ed ◴[] No.36216244{3}[source]
How does the quantization happen? Are the weights preprocessed before loading the model?
replies(2): >>36216303 #>>36216321 #
23. PrimeMcFly ◴[] No.36216264[source]
> I also like how when some controversy erupted on the project he just ejected the controversial people and moved on. Good stewardship

Do you have more info on the controversy? I'm not sure ejecting developers just because of controversy is honestly good stewardship.

replies(2): >>36216584 #>>36218505 #
24. ggerganov ◴[] No.36216266{3}[source]
ggml started with focus on CPU inference, but lately we have been augmenting it with GPU support. Although still in development, it already has partial CUDA, OpenCL and Metal backend support
replies(3): >>36216327 #>>36216442 #>>36219452 #
25. okhuman ◴[] No.36216267[source]
+1. VC involvement in projects like these always pivots the team away from the core competency of what you'd expect them to deliver - into some commercialization aspect that converts only a tiny fraction of the community yet takes up 60%+ of the core developer team's time.

I don't know why project founders head this way...as the track records of leaders who do this end up disappointing the involved community at some point. Look to Matt Klein and the Cloud Native Computing Foundation with Envoy for a somewhat decent model of how to do this better.

We continue down the Open Core model yet it continues to fail communities.

replies(2): >>36216886 #>>36218615 #
26. svantana ◴[] No.36216269[source]
Edge just means that the computing is done close to the I/O data, so that includes PCs and such.
27. sebzim4500 ◴[] No.36216303{4}[source]
Yes, but to my knowledge it doesn't do any of the complicated optimization stuff that SOTA quantisation methods use. It basically is just doing a bunch of rounding.

There are advantages to simplicity, after all.

replies(1): >>36216416 #
28. kretaceous ◴[] No.36216311[source]
Georgi's Twitter announcement: https://twitter.com/ggerganov/status/1666120568993730561
replies(1): >>36216686 #
29. ggerganov ◴[] No.36216321{4}[source]
The weights are preprocessed into integer quants combined with scaling factors in various configurations (4, 5, 8-bits and recently more exotic 2, 3 and 6-bit quants). At runtime, we use efficient SIMD implementations to perform the matrix multiplication at integer level, carefully optimizing for both compute and memory bandwidth. Similar strategies are applied when running GPU inference - using custom kernels for fast Matrix x Vector multiplications
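
To make that concrete, here is a deliberately simplified sketch of the block-quantization idea (a Q4_0-style layout: 32 weights per block, one fp32 scale, signed 4-bit quants). It is illustrative only; the actual ggml code packs two quants per byte and fuses dequantization into the SIMD/GPU kernels:

    // Illustrative only: quantize 32 floats into 4-bit values plus one scale factor.
    #include <cmath>
    #include <cstdint>

    struct BlockQ4 {
        float  scale;   // per-block scaling factor
        int8_t q[32];   // quantized values in [-8, 7] (left unpacked here for clarity)
    };

    void quantize_block(const float * w, BlockQ4 * out) {
        float amax = 0.0f;
        for (int i = 0; i < 32; ++i) amax = std::fmax(amax, std::fabs(w[i]));
        out->scale = amax / 7.0f;                        // map [-amax, amax] onto [-7, 7]
        const float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
        for (int i = 0; i < 32; ++i) {
            int q = (int)std::round(w[i] * inv);         // simple rounding, no fancy search
            out->q[i] = (int8_t)(q < -8 ? -8 : (q > 7 ? 7 : q));
        }
    }

    float dequantize(const BlockQ4 & b, int i) { return b.scale * (float)b.q[i]; }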
30. qeternity ◴[] No.36216327{4}[source]
Hi Georgi - thanks for all the work, have been following and using since the availability of Llama base layers!

Wasn’t implying it’s CPU only, just that it started as a CPU optimized library.

31. yukIttEft ◴[] No.36216376[source]
Its graph execution is still full of busyloops, e.g.:

https://github.com/ggerganov/llama.cpp/blob/44f906e8537fcec9...

I wonder how much more efficient it would be if the Taskflow lib were used instead, or even Intel TBB.
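
For comparison, a minimal sketch of what the Taskflow style would look like for two dependent graph nodes (assuming Taskflow's emplace/precede API; whether it would actually beat the current spin-wait scheme for ggml's fine-grained nodes would need measuring):

    // Sketch of expressing node dependencies with Taskflow instead of
    // busy-waiting worker threads. Purely illustrative; not ggml code.
    #include <taskflow/taskflow.hpp>

    int main() {
        tf::Executor executor;      // worker pool that parks idle threads
        tf::Taskflow graph;

        auto matmul  = graph.emplace([] { /* compute node A, e.g. a mat-vec */ });
        auto addbias = graph.emplace([] { /* compute node B, consumes A's output */ });
        matmul.precede(addbias);    // B runs only after A finishes

        executor.run(graph).wait(); // workers sleep between tasks instead of spinning
        return 0;
    }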

replies(4): >>36217006 #>>36217540 #>>36217840 #>>36218226 #
32. brucethemoose2 ◴[] No.36216377[source]
If MeZO gets implemented, we are basically there: https://github.com/princeton-nlp/MeZO
replies(1): >>36216988 #
33. pubby ◴[] No.36216394{3}[source]
https://github.com/ggerganov/llama.cpp/pull/711
34. brucethemoose2 ◴[] No.36216416{5}[source]
It's not so simple anymore, see https://github.com/ggerganov/llama.cpp/pull/1684
35. freedomben ◴[] No.36216442{4}[source]
As a person burned by nvidia, I can't thank you enough for the OpenCL support
36. rvz ◴[] No.36216465[source]
> ggml and llama.cpp are such a good platform for local LLMs, having some financial backing to support development is brilliant

The problem is, this financial backing and support is via VCs, who will steer the project to close it all up again.

> I want a local ChatGPT fine tuned on my personal data running on my own device, not in the cloud. Ideally open source too, llama.cpp is looking like the best bet to achieve that!

I think you are setting yourself up for disappointment in the future.

replies(3): >>36216838 #>>36217184 #>>36219154 #
37. brucethemoose2 ◴[] No.36216469[source]
We don't necessarily know... Hippo is closed source for now.

It's comparable to Apache TVM's Vulkan in speed on CUDA, see https://github.com/mlc-ai/mlc-llm

But honestly, the biggest advantage of llama.cpp for me is being able to split a model so performantly. My puny 16GB laptop can just barely, but very practically, run LLaMA 30B at almost 3 tokens/s, and do it right now. That is crazy!

replies(1): >>36217701 #
38. behnamoh ◴[] No.36216508[source]
I wonder if ClosedAI and other companies use the findings of the open source community in their products. For example, do they use QLORA to reduce the costs of training and inference? Do they quantize their models to serve non-subscribing consumers?
replies(2): >>36216688 #>>36217149 #
39. mliker ◴[] No.36216559[source]
Congrats! I was just listening to your Changelog interview from months ago in which you said you were going to move on from this after you brushed up the code a bit, but it seems the momentum is too great. Glad to see you carrying these amazing projects forward!
40. freedomben ◴[] No.36216584{3}[source]
Right. More details needed to know if this is good stewardship (ejecting two toxic individuals) or laziness (ejecting a villain and a hero to get rid of the "problem" easily). TikTok was using this method for a while by ejecting both bullies and victims, and it "solved" the problem but most people see the injustice there.

I'm not saying it was bad stewardship, I honestly don't know. I just agree that we shouldn't make a judgment without more information.

replies(3): >>36216964 #>>36218213 #>>36218398 #
41. kgwgk ◴[] No.36216585{3}[source]
https://news.ycombinator.com/item?id=35411909
42. jgrahamc ◴[] No.36216686[source]
Cool. I've just started sponsoring him on GitHub.
43. danielbln ◴[] No.36216688{3}[source]
Not disagreeing with your points, but saying "ClosedAI" is about as clever as writing M$ for Microsoft back in the day, which is to say not very.
replies(4): >>36216958 #>>36217145 #>>36218362 #>>36218979 #
44. okhuman ◴[] No.36216775[source]
The establishment of ggml.ai, a company focusing on ggml and llama.cpp (the most innovative and exciting platform to come along for local LLMs), on an Open Core model is just laziness.

Just because you can (and have the connections), doesn't mean you should. It's a sad state of OSS when the best and brightest developers/founders reach for antiquated models.

Maybe we should adopt a new rule in OSS communities that says you must release your CORE software as MIT at the same time you plan to go Open Core (and no sooner).

Why should OSS communities take on your product market fit?!

replies(1): >>36216855 #
45. ulchar ◴[] No.36216838{3}[source]
> The problem is, this financial backing and support is via VCs, who will steer the project to close it all up again.

How exactly could they meaningfully do that? Genuine question. The issue with the OpenAI business model is that the collaboration within academia and open source circles is creating innovations that are on track to out-pace the closed source approach. Does OpenAI have the pockets to buy the open source collaborators and researchers?

I'm truly cynical about many aspects of the tech industry but this is one of those fights that open source could win for the betterment of everybody.

replies(2): >>36217177 #>>36217454 #
46. wmf ◴[] No.36216855[source]
This looks off-topic since GGML has not announced anything about open core and their software is already MIT.

More generally, if you want to take away somebody's business model you need to provide one that works. It isn't easy.

replies(1): >>36217213 #
47. wmf ◴[] No.36216886{3}[source]
Developers shouldn't be unpaid slaves to the community.
replies(1): >>36217483 #
48. conjecTech ◴[] No.36216903[source]
Congratulations! How do you plan to make money?
replies(1): >>36217079 #
49. loa_in_ ◴[] No.36216958{4}[source]
I'd say saying M$ makes it harder for M$ to find out I'm talking about them on the indexed web because it's more ambiguous, and that's all I need to know.
replies(1): >>36218186 #
50. csmpltn ◴[] No.36216964{4}[source]
> More details needed to know if this is good stewardship (ejecting two toxic individuals) or laziness (ejecting a villain and a hero to get rid of the "problem" easily).

Man, nobody has time for this shit. Leave the games and the drama for the social justice warriors and the furries. People building shit ain't got time for this - ejecting trouble makers is the right way to go regardless of which "side" they're on.

replies(3): >>36217227 #>>36218052 #>>36218144 #
51. moffkalast ◴[] No.36216988{3}[source]
Basically there, with what kind of VRAM and processing requirements? I doubt anyone running on a CPU can fine tune in a time frame that doesn't give them an obsolete model when they're done.
replies(1): >>36217136 #
52. moffkalast ◴[] No.36217006[source]
Someone ought to be along with a PR eventually.
53. binarymax ◴[] No.36217034[source]
svantana is correct that PCs are edge, but if you meant "mobile", then ARM in iOS and Android typically have NEON instructions for SIMD, not AVX: https://developer.arm.com/Architectures/Neon
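
For a concrete sense of the difference, here is the same float dot product written with AVX+FMA and with NEON intrinsics (a hedged sketch, assuming n is a multiple of the vector width; not taken from ggml itself):

    // Same dot product, two SIMD instruction sets. Illustrative only.
    #if defined(__AVX__) && defined(__FMA__)
    #include <immintrin.h>
    float dot(const float * a, const float * b, int n) {   // n % 8 == 0 assumed
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);
        float t[8];
        _mm256_storeu_ps(t, acc);
        return t[0]+t[1]+t[2]+t[3]+t[4]+t[5]+t[6]+t[7];
    }
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    float dot(const float * a, const float * b, int n) {   // n % 4 == 0 assumed
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int i = 0; i < n; i += 4)
            acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
        return vaddvq_f32(acc); // horizontal add (AArch64)
    }
    #endif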
replies(1): >>36217403 #
54. ggerganov ◴[] No.36217079[source]
I'm planning to write code and have fun!
replies(3): >>36217727 #>>36219393 #>>36219661 #
55. nl ◴[] No.36217136{4}[source]
According to the paper it fine tunes at the speed of inference (!!)

This would make fine-tuning a quantized 13B model achievable in ~0.3 seconds per training example on a CPU.

replies(6): >>36217261 #>>36217324 #>>36217354 #>>36217827 #>>36218026 #>>36218841 #
56. rafark ◴[] No.36217145{4}[source]
I think it’s ironic that M$ made ClosedAI.
replies(1): >>36218112 #
57. jmoss20 ◴[] No.36217149{3}[source]
Quantization is hardly a "finding of the open source community". (IIRC the first TPU was int8! Though the tradition is much older than that.)
58. maxilevi ◴[] No.36217177{4}[source]
I agree with the spirit but saying that open source is on track to outpace OpenAI in innovation is just not true. Open source models are being compared to GPT3.5, none yet even get close to GPT4 quality and they finished that last year.
replies(1): >>36218569 #
59. jdonaldson ◴[] No.36217184{3}[source]
> I think you are setting yourself up for disappointment in the future.

Why would you say that?

replies(1): >>36237909 #
60. okhuman ◴[] No.36217213{3}[source]
Agreed with you 100% - it's not easy. Sometimes I just wish someone as talented as Georgi would innovate not just on the core tech side but bring that same tenacity to the licensing side, in a way that aligns incentives better and tries out something new. And that the community would have his back if some new approach failed, no matter what.
61. freedomben ◴[] No.36217227{5}[source]
I would agree that there needs to be a balance, because wasting time babysitting adults is dumb, but what if one person is a good and loved contributor, and the other is a social justice warrior new to the project who is picking fights with the contributor? Your philosophy makes for not only bad stewardship but an injustice. I'm not suggesting this is the only scenario, just merely a hypothetical that I think illustrates my position.
62. moffkalast ◴[] No.36217261{5}[source]
Wow, if that's true then it's genuinely a complete gamechanger for LLMs as a whole. You probably mean more like 0.3s per token, not per example, but that's still more like one or two minutes per training case, not like a day for four cases like it is now.
63. aryamaan ◴[] No.36217295[source]
Could someone talk, at a high level, about how one starts contributing to these kinds of problems?

For the people who build solutions for data handling - ranging from CRUD to highly scalable systems - these things are alien concepts. (Or maybe I am just talking about myself.)

64. valval ◴[] No.36217324{5}[source]
I think more importantly, what would the fine tuning routine look like? It's a non-trivial task to dump all of your personal data into any LLM architecture.
65. f_devd ◴[] No.36217354{5}[source]
MeZO assumes a smooth parameter space, so you probably won't be able to do it with INT4/8 quantization, probably needs fp8 or smoother.
66. Havoc ◴[] No.36217403{3}[source]
I was thinking more of edge in the distributed serverless sense, but I guess for this type of use the compute part is the slow bit, not the latency, so the question doesn't make much sense in hindsight.
replies(1): >>36218877 #
67. yyyk ◴[] No.36217454{4}[source]
I've been going on and on about this in HN: Open source can win this fight, but I think OSS is overconfident. We need to be clear there are serious challenges ahead - ClosedAI and other corporations also have a plan, a plan that has good chances unless properly countered:

A) Embed OpenAI (etc.) API everywhere. Make embedding easy and trivial. First to gain a small API/install moat (user/dev: 'why install OSS model when OpenAI is already available with an OS API?'). If it's easy to use OpenAI but not open source they have an advantage. Second to gain brand. But more importantly:

B) Gain a technical moat by having a permanent data advantage using the existing install base (see above). Retune constantly to keep it.

C) Combine with existing proprietary data stores to increase local data advantage (e.g. easy access for all your Office 365/GSuite documents, while OSS gets the scary permission prompts).

D) Combine with existing proprietary moats to mutually reinforce.

E) Use selective copyright enforcement to increase data advantage.

F) Lobby legislators for limits that make competition (open or closed source) way harder.

TL;DR: OSS is probably catching up on algorithms. When it comes to good data and good integrations OSS is far behind and not yet catching up. It's been argued that OpenAI's entire performance advantage is due to having better data alone, and they intend to keep that advantage.

replies(1): >>36218897 #
68. okhuman ◴[] No.36217483{4}[source]
You're right. I just wish this decision had been taken to the community; we could have all come together to help and support during these difficult/transitional times. :( Maybe this decision was rushed or is money related, who knows the actual circumstances.

Here's the Matt K article https://mattklein123.dev/2021/09/14/5-years-envoy-oss/

69. doxeddaily ◴[] No.36217526[source]
This scratches my itch for no dependencies.
70. boywitharupee ◴[] No.36217540[source]
is graph execution used for training only or inference also?
replies(1): >>36217851 #
71. SparkyMcUnicorn ◴[] No.36217604[source]
Maybe I'm wrong, but I don't think you want it fine-tuned on your data.

Pretty sure you might be looking for this: https://github.com/SamurAIGPT/privateGPT

Fine-tuning is good for teaching it how to act, but not great for reciting/recalling data.

replies(4): >>36219307 #>>36220595 #>>36226771 #>>36241658 #
72. smiley1437 ◴[] No.36217701{3}[source]
>> run LLaMA 30B at almost 3 tokens/s

Please tell me your config! I have an i9-10900 with 32GB of ram that only gets .7 tokens/s on a 30B model

replies(3): >>36217877 #>>36217992 #>>36219745 #
73. jgrahamc ◴[] No.36217727{3}[source]
This is a good plan.
74. isoprophlex ◴[] No.36217827{5}[source]
If you go through the drudgery of integrating with all the existing channels (mail, Teams, discord, slack, traditional social media, texts, ...), such rapid finetuning speeds could enable an always up to date personality construct, modeled on you.

Which is my personal holy grail towards making myself unnecessary; it'd be amazing to be doing some light gardening while the bot handles my coworkers ;)

replies(2): >>36217987 #>>36221420 #
75. make3 ◴[] No.36217840[source]
Does TBB work with Apple Silicon?
replies(1): >>36217968 #
76. ignoramous ◴[] No.36217847[source]
Can LLaMA be used for commercial purposes though (might limit external contributors)? I believe FOSS alternatives like Databricks Dolly / Together RedPajama / EleutherAI GPT-NeoX (et al) are where the most progress is likely to be.
replies(5): >>36217910 #>>36218688 #>>36219223 #>>36219290 #>>36219343 #
77. LoganDark ◴[] No.36217851{3}[source]
Inference. It's a big bottleneck for RWKV.cpp, second only to the matrix multiplies.
78. LoganDark ◴[] No.36217877{4}[source]
> Please tell me your config! I have an i9-10900 with 32GB of ram that only gets .7 tokens/s on a 30B model

Have you quantized it?

replies(1): >>36218570 #
79. pawelduda ◴[] No.36217891[source]
I happen to have RPi 4B with HomeAssistant. Is this something I could set up on it and integrate with HA to control it with speech, or is it overkill?
replies(2): >>36218115 #>>36221504 #
80. samwillis ◴[] No.36217910{3}[source]
Although llama.cpp started with the LLaMA model, it now supports many others.
81. yukIttEft ◴[] No.36217968{3}[source]
I guess https://formulae.brew.sh/formula/tbb
82. ◴[] No.36217987{6}[source]
83. oceanplexian ◴[] No.36217992{4}[source]
With a single NVIDIA 3090 and the fastest inference branch of GPTQ-for-LLAMA https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per second on the 30B models. IMO GGML is great (And I totally use it) but it's still not as fast as running the models on GPU for now.
replies(2): >>36219157 #>>36219874 #
84. gliptic ◴[] No.36218026{5}[source]
I cannot find any such numbers in the paper. What the paper says is that MeZO converges much slower than SGD, and each step needs two forward passes.

"As a limitation, MeZO takes many steps in order to achieve strong performance."

85. wmf ◴[] No.36218052{5}[source]
And what do you do when every contributor to the project, including the founder, has been labeled a troublemaker?
replies(2): >>36218229 #>>36225614 #
86. replygirl ◴[] No.36218112{5}[source]
Pedantic but that's not irony
replies(1): >>36220087 #
87. boppo1 ◴[] No.36218115[source]
I doubt it. I'm running 4-bit 30B and 65B models with 64GB ram, a 4080 and a 7900x. The 7B models are less demanding, but even so, You'll need more than an rpi. Even then, it would be a project to get these to control something. This is more 'first baby steps' toward the edge.
replies(1): >>36218738 #
88. huevosabio ◴[] No.36218127[source]
Very exciting!

Now we just need a post that benchmarks the different options (ggml, TVM, AITemplate, HippoML) and helps decide which route to take.

89. LoganDark ◴[] No.36218144{5}[source]
> and the furries

Um, what?

replies(1): >>36219391 #
90. Dwedit ◴[] No.36218145[source]
There was a big stink one time when the file format changed, causing older model files to become unusable on newer versions of llama.cpp.
91. coolspot ◴[] No.36218186{5}[source]
If we are talking about indexing, writing M$ is easier to find in an index because it is a such unique token. MS can mean many things (e.g. Miss), M$ is less ambiguous.
92. boppo1 ◴[] No.36218213{4}[source]
>justice

For an individual running a small open source project, there's time enough for coding or detailed justice, but not both. When two parties start pointing fingers and raising hell and it's not immediately clear who is in the right, ban both and let them fork it.

93. edfletcher_t137 ◴[] No.36218219[source]
This is a bang-up idea, you absolutely love to see capital investment on this type of open, commodity-hardware-focused foundational technology. Rock on GGMLers & thank you!
94. mhh__ ◴[] No.36218226[source]
It's not a very good library IMO.
replies(1): >>36224140 #
95. boppo1 ◴[] No.36218229{6}[source]
Pick the fork that has devs who are focused on contributing code and not pursuing drama.
96. graycat ◴[] No.36218289[source]
WOW! They are using BFGS! Haven't heard of that in decades! Had to think a little: Yup, the full name is Broyden–Fletcher–Goldfarb–Shanno for iterative unconstrained non-linear optimization!

Some of the earlier descriptions of the optimization used in AI learning were about steepest descent, that is, just find the gradient of the function you are trying to minimize and move some distance in that direction. Using only the gradient is concerning since that method tends to zig-zag: after, say, 100 iterations, the distance moved across those 100 iterations might be several times farther than the distance from the starting point to the final one. You can visualize this zig-zag already in just two dimensions, say, following a river that curves down a valley the river cut over a million years or so, that is, a valley with steep sides. Then gradient descent may keep crossing the river and go maybe 10 feet for each foot downstream!

Right, if just trying to go downhill on a tilted flat plane, then the gradient will point in the direction of steepest descent on the plane and gradient descent will go all the way downhill in just one iteration.

In even moderately challenging problems, BFGS can be a big improvement.
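
For reference, the standard textbook BFGS update of the approximate Hessian, with s_k the step taken and y_k the change in the gradient, is:

    s_k = x_{k+1} - x_k,    y_k = \nabla f(x_{k+1}) - \nabla f(x_k)

    B_{k+1} = B_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}

so each iteration folds in curvature information instead of relying on the raw gradient direction alone.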

97. smoldesu ◴[] No.36218362{4}[source]
Yeah, I think it feigns meaningful criticism. The "Sleepy Joe"-tier insults are ad-hominem enough that I don't try to respond.
98. jstarfish ◴[] No.36218398{4}[source]
> More details needed to know if this is good stewardship (ejecting two toxic individuals) or laziness (ejecting a villain and a hero to get rid of the "problem" easily). TikTok was using this method for a while by ejecting both bullies and victims,

This is SOP for American schools. It's laziness there, since education is supposed to be compulsory. They can't be bothered to investigate (and with today's hostile climate, I don't blame them) so they consign both parties to independent-study programs.

For volunteer projects, throwing both overboard is unfortunate but necessary stewardship. The drama destabilizes the entire project, which only exists as long as it remains fun for the maintainer. It's tragic, but victims who can't recover gracefully are as toxic as their abusers.

replies(1): >>36221044 #
99. infamouscow ◴[] No.36218505{3}[source]
The code is MIT licensed. If you don't agree with the direction the project is taking you can fork it and add whatever you want.

I don't understand why this is so difficult for software developers with GitHub accounts to understand.

replies(1): >>36218771 #
100. xiphias2 ◴[] No.36218539[source]
They are competing (although they are very different, tinygrad is full stack Python, ggml is focusing on a few very important models), but in my opinion George Hotz lost focus a bit by not working more on getting the low level optimizations perfect.
replies(1): >>36219975 #
101. jart ◴[] No.36218569{5}[source]
We're basically surviving off the scraps companies like Facebook have been tossing off the table, like LLaMA. The fact that we're even allowed and able to use these things ourselves, at all, is a tremendous victory.
replies(1): >>36218687 #
102. smiley1437 ◴[] No.36218570{5}[source]
The model I have is q4_0 I think that's 4 bit quantized

I'm running in Windows using koboldcpp, maybe it's faster in Linux?

replies(2): >>36219174 #>>36219792 #
103. KronisLV ◴[] No.36218585[source]
Just today I finished a blog post (also my latest submission; felt like it could be useful to some) about how to get something like this working as a bundle of something to run models plus a web UI for easier interaction - in my case that was koboldcpp, which can run GGML models both on the CPU (with OpenBLAS) and on the GPU (with CLBlast). Thanks to Hugging Face, getting Metharme, WizardLM or other models is also extremely easy, and the 4-bit quantized ones provide decent performance even on commodity hardware!

I tested it out both locally (6c/12t CPU) and on a Hetzner CPX41 instance (8 AMD cores, 16 GB of RAM, no GPU), the latter of which costs about 25 EUR per month and still can generate decent responses in less than half a minute, my local machine needing approx. double that time. While not quite as good as one might expect (decent response times mean maxing out CPU for the single request, if you don't have a compatible GPU with enough VRAM), the technology is definitely at a point where it's possible for it to make people's lives easier in select use cases with some supervision (e.g. customer support).

What an interesting time to be alive, I wonder where we'll be in a decade.

replies(4): >>36218767 #>>36218947 #>>36219214 #>>36220027 #
104. jart ◴[] No.36218615{3}[source]
Whenever a community project goes commercial, its interests are usually no longer aligned with the community. For example, llama.cpp makes frequent backwards-incompatible changes to its file format. I maintain a fork of ggml in the cosmopolitan monorepo which maintains support for old file formats. You can build and use it as follows:

    git clone https://github.com/jart/cosmopolitan
    cd cosmopolitan

    # cross-compile on x86-64-linux for x86-64 linux+windows+macos+freebsd+openbsd+netbsd
    make -j8 o//third_party/ggml/llama.com
    o//third_party/ggml/llama.com --help

    # cross-compile on x86-64-linux for aarch64-linux
    make -j8 m=aarch64 o/aarch64/third_party/ggml/llama.com
    # note: creates .elf file that runs on RasPi, etc.

    # compile loader shim to run on arm64 macos
    cc -o ape ape/ape-m1.c   # use xcode
    ./ape ./llama.com --help # use elf aarch64 binary above
It goes the same speed as upstream for CPU inference. This is useful if you can't/won't recreate your weights files, or want to download old GGML weights off HuggingFace, since llama.com has support for every generation of the ggjt file format.
replies(1): >>36218893 #
105. maxilevi ◴[] No.36218687{6}[source]
I agree
106. okhuman ◴[] No.36218688{3}[source]
This is a very good question; it will be interesting to see how this develops. Thanks for posting the alternatives list.
107. make3 ◴[] No.36218722[source]
it's the library used for tensor operations inside of llama.cpp, yes
108. pawelduda ◴[] No.36218738{3}[source]
The article shows an example running on an RPi that recognizes colour names. I could just come up with keywords that would invoke certain commands and feed them to HA, which would match them to an automation (e.g. "turn off kitchen", or just "kitchen"). I think a PoC is doable, but I'm aware I could run into limitations quickly. Idk, might give it a try when I'm bored.

Would love a voice assistant running locally, but there are probably solutions out there - didn't get to do the research yet.

replies(1): >>36222143 #
109. sva_ ◴[] No.36218756[source]
Really impressive work and I've asked this before, but is it really a good thing to have basically the whole library in a single 16k line file?
replies(3): >>36219310 #>>36220176 #>>36227321 #
110. SparkyMcUnicorn ◴[] No.36218767[source]
Seems like serverless is the way to go for fast output while remaining inexpensive.

e.g.

https://replicate.com/stability-ai/stablelm-tuned-alpha-7b

https://github.com/runpod/serverless-workers/tree/main/worke...

https://modal.com/docs/guide/ex/falcon_gptq

replies(2): >>36219155 #>>36227291 #
111. PrimeMcFly ◴[] No.36218771{4}[source]
You've missed the point here more than I've seen anyone miss the point in a long time.
replies(1): >>36219277 #
112. sp332 ◴[] No.36218841{5}[source]
It's the same memory footprint as inference. It's not that fast, and the paper mentions some optimizations that could still be done.
replies(1): >>36220688 #
113. binarymax ◴[] No.36218877{4}[source]
Compute is the latency for LLMs :)

And in general, your inference code will be compiled to a CPU/Architecture target - so you can know ahead of time what instructions you'll have access to when writing your code for that target.

For example, in the case of AWS Lambda you can choose graviton2 (ARM with NEON) or x86_64 (AVX). The trick is that some processors, such as Xeon3+, have AVX-512, while on others you will top out at 256-bit AVX. You might be able to figure out what exact instruction set your serverless target supports.

114. ljlolel ◴[] No.36218897{5}[source]
Don’t forget chip shortages. That’s all centralized up through Nvidia, TSMC, and ASML
115. b33j0r ◴[] No.36218947[source]
I wish everyone in tech had your perspective. That is what I see, as well.

There is a lull right now between gpt4 and gpt5 (literally and metaphorically). Consumer models are plateauing around 40B for a barely-reasonable RTX 3090 (ggml made this possible).

Now is the time to launch your ideas, all!

116. Miraste ◴[] No.36218979{4}[source]
M$ is a silly way to call Microsoft greedy. ClosedAI is somewhat better because OpenAI's very name is a bald-faced lie, and they should be called on it. Are there more elegant ways to do that? Sure, but every time I see Altman in the news crying crocodile tears about the "dangers" of open anything I think we need all the forms of opposition we can find.
replies(1): >>36220202 #
117. boringuser2 ◴[] No.36218992[source]
Looking at the source of this kind of underlines the difference between machine learning scientist types and actual computer scientists.
replies(1): >>36227334 #
118. ignoramous ◴[] No.36219154{3}[source]
> The problem is, this financial backing and support is via VCs, who will steer the project to close it all up again.

A matter of when, not if. I mean, the website itself makes that much clear:

  The ggml way
  
    ...
  
    Open Core

    The library and related projects are freely available under the MIT license... In the future we may choose to develop extensions that are licensed for commercial use
  
    Explore and have fun!

    ... Contributors are encouraged to try crazy ideas, build wild demos, and push the edge of what's possible

So, like many other "open core" devtools out there, they'd like to have their cake and eat it too. And they might just as well, like others before them.

Won't blame anyone here though; because clearly, if you're as good as Georgi Gerganov, why do it for free?

replies(1): >>36223453 #
119. tikkun ◴[] No.36219155{3}[source]
I think that's true if you're doing minimal usage / low utilization, otherwise a dedicated instance will be cheaper.
replies(1): >>36220820 #
120. LoganDark ◴[] No.36219157{5}[source]
> IMO GGML is great (And I totally use it) but it's still not as fast as running the models on GPU for now.

I think it was originally designed to be easily embeddable—and most importantly, native code (i.e. not Python)—rather than competitive with GPUs.

I think it's just starting to get into GPU support now, but carefully.

121. LoganDark ◴[] No.36219174{6}[source]
> The model I have is q4_0 I think that's 4 bit quantized

That's correct, yeah. Q4_0 should be the smallest and fastest quantized model.

> I'm running in Windows using koboldcpp, maybe it's faster in Linux?

Possibly. You could try using WSL to test—I think both WSL1 and WSL2 are faster than Windows (but WSL1 should be faster than WSL2).

replies(1): >>36220358 #
122. iamflimflam1 ◴[] No.36219204[source]
I've always thought of the edge as IoT-type stuff, i.e. running on embedded devices. But maybe that's not the case?
replies(4): >>36219239 #>>36219263 #>>36224934 #>>36227647 #
123. digitallyfree ◴[] No.36219214[source]
The fact that this is commodity hardware makes ggml extremely impressive and puts the tech in the hands of everyone. I recently reported my experience running 7B llama.cpp on a 15 year old Core 2 Quad [1] - when that machine came out it was a completely different world and I certainly never imagined what AI would look like today. This was around when the first iPhone was released and everyone began talking about how smartphones would become the next big thing. We saw what happened 15 years later...

Today with the new k-quants users are reporting that 30B models are working with 2-bit quantization on 16GB CPUs and GPUs [2]. That's enabling access to millions of consumers and the optimizations will only improve from there.

[1] https://old.reddit.com/r/LocalLLaMA/comments/13q6hu8/7b_perf...

[2] https://github.com/ggerganov/llama.cpp/pull/1684, https://old.reddit.com/r/LocalLLaMA/comments/141bdll/moneros...

124. detrites ◴[] No.36219223{3}[source]
May also be worth mentioning - UAE's Falcon, which apparently performs well (leads?). Falcon recently had its royalty-based commercial license modified to be fully open for free private and commercial use, via Apache 2.0: https://falconllm.tii.ae/
replies(1): >>36226198 #
125. Y_Y ◴[] No.36219239[source]
Like any new term, the (mis)usage broadens the meaning over time until either it's widely known, it's unfashionable, or, most likely, it becomes so broad as to be meaningless and hence achieves buzzword apotheosis.

My old job title had "edge" in it, and I still don't know what it's supposed to mean, although "not cloud" is a good approximation.

replies(1): >>36219335 #
126. timerol ◴[] No.36219263[source]
"Edge computing" is a pretty vague term, and can encompass anything from a 8MHz ARM core that can barely talk compliant BLE, all the way to a multi-thousand dollar setup on something like a self-checkout machine, which may have more compute available than your average laptop. In that range are home assistants, which normally have some basic ML for wake word detection, and then send the next bit of audio to the cloud with a more advanced model for full speech-to-text (and response)
127. chaxor ◴[] No.36219290{3}[source]
Why is commercial necessary to run local models?
replies(1): >>36219403 #
128. dr_dshiv ◴[] No.36219307{3}[source]
How does this work?
replies(2): >>36219423 #>>36220553 #
129. regularfry ◴[] No.36219310[source]
It makes syncing between llama.cpp, whisper.cpp, and ggml itself quite straightforward.

I think the lesson here is that this setup has enabled some very high-speed project evolution or, at least, not got in its way. If that is surprising and you were expecting downsides, a) why; and b) where did they go?

replies(1): >>36221083 #
130. b33j0r ◴[] No.36219335{3}[source]
Sounds like your job had a lot of velocity with lateral tragmorphicity in Q1, just in time for staff engineer optimization!

Nicely done. Here is ~$50 worth of stock.

131. digitallyfree ◴[] No.36219343{3}[source]
OpenLLaMA will be released soon and it's 100% compatible with the original LLaMA.

https://github.com/openlm-research/open_llama

132. camdenlock ◴[] No.36219391{6}[source]
If you know, you know
133. az226 ◴[] No.36219393{3}[source]
Have you thought about what your path looks like to get to the next phase? Are you taking on any more investors pre-seed?
134. ignoramous ◴[] No.36219403{4}[source]
It isn't, but such models may eventually lag behind the FOSS ones.
135. deet ◴[] No.36219423{4}[source]
The parent is saying that "fine tuning", which has a specific meaning related to actually retraining the model itself (or layers at its surface) on a specialized set of data, is not what the GP is actually looking for.

An alternative method is to index content in a database and then insert contextual hints into the LLM's prompt that give it extra information and detail with which to respond with an answer on-the-fly.

That database can use semantic similarity (ie via a vector database), keyword search, or other ranking methods to decide what context to inject into the prompt.

PrivateGPT is doing this method: reading files, extracting their content, splitting the documents into small-enough-to-fit-into-prompt bits, and then indexing into a database. Then, at query time, it inserts context into the LLM prompt.

The repo uses LangChain as boilerplate but it's pretty easy to do manually or with other frameworks.
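
As a hedged illustration of doing it manually (nothing privateGPT- or LangChain-specific; it assumes you already have an embedding vector for the query and for each chunk from whatever local embedding model you use):

    // Sketch of the "retrieve, then stuff the prompt" approach described above.
    #include <algorithm>
    #include <cmath>
    #include <string>
    #include <vector>

    struct Chunk { std::vector<float> emb; std::string text; };

    static float cosine(const std::vector<float> & a, const std::vector<float> & b) {
        float dot = 0, na = 0, nb = 0;
        for (size_t i = 0; i < a.size(); ++i) { dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i]; }
        return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-8f);
    }

    std::string build_prompt(const std::vector<Chunk> & store,
                             const std::vector<float> & query_emb,
                             const std::string & question, size_t k) {
        std::vector<std::pair<float, const Chunk *>> scored;
        for (const auto & c : store) scored.push_back({cosine(query_emb, c.emb), &c});
        k = std::min(k, scored.size());
        std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                          [](const auto & x, const auto & y) { return x.first > y.first; });
        std::string prompt = "Answer using only the context below.\n\nContext:\n";
        for (size_t i = 0; i < k; ++i) prompt += "- " + scored[i].second->text + "\n";
        return prompt + "\nQuestion: " + question + "\nAnswer:"; // feed this to the local LLM
    }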

(PS if anyone wants this type of local LLM + document Q/A and agents, it's something I'm working on as supported product integrated into macOS, and using ggml; see profile)

136. ignoramous ◴[] No.36219452{4}[source]
(a novice here who knows a couple of fancy terms)

> ...lately we have been augmenting it with GPU support.

Would you say you'd then be building an equivalent to Google's JAX?

Someone even asked if anyone would build a C++ to JAX transpiler [0]... I am wondering if that's something you may implement? Thanks.

[0] https://news.ycombinator.com/item?id=35475675

137. s1k3s ◴[] No.36219584[source]
I'm out of the loop on this entire thing so call me an idiot if I get it wrong. Isn't this whole movement based on a model leak from Meta? Aren't licenses involved that prevent it from going commercial?
replies(3): >>36219616 #>>36220025 #>>36221012 #
138. dimfeld ◴[] No.36219616[source]
Only the weights themselves. There have been other models since then built on the same LLaMA architecture, but trained from scratch so they're safe for commercial use. The GGML code and related projects (llama.cpp and so on) also support some other model types now, such as Mosaic's MPT series.
139. beardog ◴[] No.36219661{3}[source]
>ggml.ai is a company founded by Georgi Gerganov to support the development of ggml. Nat Friedman and Daniel Gross provided the pre-seed funding.

Did you give them a different answer? It is okay if you can't or don't want to share, but I doubt the company is only planning to have fun. Regardless, best of luck to you and thank you for your efforts so far.

140. jart ◴[] No.36219666{3}[source]
Gerganov was prioritizing collaboration with 4chan who raided his GitHub to demand a change written by a transgender woman be reverted. There was so much hate speech and immaturity thrown around (words like tranny troon cucking muh model) that it's a real embarrassment (to those of us who deeply want to see local models succeed) that one of the smartest guys working on the problem was taken in by all that. You can't run a collaborative environment that's open when you pander to hate, because hate subverts communities; it's impossible to compromise with anonymous trolls who harass a public figure over physical traits about her body she can't change.

You don't have to take my word on it. Here are some archives of the 4chan threads where they coordinated the raid. It went on for like a month. https://archive.is/EX7Fq https://archive.is/enjpf https://archive.is/Kbjtt https://archive.is/HGwZm https://archive.is/pijMv https://archive.is/M7hLJ https://archive.is/4UxKP https://archive.is/IB9bv https://archive.is/p6Q2q https://archive.is/phCGN https://archive.is/M6AF1 https://archive.is/mXoBs https://archive.is/68Ayg https://archive.is/DamPp https://archive.is/DiQC2 https://archive.is/DeX8Z https://archive.is/gStQ1

If you read these threads and see how nasty these little monsters are, you can probably imagine how Gerganov must have felt. He was probably scared they'd harass him too, since 4chan acts like he's their boy. I wouldn't even be surprised if he's one of them. Plus it was weak leadership on his part to disappear for days, suddenly show up again to neutral knight the situation (https://justine.lol/neutral-knight.png) by telling his team members they're no longer welcome, and then going back and deleting his comment later. It goes to show that no matter how brilliant you are at hard technical skills, you can still be totally clueless about people.

replies(6): >>36220053 #>>36220243 #>>36220672 #>>36220738 #>>36226346 #>>36245314 #
141. brucethemoose2 ◴[] No.36219745{4}[source]
I'm on a Ryzen 4900HS laptop with an RTX 2060.

Like I said, very modest

replies(1): >>36220337 #
142. brucethemoose2 ◴[] No.36219792{6}[source]
I am running linux with cublast offload, and I am using the new 3 bit quant that was just pulled in a day or two ago.
replies(2): >>36220323 #>>36222560 #
143. brucethemoose2 ◴[] No.36219874{5}[source]
Have you tried the most recent cuda offload? A dev claims they are getting 26.2ms/token (38 tokens per second) on 13B with a 4080.
144. zkmlintern ◴[] No.36219917[source]
ZKML fixes this
145. georgehotz ◴[] No.36219975{3}[source]
Which low level optimizations specifically are you referring to?

I'm happy with most of the abstractions. We are pushing to assembly codegen. And if you meant things like matrix accelerators, that's my next priority.

We are taking more of a breadth-first approach. I think ggml is more depth-first and application focused. (And I think Mojo is even more breadth-first.)

replies(1): >>36222732 #
146. detrites ◴[] No.36220025[source]
GGML is essentially a library of lego pieces that can be put together to work with many LLM or other types of ML models.

Meta's leaked model is one to which GGML has been applied for fast, local inference.

147. c_o_n_v_e_x ◴[] No.36220027[source]
What do you mean by commodity hardware? Single server single CPU socket x86/ARM boxes? Anything that does not have a GPU?
replies(2): >>36220104 #>>36222947 #
148. killthebuddha ◴[] No.36220053{4}[source]
I didn't want to not reply but I also didn't want to be swept into a potentially fraught internet argument. So, I tried to edit my comment as a middle ground, but it looks like I can't, I guess there must be a timeout. If I could edit it, I'd add the following:

"I should point out that I wasn't personally involved, haven't looked into it in detail, and that there are many different perspectives that should be considered."

149. rafark ◴[] No.36220087{6}[source]
Why do you think so? According to the dictionary, ironic could be something paradoxical or weird.
replies(1): >>36220713 #
150. ◴[] No.36220104{3}[source]
151. FailMore ◴[] No.36220150[source]
Remember
152. FailMore ◴[] No.36220152[source]
Commenting to remember. Looks good
153. CamperBob2 ◴[] No.36220176[source]
Yes. Next question
154. tanseydavid ◴[] No.36220202{5}[source]
It is a colloquial spelling and they earned it, a long time ago.
155. zo1 ◴[] No.36220243{4}[source]
Really curious why you tried to rename the file format magic string to have your initials? Going from GGML (see Title of this post) to GGJT with JT being Justine Tunney? Seems quite unnecessary and bound to have rubbed a lot of people the wrong way.

Here is the official commit undoing the change:

https://github.com/ggerganov/llama.cpp/pull/711/files#diff-7...

replies(2): >>36220508 #>>36220715 #
156. smiley1437 ◴[] No.36220323{7}[source]
Thanks! I'll have to try the 3bit to see if that helps
157. smiley1437 ◴[] No.36220337{5}[source]
Are you offloading layers to the RTX2060?
replies(1): >>36221349 #
158. smiley1437 ◴[] No.36220358{7}[source]
I didn't know what WSL was, but now I do, thanks for the tip!
159. ex3ndr ◴[] No.36220486[source]
so sad we still don't have a perfect neural network for activation word to make home assistants complete
160. jart ◴[] No.36220508{5}[source]
Because the previous changes to the file format were done by changing the last two initials of the magic. Someone commented on the pull request suggesting using a version field instead, which hadn't been documented, but Gerganov was so happy with the mmap() contribution that he asked me to keep the initials. So you should ask him why he wanted my initials to be there. I didn't see anyone else raise concerns until later on when the 4chan raid happened. I guess I failed to consider that folks who hate trans women would feel uncomfortable needing to mark their files with the initials of one. Here's the pull request: https://github.com/ggerganov/llama.cpp/pull/613
161. SparkyMcUnicorn ◴[] No.36220553{4}[source]
deet already gave a comprehensive answer, but I'll add that the guts of privateGPT are pretty readable and only ~200 lines of code.

Core pieces: GPT4All (LLM interface/bindings), Chroma (vector store), HuggingFaceEmbeddings (for embeddings), and Langchain to tie everything together.

https://github.com/SamurAIGPT/privateGPT/blob/main/server/pr...

162. ◴[] No.36220595{3}[source]
163. version_five ◴[] No.36220633[source]
So I'll say it: I understand why someone would do it, I'm sure they backed up a money truck, but it sucks to see this sell out. VC is going to suck all the value out and leave something that exists to funnel money to them. This has been an awesome project. I hope somebody forks it and maintains a version that isn't profit motivated.
replies(2): >>36220908 #>>36222446 #
164. nl ◴[] No.36220688{6}[source]
Yes you are right.

I completely misread that!

165. nl ◴[] No.36220713{7}[source]
Well it's not paradoxical?

If one is the kind of person who writes M$ then it's pretty much expected behaviour.

166. _20p0 ◴[] No.36220715{5}[source]
Strange comment. This doesn't sound like a legitimate criticism, because "Here is the official commit undoing the change:" is not a link to a commit to start with, secondly it is a declined pull request, and thirdly `master` in reality writes the `ggjt` header.

Really looks like some axe-grinding here, if I'm being honest. Especially because it takes very little effort to find out what the present header is by someone who can write software.

replies(1): >>36223367 #
167. ggmlanocoward ◴[] No.36220738{4}[source]
I get that a hateful mob jumped all over this widely-publicized PR and that's really, really not ok, but it doesn't make you automatically in the right. Sometimes our egos get the better of us, mistakes are made. Subsequently the only choice you have is between being someone who escalates drama and someone who defuses it. I promise you that being the latter is the better choice, even if it doesn't come with the ego-boosting joy of being "right". The person who can rise above it all is the one who ends up winning respect in the long-run, but it requires acknowledging one's own fallibility in the short-term.
replies(1): >>36237712 #
168. jart ◴[] No.36220764{5}[source]
Are you talking about Slaren? I wrote a blog post promoting him a while back. You can read it here: https://justine.lol/mmap/ It also spotlights other unsung heroes who played important roles in helping me bring mmap() to the machine learning community. As for me being trans, the problem is that 4chan does care. The /g/ local model discussion board was consumed with discussing my gender status for more than a month. Stirring up hate is how they rally anons to venture out on raids which have damaged my professional relationships.
replies(1): >>36224289 #
169. menzoic ◴[] No.36220820{4}[source]
You are correct. The pricing model guarantees this. Pay per compute vs pay for uptime (during which you could have more compute for cheaper)
170. okhuman ◴[] No.36220908[source]
I share the same sentiment; it's not you alone saying these things. It's starting to seem the llama.cpp project wasn't so community oriented to begin with - which in itself takes a lot of patience.
replies(1): >>36222027 #
171. ac29 ◴[] No.36221012[source]
It wasn't a leak, LLaMa was released publicly under an open-ish license (the code is GPL, the model weights require registration and prohibit commercial use).
172. mrtranscendence ◴[] No.36221044{5}[source]
Right, I for one would also prefer it if people who face harassment and hate would just accept it gracefully and move on. I mean, get over yourself, amiright?
173. mhh__ ◴[] No.36221083{3}[source]
https://git-scm.com/book/en/v2/Git-Tools-Submodules
replies(1): >>36224577 #
174. brucethemoose2 ◴[] No.36221349{6}[source]
Some of them, yeah. 17 layers iirc.
175. vgb2k18 ◴[] No.36221420{6}[source]
> while the bot handles my coworkers

Or it handles their bots ;)

176. addandsubtract ◴[] No.36221504[source]
Home Assistant has been working on its own speech recognition solution. They're calling 2023 "The year of the voice": https://www.home-assistant.io/blog/2023/04/27/year-of-the-vo...
177. jart ◴[] No.36221946{6}[source]
Just because someone can't wring your hand doesn't mean they don't deserve to influence a project. Releasing work under a license like MIT is an act of benevolence since it's giving up leverage and relying on de facto leadership. It's the hardest kind of leader to be, since de facto leaders are only followed on merit. It's not unusual at all for folks who started projects to end up caving to political pressures and getting replaced by the sorts of people you're talking about. But every now and again a Medici comes along.
178. shostack ◴[] No.36221973[source]
I've been trying to figure out what I might need to do in order to turn my Obsidian vault into a dataset to fine-tune against. I'd invest a lot more into it now if I thought it would be a key to an AI learning about me the way it does in the movie Her.
replies(2): >>36222384 #>>36384485 #
179. swyx ◴[] No.36222027{3}[source]
> It's starting to seem the llama.cpp project wasn't so community oriented to begin with

what do you mean? llama.cpp has 181 contributors.

180. kkielhofner ◴[] No.36222143{4}[source]
Shameless plug, I'm the founder of Willow[0].

In short you can:

1) Run a local Willow Inference Server[1]. Supports CPU or CUDA, just about the fastest implementation of Whisper out there for "real time" speech.

2) Run local command detection on device. We pull your Home Assistant entities on setup and define basic grammar for them but any English commands (up to 400) are supported. They are recognized directly on the $50 ESP BOX device and sent to Home Assistant (or openHAB, or a REST endpoint, etc) for processing.

Whether WIS or local our performance target is 500ms from end of speech to command executed.

[0] - https://github.com/toverainc/willow

[1] - https://github.com/toverainc/willow-inference-server

replies(1): >>36239036 #
181. Tan-Aki ◴[] No.36222313[source]
Can anyone explain to me, in simple terms and at a high level, what the heck I am looking at? What is this library for? What does it mean that "it is used by llama.cpp and whisper.cpp"? How is it revolutionary? Thank you very much in advance!
replies(3): >>36222346 #>>36222597 #>>36225600 #
182. Tan-Aki ◴[] No.36222346[source]
With maybe a tiny bit of history as well? Pretty please? : p
183. 58x14 ◴[] No.36222384{3}[source]
I've been working on this for awhile now and I'd love to chat. I'll send you an email.
replies(1): >>36222595 #
184. fallous ◴[] No.36222446[source]
I don't mind a project that is profit motivated since it incentivizes the people working on the project as well as drives them to create solutions that solve customer problems. I do, however, have a problem with VC backing due to their frequent insistence on pushing for short-term maximal share pricing rather than long-term value. Modern VC also seem to suffer from a fatal mixture of hubris and narcissism, certain of their genius and business acumen despite all evidence to the contrary.

When advising start-ups who are taking on VC funding and the inevitable "business guidance" they push on founders I like to remind them that even the best investors are only good at investing. If they were good at business, they wouldn't bother with investing in other companies but instead start companies themselves since they would own 100% and maximize their returns.

185. LoganDark ◴[] No.36222560{7}[source]
cuBLAS or CLBlast? There is no such thing as cublast
186. legendofbrando ◴[] No.36222595{4}[source]
I'm interested in this as well and have been exploring similarly. Would be super interesting to chat if you're up for it as well. Sending you an email to say hello.
187. orost ◴[] No.36222597[source]
ggml is a library that provides operations for running machine learning models

llama.cpp is a project that uses ggml to run LLaMA, a large language model (like GPT) by Meta

whisper.cpp is a project that uses ggml to run Whisper, a speech recognition model by OpenAI

ggml's distinguishing feature is efficient operation on CPU. Traditionally, this sort of work is done on GPU, but GPUs with large amounts of memory are specialized and extremely expensive hardware. ggml achieves acceptable speed on commodity hardware.

replies(1): >>36225610 #
188. xiphias2 ◴[] No.36222732{4}[source]
I'd love to see Tinygrad beat GGML at its own game (4-bit LLM support on the M1 Mac GPU or Tensor cores) before adding more backends/models.

It's easy to debug because the generated kernels can be compared to GGML's, and it still gives us something practical to play with.

At this point breadth-first is a bit boring, because we don't know how far tinygrad is from the optimal generated output.

replies(1): >>36235891 #
189. noman-land ◴[] No.36222742[source]
Georgi, if you're reading this: I've had a lot of fun with whisper.cpp and llama.cpp because of you, so thank you very much.
replies(1): >>36224059 #
190. legendofbrando ◴[] No.36222814[source]
Running whisper locally on my iPhone back in December and watching perfect transcriptions pop out without sending anything to a server was a real lightbulb moment for me that set in motion a bunch of the work I’m doing now. Excited to see the new heights this unlocks!
replies(1): >>36223623 #
191. kyt ◴[] No.36222819[source]
I used the GGML version of Whisper but had to revert to the PyTorch version released by OpenAI. The GGML version simply didn't work well, even with the same model. I assume it has to do with the quantization.
192. ankitg12 ◴[] No.36222913[source]
Quite impressive; I'm able to run an LLM on my local Mac:

    % ./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "Let's talk about Machine Learning now"
    main: seed = 1686112244
    gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model.bin'
    gpt2_model_load: n_vocab = 50257
    gpt2_model_load: n_ctx   = 1024
    gpt2_model_load: n_embd  = 768
    gpt2_model_load: n_head  = 12
    gpt2_model_load: n_layer = 12
    gpt2_model_load: ftype   = 1
    gpt2_model_load: qntvr   = 0
    gpt2_model_load: ggml tensor size = 224 bytes
    gpt2_model_load: ggml ctx size = 384.77 MB
    gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
    gpt2_model_load: model size  =   239.08 MB
    extract_tests_from_file : No test file found.
    test_gpt_tokenizer : 0 tests failed out of 0 tests.
    main: prompt: 'Let's talk about Machine Learning now'
    main: number of tokens in prompt = 7, first 8 tokens: 5756 338 1561 546 10850 18252 783

    Let's talk about Machine Learning now.

    The first step is to get a good understanding of what machine learning is. This is where things get messy. What do you think is the most difficult aspect of machine learning?

    Machine learning is the process of transforming data into an understanding of its contents and its operations. For example, in the following diagram, you can see that we use a machine learning approach to model an object.

    The object is a piece of a puzzle with many different components and some of the problems it solves will be difficult to solve for humans.

    What do you think of machine learning as?

    Machine learning is one of the most important, because it can help us understand how our data are structured. You can understand the structure of the data as the object is represented in its representation.

    What about data structures? How do you find out where a data structure or a structure is located in your data?

    In a lot of fields, you can think of structures as

    main: mem per token =  2008284 bytes
    main:     load time =   366.33 ms
    main:   sample time =    39.59 ms
    main:  predict time =  3448.31 ms / 16.74 ms per token
    main:    total time =  3894.15 ms
193. KronisLV ◴[] No.36222947{3}[source]
> What do you mean by commodity hardware?

In my case, my local workstation has a Ryzen 5 1600 desktop CPU from 2017 (first-generation Zen, 14 nm) and it still worked decently.

Of course, response times grow with longer inputs and outputs or larger models, but getting a response in less than a minute when running purely on the CPU is encouraging in and of itself.

194. dangrover ◴[] No.36223128[source]
There's so much potential for this as a technology powering all sorts of products; I hope it doesn't just become some Tarsnap-type thing (judging from the initial site).
replies(1): >>36223613 #
195. zo1 ◴[] No.36223367{6}[source]
I'm a dev and know how to use git. Honestly, that's just the PR I could find while catching up on this story; I was curious what this "magic string" was. The change is there and got reverted, 100%; see the other commenter, who was the one that made it. If there is a better link, you're welcome to post it.
196. ukuina ◴[] No.36223453{4}[source]
Sounds like the SQLite model, which has been a net positive for the computing world.
197. cperciva ◴[] No.36223613[source]
What's wrong with "Tarsnap type things"?
198. bugglebeetle ◴[] No.36223623[source]
What are you using to run Whisper locally?
replies(1): >>36236590 #
199. hanselot ◴[] No.36224059[source]
I envy his drive and ambition. I can't force myself to finish writing a simple alarm clock app for Android, never mind paving the literal road to the future of open source AI.

Would someone else have taken his place had he not been around? Maybe, but I'm insanely happy that he is around.

The amount of hours I've sunk into LLMs is crazy, and it's mostly thanks to his work that I can both download and run models in meaningful timeframes.

And yes, I have tested llama.cpp on my Android phone and it works 100% on Termux. (Your biggest enemy here will be the Android process reaper when you hit the memory cap.)

replies(1): >>36377964 #
200. dindresto ◴[] No.36224140{3}[source]
ggml or Intel TBB?
201. read_if_gay_ ◴[] No.36224289{6}[source]
>I wrote a blog post promoting him a while back.

That's about 2 weeks after the drama around PR 613, which you factually touted as "your work" in several different places.

202. PostOnce ◴[] No.36224448{3}[source]
"Good" is subjective, I guess.

Daniel Gross set up a company that seemed akin to indebted servitude, modern day slavery. They called it "Pioneer" and later changed the terms, I guess because of backlash.

They gave you a little bit of money to "do whatever you want" but owned a huge stake in anything you did in the future for a long period of time. They didn't advertise that part very heavily, they mostly portrayed it as "we're doing this because we're philanthropists" imo, rather than because they wanted to reinvent indentured servitude within the modern legal framework.

Why do I write these posts? Because I desperately want to believe we can get rich without doing dishonest, evil things. Maybe I'm wrong. Maybe that's why all these guys behave this way. Maybe it really is never enough.

203. regularfry ◴[] No.36224577{4}[source]
Or... `cp`. It's fine.
204. anentropic ◴[] No.36224934[source]
Here it means more 'on your own device' rather than 'in the cloud'.

You could consider that the real edge, whereas 'edge computing' often means 'at the edge of the cloud', i.e. a local CDN node.

205. statusfailed ◴[] No.36225580[source]
What kind of applications do you see for training on mobile devices? Is anyone using this in industry?
206. Tan-Aki ◴[] No.36225600[source]
Thank you so much for your kindness Orost. Sharing really IS caring. I understand.

May good things happen to you. Peace.

207. Tan-Aki ◴[] No.36225610{3}[source]
Thank you so much for your kindness Orost. Sharing really IS caring. I understand.

May good things happen to you. Peace.

208. csmpltn ◴[] No.36225614{6}[source]
I'm confused about the scenario you're describing here.

Look, my message is simple and clear: keep the politics and drama out of it. If you partake in politics and drama, you'll be ejected from the project. I don't have the time or the energy to police or play games with people. We're here to build things, not to partake in social activism or sling crap at each other over codes of conduct, pronouns, hair color or magic strings. If you're hurt - fork the project (as long as the license allows for it) and have fun playing somewhere else.

replies(1): >>36229180 #
209. mistercow ◴[] No.36226198{4}[source]
Hugging Face has a demo of the 40B Falcon instruct model: https://huggingface.co/blog/falcon#demo

It’s pretty good as models of that size go, although it doesn’t take a lot of playing around with it to find that there’s still a good distance between it and ChatGPT 3.5.

(I do recommend editing the instructions before playing with it though; telling a model this size that it “always tells the truth” just seems to make it overconfident and stubborn)

210. henry_viii ◴[] No.36226346{4}[source]
Just wanted to share I think you're a superstar engineer.

Also from the links you shared it looked like some users on 4chan decided to go out and harass you. If they didn't know you are a trans woman, I'm sure they would've defaulted to calling you a n***** f***** instead. But they were going to harass you nonetheless.

It was very sad to see how things developed over a small issue. I'm sure this could've gotten resolved civilly since I believe you and everyone else involved in the project had good intentions and were doing everything out of love.

211. gtirloni ◴[] No.36226771{3}[source]
> Fine-tuning is good for treating it how to act, but not great for reciting/recalling data.

What underlying process makes it this way? Is it because the prompt has heavier weight?

replies(2): >>36229475 #>>36242863 #
212. baobabKoodaa ◴[] No.36227291{3}[source]
I think cold start times will be excessive for serverless in this use case.
replies(1): >>36228971 #
213. baobabKoodaa ◴[] No.36227321[source]
I guess the "clean code" crowd would like to refactor this into hundreds of files that all call each other in an incomprehensible maze, plus pulling in 20GB of dependencies from the internet during install. Because that is the way™.
214. baobabKoodaa ◴[] No.36227334[source]
Can you elaborate?
replies(1): >>36234114 #
215. java_beyb ◴[] No.36227647[source]
Edge brings compute close to where the data is generated; cloud brings data to the compute.

Even processing something in a web browser gets called 'edge'. I guess it's because of this blurring that the industry is moving towards saying 'on-device'.

216. SparkyMcUnicorn ◴[] No.36228971{4}[source]
A 3-second cold start is good enough for me.
217. wmf ◴[] No.36229180{7}[source]
The scenario is a mob of trolls attacking a contributor in bad faith. If you kick out the contributor, what's to stop the mob from picking off someone else?
replies(1): >>36238847 #
218. SparkyMcUnicorn ◴[] No.36229475{4}[source]
I think your question is asking about the fundamentals of how an LLM works, which I'm not really qualified to answer. But I do have a general understanding of it all.

Fine-tuning is like having the model take a class on a certain subject. By the end of the class, it's going to have a general understanding of how to do that thing, but it's probably going to struggle when trying to quote the textbook verbatim.

A good use-case for fine-tuning is teaching it a response style or format. If you fine-tune a model to only respond in JSON, then you no longer need to include formatting instructions in your prompt to get a JSON output.

219. boringuser2 ◴[] No.36234114{3}[source]
https://github.com/ggerganov/ggml/blob/master/src/ggml-openc...
replies(1): >>36261707 #
220. Art9681 ◴[] No.36235891{5}[source]
I just deployed tinygrad thanks to this conversation, and I've played with just about every local LLM client and toolchain there is. I ran the examples as listed in the repo with absolutely zero problems; they just worked. I think their goal of prioritizing ease of use far outweighs any performance optimizations at this stage of the game. Nothing is stopping the team from integrating other projects if the performance delta is worth the pivot.

From what I see, the foundation is there for a great multimodal platform. Very excited to see where this goes.

221. ariym ◴[] No.36236590{3}[source]
whisper.cpp is optimized for Apple Silicon and is available as a Swift package:

https://github.com/ggerganov/whisper.spm

222. jeadie ◴[] No.36237271[source]
I'm very glad that this has some added funding. I am building a serverless API on the Cloudflare edge network using GGML as the backbone --> tryinfima.com
223. jart ◴[] No.36237712{5}[source]
Not until I'm made whole. I donated a lot of resources to the llama.cpp project. I volunteered and successfully contributed one of its most impactful features. I was rewarded with harassment and public humiliation by its leader, for no reason at all. They also reneged on promises they made me. I'm owed a lot more than an apology, but I haven't even received that.
replies(1): >>36245385 #
224. rvz ◴[] No.36237909{4}[source]
Never expect such promises to go your way, especially when VCs, angels, etc. are able to control the project through their opaque term sheets, which is why I am skeptical of this. Accepting VC or angel investment is no different from having another boss.

I expect high hopes like that to end in disappointment for the 'community', since the VCs' interest will now be to head for the exit. Their actions will speak louder than what they are saying on the website.

225. anaganisk ◴[] No.36238452{4}[source]
Genuinely curious, but why didn't you? Did the project not gain enough visibility before you noticed this post, or was there another reason that would be helpful for others to know?
replies(1): >>36243878 #
226. csmpltn ◴[] No.36238847{8}[source]
If a bunch of random strangers (external to the project) are "attacking" your project somewhere on the internet (for example, on Twitter) - just ignore them and move on with your day. They don't have any power over your project and their opinions don't matter. Go on with your life and continue building.

If a bunch of random strangers (external to the project) are messing with your tools and workflows (stirring things up in the issue tracker, creating drama and playing games with silly Pull Requests and comments) - lock down your tools such that they can only be used by trusted members of your team. Close down and remove all bullshit conversations without spending any further time or energy on any of it. Platforms like GitHub blur the lines between "a suite of productivity tools for software development" and "a social network" - so make sure to lock down and limit the "social networks" aspects whilst optimizing for the "software development productivity" aspect. Go on with your life and continue building.

If the "attacks" happens internally within the project (between two or more members of the team) - eject all parties involved because they're clearly not here to build stuff. Go on with your life and continue building.

Your goal should be to spend your energy on building and creating, and collaborating with like-minded people on building and creating. Not on policing, moderating, or playing games with people.

227. pawelduda ◴[] No.36239036{5}[source]
Thank you very much. I ran into Willow before during my brief research and liked it in general; this also sounds convincing.
228. torginus ◴[] No.36239156[source]
I wonder if Nat Friedman, who was the CEO of GitHub until recently, will work to tie this to the Microsoft/OpenAI LLM empire.

Or am I just being paranoid?

229. SkyPuncher ◴[] No.36241658{3}[source]
I think people want both. They want fine-tuning for their style of communication and interaction. They want better retrieval and ranking for rote information.

In other words, it's like having a spouse/partner: there are certain ways we communicate where we simply know where the other person is at or what they actually mean.

replies(1): >>36243366 #
230. dbyte ◴[] No.36242599[source]
Congrats
231. bluepoint ◴[] No.36242863{4}[source]
I just read the LoRA paper. The main idea is that you write each weight matrix of the network as

W = W0 + B A

where W0 is the trained model's weight matrix, which is kept fixed, and B and A are matrices with a much, much lower rank than the original (say r = 4).

It has been shown (as mentioned in the LoRA paper) that training for specific tasks results in low-rank corrections, which is what this is all about. I think LoRA training can be done locally.

[1] https://github.com/microsoft/LoRA
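
For intuition, here is a minimal, illustrative-only sketch in C of the LoRA forward pass y = W0*x + B*(A*x); the dimensions are made up, and it shows why only r*(d+k) parameters need to be trained instead of d*k:

    // Illustrative-only LoRA forward pass: y = W0*x + B*(A*x).
    // Dimensions are arbitrary toy values; with d = k = 4096 and r = 4,
    // the adapter has r*(d+k) = 32,768 trainable values vs. d*k ~= 16.8M
    // for a full fine-tune of that one matrix.
    #include <stdio.h>

    #define D 8   // output dim
    #define K 8   // input dim
    #define R 2   // LoRA rank (much smaller than D, K)

    // naive dense mat-vec: out[rows] = M[rows x cols] * v[cols]
    static void matvec(const float *M, const float *v, float *out, int rows, int cols) {
        for (int i = 0; i < rows; i++) {
            out[i] = 0.0f;
            for (int j = 0; j < cols; j++) {
                out[i] += M[i*cols + j] * v[j];
            }
        }
    }

    int main(void) {
        float W0[D*K], A[R*K], B[D*R], x[K];

        // Toy values: frozen base weights, small LoRA factors, simple input.
        for (int i = 0; i < D*K; i++) W0[i] = 0.01f * (float)(i % 7);
        for (int i = 0; i < R*K; i++) A[i]  = 0.1f;
        for (int i = 0; i < D*R; i++) B[i]  = 0.1f;
        for (int j = 0; j < K;   j++) x[j]  = 1.0f;

        float base[D], ax[R], delta[D];
        matvec(W0, x,  base,  D, K); // frozen path: W0 * x
        matvec(A,  x,  ax,    R, K); // low-rank path, step 1: A * x
        matvec(B,  ax, delta, D, R); // low-rank path, step 2: B * (A * x)

        for (int i = 0; i < D; i++) {
            printf("y[%d] = %.3f (base %.3f + lora %.3f)\n",
                   i, base[i] + delta[i], base[i], delta[i]);
        }
        return 0;
    }

(The paper also scales the B*A term by a constant alpha/r; that is omitted here for brevity.)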

232. SparkyMcUnicorn ◴[] No.36243366{4}[source]
Unless you want machine-readable responses or have some other very specific need, a fine-tuned model isn't really going to be that much better than a prompt that asks for the style you want along with an example or two. Fine-tuning also raises the barrier to entry quite a bit, since the majority of computers that can run a model aren't capable of training it.

Even if you're using OpenAI's models, gpt-3.5-turbo is going to be much better (cheaper, bigger context window, higher quality) than any of their models that can be fine-tuned.

But if you're able to fine-tune a local model, then a combination of fine-tuning and embedding is probably going to give you better results than embedding alone.

233. jgrahamc ◴[] No.36243878{5}[source]
Had no idea he wanted to make it a company.
replies(2): >>36245167 #>>36252892 #
234. throw74775 ◴[] No.36245167{6}[source]
That makes sense - it looked very much like a pure open source project.

I wonder if they came to him or if someone else facilitated it as opposed to it being his initiative.

235. IAmNotACellist ◴[] No.36245314{4}[source]
Liar. https://news.ycombinator.com/item?id=35455930#35458068

This user stole another user's code, closed his PR, and opened a new one where she started using words like "my work," "I'm the author," "author here," etc., and tried to cozy up to the project lead.

Gerganov figured out what was happening and actually banned her from all further contributions. The user whose code was stolen, Slaren, is still contributing.

replies(1): >>36251741 #
236. IAmNotACellist ◴[] No.36245385{6}[source]
You didn't write that feature. Slaren did. You closed his PR and made minor changes, then gradually shifted from "our feature" to "my feature."

----

That's not the original PR. jart was working on a malloc() approach that didn't work and slaren wrote all the code actually doing mmap, which jart then rebased in a random new PR, changed to support an unnecessary version change, magic numbers, a conversion tool, and WIN32 support when that was already working in the draft PR. https://archive.ph/Uva8c

This is the original PR: https://github.com/ggerganov/llama.cpp/pull/586.

Jart's archived comments:

"my changes"

"Here's how folks in the community have been reacting to my work."

"I just wrote a change that's going to let your LLaMA models load instantly..."

https://archive.ph/PyPFZ

"I'm the author"

https://archive.ph/qFrcY

"Author here..."

"Tragedy of the commons...We're talking to a group of people who live inside scientific papers and jupyer notebooks."

"My change helps inference go faster."

"The point of my change..."

"I stated my change offered a 2x improvement in memory usage."

https://archive.ph/k34V2

"I can only take credit for a 2x recrease in RAM usage."

https://archive.ph/MBPN0

"I just wrote a change that's going to let your LLaMA models load instantly, thanks to custom malloc() and the power of mmap()"

https://archive.ph/yrMwh

slaren replied to jart on HN asking her why she was doing and saying those things, and she didn't bother to reply to him, despite replying to others in that subthread within minutes. https://archive.ph/zCfiJ

----

You didn't make whole the people you damaged or the project you attempted to harm with plagiarism and pathological levels of manipulation and lying.

237. jart ◴[] No.36251741{5}[source]
Good artists copy and great artists steal.
replies(1): >>36252421 #
238. IAmNotACellist ◴[] No.36252421{6}[source]
>Good artists copy and great artists steal.

This user claims Gerganov publicly humiliated her, but she does it to herself.

239. anaganisk ◴[] No.36252892{6}[source]
Ah right, lol, I didn't think of that at all. But could this be a wider problem? Many open source projects that could've been "sponsored" by someone like you but ended up being commercialized by vested interests?
240. baobabKoodaa ◴[] No.36261707{4}[source]
Wow, what a passive aggressive response. So, I take it that you can't elaborate. Got it.
241. ffvvtvtbyh ◴[] No.36377964{3}[source]
Keen to work together. I also struggle with follow-through.
242. mydjtl ◴[] No.36384485{3}[source]
The holy grail.

https://github.com/brianpetro/obsidian-smart-connections

https://wfhbrian.com/introducing-smart-chat-transform-your-o...