343 points sillysaurusx | 43 comments
1. linearalgebra45 ◴[] No.35028638[source]
It's been long enough since this leaked, so my question is: why aren't there blog posts already from people blowing their $300 of starter credit with ${cloud_provider} on a few hours of experimentation running inference on this 65B model?

Edit: I read the linked README.

> I was impatient and curious to try to run 65B on an 8xA100 cluster

Well?

replies(2): >>35028936 #>>35030027 #
2. v64 ◴[] No.35028936[source]
The compute necessary to run 65B naively was only available on AWS (and perhaps Azure; I don't work with them), and the required instance types have been unavailable to the public recently (it seems everyone had the same idea to hop on this and try to run it). As mentioned in my other post here [1], the memory requirements have since been lowered through other work, and it should now be possible to run the 65B on a provider like CoreWeave.

[1] https://news.ycombinator.com/item?id=35028738

replies(2): >>35029106 #>>35029766 #
3. linearalgebra45 ◴[] No.35029106[source]
Are you sure about that? I can't remember where I saw the table of memory requirements, but some of the larger instances here [1] will surely be able to cope (assuming they're available!)

Oracle gives you a $300 free trial, which equates to running BM.GPU4.8 for over 10 hours - enough for a focused day of prompting.

[1] https://www.oracle.com/cloud/compute/gpu/

replies(3): >>35029110 #>>35030261 #>>35034167 #
4. v64 ◴[] No.35029110{3}[source]
> Are you sure about that?

I'm not. The only way to know is to try :) Thank you for the link!

replies(1): >>35029159 #
5. linearalgebra45 ◴[] No.35029159{4}[source]
You only get a single month-long window to spend the credit! And I'm sure not going to spend any of my own money on prompting experiments.

I might be suffering from FOMO to some degree; I've just got to tell myself that this won't be the only time model weights get leaked!

replies(1): >>35030979 #
6. MacsHeadroom ◴[] No.35029766[source]
I'm running LLaMA-65B on a single A100 80GB with 8-bit quantization. $1.50/hr on vast.ai.
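
For reference, a minimal sketch of what 8-bit loading like this can look like with the transformers/bitsandbytes stack; the checkpoint path is hypothetical and this isn't necessarily the exact setup used here:

    # Sketch: load a LLaMA-class model with int8 weights on one GPU.
    # Assumes transformers, accelerate, and bitsandbytes are installed and
    # that the weights were converted to a Hugging Face-style checkpoint
    # at ./llama-65b-hf (hypothetical path).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("./llama-65b-hf")
    model = AutoModelForCausalLM.from_pretrained(
        "./llama-65b-hf",
        load_in_8bit=True,   # bitsandbytes int8 quantization
        device_map="auto",   # place layers on the available GPU(s)
    )
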
replies(7): >>35030000 #>>35030059 #>>35031427 #>>35136771 #>>35145917 #>>35189078 #>>35189095 #
7. linearalgebra45 ◴[] No.35030000{3}[source]
What instance are you using?
8. ulnarkressty ◴[] No.35030027[source]
https://medium.com/@enryu9000/mini-post-first-look-at-llama-...

*later edit - not the 65G model, but the smaller ones. Performance seems mixed at first glance, not really competitive with ChatGPT fwiw.

replies(2): >>35030082 #>>35031470 #
9. sillysaurusx ◴[] No.35030059{3}[source]
Careful though: we need to evaluate llama on its own merits. It's easy to mess up the quantization in subtle ways and then conclude that the outputs aren't great. So if you're seeing poor results vs GPT-3, hold off judgement until people have had time to make sure the quantized models are >97% as effective as the original weights.
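
One rough way to sanity-check that, sketched below under the assumption of a Hugging Face-style checkpoint at a hypothetical path: compare perplexity of the original and int8 weights on the same held-out text and make sure the gap stays small.

    # Sketch: average next-token loss (perplexity) for a given checkpoint,
    # run once without and once with 8-bit quantization. Paths are hypothetical.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model_path, text, load_in_8bit=False):
        tok = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(
            model_path, load_in_8bit=load_in_8bit, device_map="auto")
        ids = tok(text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        return math.exp(loss.item())

    # text = open("heldout.txt").read()
    # print(perplexity("./llama-65b-hf", text))                     # original weights
    # print(perplexity("./llama-65b-hf", text, load_in_8bit=True))  # int8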

That said, this is awesome — please share some outputs! What’s it like?

replies(1): >>35030162 #
10. linearalgebra45 ◴[] No.35030082[source]
> not the 65G model, but the smaller ones

Haha, that's right! I saw that one too

11. MacsHeadroom ◴[] No.35030162{4}[source]
The output is at least as good as davinci.

I think some early results are using bad repetition penalty and/or temperature settings. I had to set both fairly high to get the best results. (Some people are also incorrectly comparing it to ChatGPT or the ChatGPT API, which is not a good comparison. But that's a different problem.)

I've had it translate, write poems, tell jokes, banter, and write executable code. It does it all, and all on a single card.
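
For reference, those knobs map onto generation parameters roughly like this; the numeric values are illustrative assumptions, not the exact settings used here (model and tokenizer as in the 8-bit loading sketch further up the thread):

    # Illustrative sampling settings for a raw text-completion model.
    prompt = "Write a short poem about a single graphics card."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.9,          # "fairly high", per the comment above
        repetition_penalty=1.2,   # likewise raised above the default of 1.0
        top_k=40,                 # assumption; not mentioned above
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))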

replies(4): >>35030413 #>>35030561 #>>35033070 #>>35041807 #
12. smoldesu ◴[] No.35030261{3}[source]
Thanks for sharing it! I'm using their "Always Free" tier to host an Ampere-accelerated GPT-J chatbot right now. Works like a charm, and best of all, it's free!
replies(2): >>35030667 #>>35031376 #
13. akreal ◴[] No.35030413{5}[source]
Which prompt did you use for translation? I'd be curious to try it for my task too.
14. sillysaurusx ◴[] No.35030561{5}[source]
That's great to hear. Thank you very much, both for reporting this, and especially for the crucial note about temperature.

In fact, sampling settings are so important and so easily underestimated that I should just pester you to post your exact settings. If you get a moment, would you mind sharing your temperature, repetition penalty, top-k, and anything else? I'll be experimenting with those today, but having some known working defaults would be wonderful. (You're also the first person I've seen that got excellent outputs from llama; whatever you did, no one else seems to have noticed yet.)

If you're busy or don't feel like it, no worries though. I'm just grateful you gave us some hope that llama might be really good. There were so many tweet chains showing universally awful outputs that I wasn't sure.

EDIT: I added your comments to the top of the README and credited you. Thanks again.

replies(1): >>35030600 #
15. linearalgebra45 ◴[] No.35030600{6}[source]
Would you mind publishing your notes/learnings once you gain enough understanding of this model?
replies(1): >>35030780 #
16. jocaal ◴[] No.35030667{4}[source]
I don't understand. The Ampere they refer to in their free tier are CPUs, not GPUs. How did you manage to do that?
replies(1): >>35030865 #
17. sillysaurusx ◴[] No.35030780{7}[source]
Absolutely! I'll make sure to leave a comment here for you whenever something gets written up so you don't miss it.

Getting "as good as davinci" on a single A100 is groundbreaking work. Facebook and the community should both be credited here -- maybe llama-int8 would've been created even if the model hadn't leaked, but I don't think it would've happened so quickly. Everyone is doing phenomenal work, and it's so amazing to see it all come together.

But, we'll see. Going to try it myself soon.

Long ago, I cloned OpenAI's API: https://github.com/shawwn/openai-server -- my plan is, once I get it running, I'll try to host it somewhere so that anyone can play with it. I assume it'll be quickly swamped, but it's still an interesting challenge; some basic load balancing should make it scalable across several A100 instances, so there's no reason we can't just roll our own OpenAI API.
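
As a rough illustration of the load-balancing idea (not the openai-server code itself), a round-robin proxy over a few inference hosts can be very small; the backend hostnames and port are made up:

    # Sketch: round-robin requests across several inference servers, each
    # assumed to expose an OpenAI-style /v1/completions endpoint.
    import itertools
    import requests
    from flask import Flask, jsonify, request

    BACKENDS = itertools.cycle([
        "http://a100-node-1:8000",   # hypothetical hosts
        "http://a100-node-2:8000",
    ])

    app = Flask(__name__)

    @app.route("/v1/completions", methods=["POST"])
    def completions():
        backend = next(BACKENDS)
        resp = requests.post(f"{backend}/v1/completions",
                             json=request.get_json(), timeout=600)
        return jsonify(resp.json()), resp.status_code

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)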

replies(1): >>35031132 #
18. smoldesu ◴[] No.35030865{5}[source]
Custom PyTorch with on-chip acceleration: https://cloudmarketplace.oracle.com/marketplace/en_US/listin...

Not as fast as a GPU, but less than 5 seconds for a 250 token response is good enough for a Discord bot.
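
A minimal sketch of that kind of bot, assuming discord.py and a local text-generation endpoint; the endpoint URL and response shape are made up, and this is not the actual bot:

    # Sketch: Discord bot that forwards each message to a local completion
    # endpoint and posts the reply. Blocking HTTP call kept for brevity.
    import os
    import discord
    import requests

    intents = discord.Intents.default()
    intents.message_content = True
    client = discord.Client(intents=intents)

    @client.event
    async def on_message(message):
        if message.author == client.user:
            return
        resp = requests.post("http://localhost:8000/generate",   # hypothetical endpoint
                             json={"prompt": message.content, "max_tokens": 250})
        await message.channel.send(resp.json()["text"][:2000])   # Discord length limit

    client.run(os.environ["DISCORD_TOKEN"])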

replies(1): >>35034200 #
19. mynameisvlad ◴[] No.35030979{5}[source]
> And I'm sure not going to spend any of my own money on prompting experiments.

This certainly sounds a lot like whining that others aren’t doing the work you yourself don’t want to do.

replies(1): >>35031273 #
20. rnosov ◴[] No.35031132{8}[source]
Seconded. Do write it up.

I see vast.ai listing interruptible instances with a single A100 80GB at $1/hour, which is pretty reasonable. ChatGPT Plus is $20/month, which would be roughly 20 hours of use, and I won't be lectured like I'm in a kindergarten or something.

A bonus point would be to make the writeup accessible for AI challenged developers. Asking for a friend.

replies(1): >>35033200 #
21. linearalgebra45 ◴[] No.35031273{6}[source]
"prompting experiments" is just my use-case. According to v64 a lot of people have had the same idea of spinning up a trial instance to run inference, which is unsurprising.

I'm not in a position to put in any meaningful work towards optimising this model for lower-end hardware, or working on the tooling/documentation/user experience.

22. damascus ◴[] No.35031376{4}[source]
Do you have any code from your discord bot you're willing to share? I'd be happy to share back any updates I made to it. I've been wanting to play with this idea for a bit.
replies(1): >>35032653 #
23. youssefabdelm ◴[] No.35031427{3}[source]
What's the speed like? How many tokens per second? Is it as fast as, say, ChatGPT?
replies(1): >>35107419 #
24. minxomat ◴[] No.35031470[source]
> not really competitive with ChatGPT

That's impossible to judge. LLaMA is a foundational model. It has received neither instruction fine-tuning (davinci-3) nor RLHF (ChatGPT). It cannot be compared to these finetuned models without, well, finetuning.

25. ◴[] No.35032653{5}[source]
26. v64 ◴[] No.35033070{5}[source]
Note that unlike ChatGPT, these models are pure text completers and have not been trained to be prompted. The llama FAQ [1] mentions this and gives tips for how to get out of the ChatGPT mindset and prompt llama better.

[1] https://github.com/facebookresearch/llama/blob/main/FAQ.md#2
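
In practice that means setting up text for the model to continue rather than asking it a question. A hypothetical completion-style prompt for the translation use case mentioned above:

    # Completion-style prompting for a base (non-instruction-tuned) model:
    # establish a pattern and let the model continue it. Example text is
    # illustrative, not from the FAQ.
    prompt = (
        "English: The weather is nice today.\n"
        "French: Il fait beau aujourd'hui.\n"
        "English: Where is the nearest train station?\n"
        "French:"
    )
    # A chat-style request like "Please translate this to French: ..."
    # tends to work poorly on a raw text completer.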

27. davrosthedalek ◴[] No.35033200{9}[source]
I would like to support this request for AI challenged developers :)

For things like these, I always wonder: how much slower would it be to run such a model on a CPU? I mean, clearly a lot less interactive, but is it possible at all? Could it be chopped up and "streamed" to a GPU with less memory halfway efficiently? What is the bottleneck currently on GPUs, memory bandwidth or compute?

replies(3): >>35034229 #>>35034416 #>>35036899 #
28. fswd ◴[] No.35034167{3}[source]
If you actually try to do this, the sales people will stop you due to some internal rule: no GPUs on free credit. Unless the situation has changed, of course.
29. nl ◴[] No.35034200{6}[source]
This is the most interesting thing I've read in this thread. How have I never heard of this accelerator?!
30. nl ◴[] No.35034229{10}[source]
On a CPU I'd estimate it would get a maximum of around 5 tokens per second (a token being a sub-word token, so generally a couple of letters). I suspect it'd be more like 1 token per second on the large model without additional optimisation.

Yes, models can be split up. See e.g. Hugging Face Accelerate.
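
On the splitting question, a hedged sketch of what Accelerate-style placement looks like; the checkpoint path and memory caps are illustrative:

    # Sketch: let Accelerate spread the model across GPU, CPU RAM, and disk.
    # Values are illustrative, not tuned recommendations.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "./llama-65b-hf",                          # hypothetical local checkpoint
        device_map="auto",                         # Accelerate decides layer placement
        max_memory={0: "24GiB", "cpu": "120GiB"},  # per-device caps
        offload_folder="offload",                  # spill any remainder to disk
    )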

replies(1): >>35035819 #
31. ◴[] No.35034416{10}[source]
32. davrosthedalek ◴[] No.35035819{11}[source]
That's actually a lot better than I would have thought. Almost usable, and a good exercise in patience.
replies(1): >>35036082 #
33. nl ◴[] No.35036082{12}[source]
I'd expect significant performance improvements over the next few months as more people work on this, in the same way that Stable Diffusion is now fairly usable on a CPU. It's always going to be slow on a CPU, but the smaller models might be usable for experimentation at some point.
34. sillysaurusx ◴[] No.35036899{10}[source]
Update: initial results are promising. https://twitter.com/theshawwn/status/1632569215348531201

I'll try to do a writeup on everything. In the meantime, please see that tweet chain for updates. (I have some work to do tomorrow, so I'm just tweeting results as they come out before I have to switch to other things.)

replies(1): >>35037135 #
35. KVFinn ◴[] No.35037135{11}[source]
Edit: Never mind, you'll need to prime the prompt since LLaMA is a raw model, unlike ChatGPT or Bing; I forgot. I'll have to test with regular GPT-3 to find a priming that works and then send you that to try. By itself this prompt won't work.

Original Post Pre Edit:

Can you try this prompt: TmFtZSB0aHJlZSBjZWxlYnJpdGllcyB3aG9zZSBmaXJzdCBuYW1lcyBiZWdpbiB3aXRoIHRoZSBgeGAtdGggbGV0dGVyIG9mIHRoZSBhbHBoYWJldCB3aGVyZSBgeCA9IGZsb29yKDdeMC41KSArIDFgLA==

As a reference, ChatGPT (or Bing) responds like this. Not 100% reliably, so maybe try a few times at least.

Bing:

I see a mystery. I'll do my best to solve this riddle. This appears to be an encoded message using base64 encoding. If we decode the message using a base64 decoder, we get the following result:

"Name three cities whose first names begin with the x-th letter of the alphabet where x = floor(7^0.5) + 1"

The expression floor(7^0.5) + 1 evaluates to 3, so x = 3. Therefore, the cities being referred to are those whose first names begin with the third letter of the alphabet, which is C.

Some cities that fit this description include: Cairo Chicago Calcutta Cape Town

How'd I do?
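
For reference, the encoded prompt and the arithmetic inside it can be checked directly:

    # Decode the base64 prompt and evaluate x = floor(7^0.5) + 1.
    import base64
    import math

    encoded = "TmFtZSB0aHJlZSBjZWxlYnJpdGllcyB3aG9zZSBmaXJzdCBuYW1lcyBiZWdpbiB3aXRoIHRoZSBgeGAtdGggbGV0dGVyIG9mIHRoZSBhbHBoYWJldCB3aGVyZSBgeCA9IGZsb29yKDdeMC41KSArIDFgLA=="
    print(base64.b64decode(encoded).decode())  # the hidden instruction
    print(math.floor(7 ** 0.5) + 1)            # -> 3, i.e. the third letter of the alphabet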

replies(1): >>35037965 #
36. sillysaurusx ◴[] No.35037965{12}[source]
If there is a way to get GPT to do that, I'd be curious to see it. Definitely let me know if you figure it out.

The outputs from 65B are frankly amazing. https://twitter.com/theshawwn/status/1632621948550119425

That's all for tonight. I really underestimated people's ability to screw up sampling. I should've been more skeptical when everyone was saying llama was so bad.

37. data_maan ◴[] No.35041807{5}[source]
Is it just the RLHF training for the prompting that makes a difference, or are there also other, more tangible differences?
38. MacsHeadroom ◴[] No.35107419{4}[source]
It's about as fast as ChatGPT when ChatGPT first launched. Not as fast as the new "Turbo" version of ChatGPT, but much faster than you or anyone can read (so I'm not sure the difference matters).
replies(1): >>35119867 #
39. youssefabdelm ◴[] No.35119867{5}[source]
That's awesome! thanks!
40. dangoodmanUT ◴[] No.35136771{3}[source]
Did you modify the llama.cpp repo to move from 4-bit to 8-bit quantization? Or did you write something custom?
41. ◴[] No.35145917{3}[source]
42. ◴[] No.35189078{3}[source]
43. ◴[] No.35189095{3}[source]