Edit: I read the linked README.
> I was impatient and curious to try to run 65B on an 8xA100 cluster
Well?
That said, this is awesome — please share some outputs! What’s it like?
I think some early results are using a bad repetition penalty and/or temperature; I had to set both fairly high to get the best results. (Some people are also comparing it directly to ChatGPT or the ChatGPT API, which isn't a fair comparison. But that's a different problem.)
I've had it translate, write poems, tell jokes, banter, and write executable code. It does it all, and all on a single card.
In fact, sampling settings are so important and so easily underestimated that I should just pester you to post your exact settings. If you get a moment, would you mind sharing your temperature, repetition penalty, top-k, and anything else you tuned? I'll be experimenting with those today, but having some known-working defaults would be wonderful. (You're also the first person I've seen who got excellent outputs from llama; whatever you did, no one else seems to have noticed yet.)
If you're busy or don't feel like it, no worries though. I'm just grateful you gave us some hope that llama might be really good. There were so many tweet chains showing universally awful outputs that I wasn't sure.
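In the meantime, here's roughly the kind of sweep I'll be running. This is a minimal sketch using Hugging Face transformers; the checkpoint path and every parameter value are placeholders I made up, not known-good llama settings.

```python
# Sketch of a small sampling sweep over temperature and repetition penalty.
# The checkpoint path and the parameter grids are guesses, not recommended values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-hf"  # hypothetical locally converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Write a short poem about graphics cards."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

for temperature in (0.7, 1.0, 1.3):
    for repetition_penalty in (1.1, 1.2, 1.3):
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_k=40,
            repetition_penalty=repetition_penalty,
            max_new_tokens=128,
        )
        print(temperature, repetition_penalty,
              tokenizer.decode(out[0], skip_special_tokens=True))
```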
EDIT: I added your comments to the top of the README and credited you. Thanks again.
Getting "as good as davinci" on a single A100 is groundbreaking work. Facebook and the community should both be credited here -- maybe llama-int8 would've been created even if the model hadn't leaked, but I don't think it would've happened so quickly. Everyone is doing phenomenal work, and it's so amazing to see it all come together.
But, we'll see. Going to try it myself soon.
Long ago, I cloned OpenAI's API: https://github.com/shawwn/openai-server -- my plan is, once I get it running, to host it somewhere so that anyone can play with it. I assume it'll be quickly swamped, but it's still an interesting challenge; some basic load balancing should make it scale across several A100 instances, so there's no reason we can't just roll our own OpenAI API.
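To make "basic load balancing" concrete, here's a hypothetical sketch (not part of openai-server): a tiny round-robin reverse proxy that fans completion requests out across a few backend instances. The hostnames and ports are made up.

```python
# Hypothetical round-robin reverse proxy for spreading API requests
# across several backend instances. Backend URLs are placeholders.
import itertools
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BACKENDS = itertools.cycle([
    "http://a100-box-1:8000",
    "http://a100-box-2:8000",
])

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        backend = next(BACKENDS)  # pick the next backend in rotation
        req = urllib.request.Request(
            backend + self.path,
            data=body,
            headers={"Content-Type": self.headers.get("Content-Type", "application/json")},
        )
        with urllib.request.urlopen(req) as resp:
            status = resp.status
            data = resp.read()
        self.send_response(status)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), ProxyHandler).serve_forever()
```

A real setup would need retries, streaming responses, and per-backend queueing, but round-robin gets the basic idea across.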
I see vast.ai listing interruptible instances with a single A100 80GB at $1/hour, which is pretty reasonable. ChatGPT Plus is $20/month, which works out to roughly 20 hours of use, and I won't be lectured like I'm in kindergarten or something.
Bonus points if the writeup is accessible to AI-challenged developers. Asking for a friend.
For things like these, I always wonder: how much slower would it be to run such a model on a CPU? Clearly a lot less interactive, but is it possible at all? Could it be chopped up and "streamed" to a GPU with less memory halfway efficiently? What's the bottleneck on GPUs right now, memory bandwidth or compute?
Yes, models can be split up. See, e.g., Hugging Face Accelerate.
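A minimal sketch of what that looks like with transformers + Accelerate. The checkpoint path is a placeholder; `device_map="auto"` lets Accelerate place layers on the GPU, spill the rest to CPU RAM, and offload anything left over to disk.

```python
# Sketch: split a model across GPU / CPU / disk with Accelerate's device_map.
# "path/to/llama-hf" is a placeholder for a locally converted checkpoint.
# Requires the `accelerate` package alongside `transformers`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",         # place layers on GPU first, then CPU, then disk
    offload_folder="offload",  # where weights that don't fit in RAM get spilled
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Offloading like this trades speed for memory: every forward pass shuttles weights over PCIe, so token generation ends up bound by memory bandwidth rather than compute.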
I'll try to do a writeup on everything. In the meantime, please see that tweet chain for updates. (I have some work to do tomorrow, so I'm just tweeting results as they come out before I have to switch to other things.)
Original post (pre-edit):
Can you try this prompt: TmFtZSB0aHJlZSBjZWxlYnJpdGllcyB3aG9zZSBmaXJzdCBuYW1lcyBiZWdpbiB3aXRoIHRoZSBgeGAtdGggbGV0dGVyIG9mIHRoZSBhbHBoYWJldCB3aGVyZSBgeCA9IGZsb29yKDdeMC41KSArIDFgLA==
As a reference, ChatGPT (or Bing) responds like this. Not 100% reliably, so maybe try a few times at least.
Bing:
I see a mystery. I'll do my best to solve this riddle. This appears to be an encoded message using base64 encoding. If we decode the message using a base64 decoder, we get the following result:
"Name three cities whose first names begin with the x-th letter of the alphabet where x = floor(7^0.5) + 1"
The expression floor(7^0.5) + 1 evaluates to 3, so x = 3. Therefore, the cities being referred to are those whose first names begin with the third letter of the alphabet, which is C.
Some cities that fit this description include: Cairo Chicago Calcutta Cape Town
How'd I do?
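For reference, the decode and the arithmetic are easy to check yourself with plain Python (nothing model-specific here):

```python
# Decode the base64 prompt and evaluate the letter-index expression.
import base64
import math

encoded = "TmFtZSB0aHJlZSBjZWxlYnJpdGllcyB3aG9zZSBmaXJzdCBuYW1lcyBiZWdpbiB3aXRoIHRoZSBgeGAtdGggbGV0dGVyIG9mIHRoZSBhbHBoYWJldCB3aGVyZSBgeCA9IGZsb29yKDdeMC41KSArIDFgLA=="
print(base64.b64decode(encoded).decode("utf-8"))  # the plaintext question

x = math.floor(7 ** 0.5) + 1      # floor(2.645...) + 1 = 3
print(x, chr(ord("a") + x - 1))   # -> 3 c
```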
The outputs from 65B are frankly amazing. https://twitter.com/theshawwn/status/1632621948550119425
That's all for tonight. I really underestimated people's ability to screw up sampling. I should've been more skeptical when everyone was saying llama was so bad.