343 points by sillysaurusx
linearalgebra45:
It's been enough time since this leaked, so my question is: why aren't there already blog posts of people blowing their $300 of starter credit with ${cloud_provider} on a few hours' experimentation running inference on this 65B model?

Edit: I read the linked README.

> I was impatient and curious to try to run 65B on an 8xA100 cluster

Well?

v64:
The compute needed to run 65B naively was only available on AWS (and perhaps Azure, which I don't work with), and the required instance types have recently been unavailable to the public (it seems everyone had the same idea to hop on this and try to run it). As I mention in my other post here [1], the memory requirements have since been lowered by other work, and it should now be possible to run the 65B model on a provider like CoreWeave.

[1] https://news.ycombinator.com/item?id=35028738

MacsHeadroom:
I'm running LLaMA-65B on a single 80GB A100 with 8-bit quantization. $1.50/hr on vast.ai
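
A minimal sketch of how a setup like this can work, assuming Hugging Face transformers with bitsandbytes and a checkpoint already converted to the HF format (the path below is a hypothetical placeholder); this is an illustration, not the commenter's exact configuration:

    # Load a converted LLaMA-65B checkpoint with int8 weights (bitsandbytes)
    # so it fits on a single 80GB GPU. All paths are hypothetical.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_PATH = "path/to/llama-65b-hf"   # hypothetical converted checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        load_in_8bit=True,   # int8 quantization: roughly 65GB vs ~130GB in fp16
        device_map="auto",   # place layers on the available GPU(s)
    )

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))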

sillysaurusx:
Careful though: we need to evaluate LLaMA on its own merits. It's easy to mess up the quantization in subtle ways and then conclude that the outputs aren't great. So if you're seeing poor results vs. GPT-3, hold off on judgement until people have had time to confirm that the quantized models retain >97% of the effectiveness of the original weights.

That said, this is awesome — please share some outputs! What’s it like?
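
On the quantization question, one rough sanity check (a sketch under assumptions, not an established protocol) is to measure perplexity of the 8-bit model on a held-out text and compare it with the number the fp16 weights produce on the same text; a good int8 quantization should move it very little. The model path and evaluation file below are hypothetical placeholders:

    # Perplexity of the 8-bit model over non-overlapping 2048-token windows.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_PATH = "path/to/llama-65b-hf"   # hypothetical converted checkpoint
    EVAL_FILE = "heldout.txt"             # hypothetical evaluation text

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH, load_in_8bit=True, device_map="auto"
    )

    ids = tokenizer(open(EVAL_FILE).read(), return_tensors="pt").input_ids
    window = 2048                          # LLaMA's context length
    total_nll, total_tokens = 0.0, 0

    for start in range(0, ids.size(1) - 1, window):
        chunk = ids[:, start:start + window].to(model.device)
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        n = chunk.size(1) - 1              # loss is averaged over predicted tokens
        total_nll += out.loss.item() * n
        total_tokens += n

    print(f"8-bit perplexity: {math.exp(total_nll / total_tokens):.2f}")
    # Compare against the same measurement with the fp16 weights.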

MacsHeadroom:
The output is at least as good as davinci.

I think some early results are using bad repetition penalty and/or temperature settings; I had to set both fairly high to get the best results. (Some people are also comparing it to ChatGPT or the ChatGPT API, which isn't a fair comparison, but that's a different problem.)

I've had it translate, write poems, tell jokes, banter, and write executable code. It does it all, and all on a single card.
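
For reference, the kind of sampling settings being discussed map directly onto transformers' generate() arguments; the values below are illustrative guesses, not the commenter's actual configuration, and the checkpoint path is again a hypothetical placeholder:

    # Sampling with an elevated temperature and a repetition penalty.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_PATH = "path/to/llama-65b-hf"   # hypothetical converted checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH, load_in_8bit=True, device_map="auto"
    )

    prompt = "Write a short poem about running a 65B model on a single GPU."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,          # sample instead of greedy decoding
        temperature=0.9,         # higher = flatter distribution, more variety
        repetition_penalty=1.2,  # >1.0 discourages verbatim loops
        top_p=0.95,
        max_new_tokens=200,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))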

data_maan:
Is it just the RLHF training for prompt-following that makes the difference, or are there also other, more tangible differences?