
343 points | sillysaurusx
v64 | No.35028738
If anyone is interested in running this at home, please follow the llama-int8 project [1]. LLM.int8() is a recent development allowing LLMs to run in half the memory without loss of performance [2]. Note that at the end of [2]'s abstract, the authors state "This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software." I'm very thankful we have researchers like this further democratizing access to this data and prying it out of the hands of the gatekeepers who wish to monetize it.

[1] https://github.com/tloen/llama-int8

[2] https://arxiv.org/abs/2208.07339
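
To give a concrete sense of what [2] enables, here is a minimal sketch using the Hugging Face transformers + bitsandbytes stack (one implementation of LLM.int8()). The model path is a placeholder and the details are my assumptions, not taken from the llama-int8 repo in [1]:

    # Minimal 8-bit loading sketch (pip install transformers accelerate bitsandbytes).
    # "path/to/llama-65b-hf" is a hypothetical placeholder, not a real model id.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "path/to/llama-65b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",   # spread layers across available GPUs/CPU
        load_in_8bit=True,   # LLM.int8() weight quantization via bitsandbytes
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))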

replies(5): >>35028950, >>35029068, >>35029601, >>35030214, >>35030868
rnosov | No.35028950
Hmmm, the GitHub repo suggests that you might be able to run the 65B model on a single A100 80GB card. At the moment, the spot price on Google Cloud for this card is $1.25/hour, which makes it not so crazy expensive...
replies(1): >>35031058
nabla9 | No.35031058
At $1.25/hour, it would take roughly a year of continuous GPU time before the rental cost exceeds the price of an A100 80GB card.
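
Back-of-the-envelope version of that break-even point (the card's retail price, somewhere in the low five figures at the time, is my estimate rather than a number from the thread):

    >>> 1.25 * 24 * 365   # dollars for a year of continuous rental
    10950.0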
replies(1): >>35032315
metadat | No.35032315
I think OP meant that $1.25/hr makes this accessible for people to try it out themselves cost-effectively, without having to spend thousands or tens of thousands up front to obtain a capable hardware rig.

Obviously $1.25/hr running 24/7 does add up quickly; after one month the bill would come to $900.
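
That monthly figure checks out:

    >>> 1.25 * 24 * 30   # dollars for a month of continuous rental
    900.0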
