
343 points | sillysaurusx
v64 | No.35028738
If anyone is interested in running this at home, please follow the llama-int8 project [1]. LLM.int8() is a recent development allowing LLMs to run in half the memory without loss of performance [2]. Note that at the end of [2]'s abstract, the authors state "This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software." I'm very thankful we have researchers like this further democratizing access to this data and prying it out of the hands of the gatekeepers who wish to monetize it.

[1] https://github.com/tloen/llama-int8

[2] https://arxiv.org/abs/2208.07339
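
To give a concrete sense of what [2] enables, here is a minimal sketch using the Hugging Face transformers + bitsandbytes stack (one implementation of LLM.int8()). The model path is a placeholder and the details are my assumptions, not taken from the llama-int8 repo in [1]:

    # Minimal 8-bit loading sketch (pip install transformers accelerate bitsandbytes).
    # "path/to/llama-65b-hf" is a hypothetical placeholder, not a real model id.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "path/to/llama-65b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",   # spread layers across available GPUs/CPU
        load_in_8bit=True,   # LLM.int8() weight quantization via bitsandbytes
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))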

replies(5): >>35028950, >>35029068, >>35029601, >>35030214, >>35030868
rnosov | No.35028950
Hmmm, the GitHub repo suggests that you might be able to run the 65B model on a single A100 80GB card. At the moment, the spot price on Google Cloud for this card is $1.25/hour, which makes it not so crazy expensive...
replies(1): >>35031058
nabla9 | No.35031058
At $1.25/hour, it would take roughly a year of continuous GPU time before the rental cost exceeds the price of an A100 80GB card.
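
Back-of-the-envelope version of that break-even point (the card's retail price, somewhere in the low five figures at the time, is my estimate rather than a number from the thread):

    >>> 1.25 * 24 * 365   # dollars for a year of continuous rental
    10950.0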
replies(1): >>35032315
metadat | No.35032315
I think OP meant that $1.25/hr makes this accessible for people to try it out themselves cost-effectively, without having to spend thousands or tens of thousands up front to obtain a capable hardware rig.

Obviously $1.25/hr running 24/7 does add up quickly; after one month the bill would come to $900.
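
That monthly figure checks out:

    >>> 1.25 * 24 * 30   # dollars for a month of continuous rental
    900.0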
