v64:
If anyone is interested in running this at home, please follow the llama-int8 project [1]. LLM.int8() is a recent development allowing LLMs to run in half the memory without loss of performance [2]. Note that at the end of [2]'s abstract, the authors state "This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software." I'm very thankful we have researchers like this further democratizing access to this data and prying it out of the hands of the gatekeepers who wish to monetize it.

[1] https://github.com/tloen/llama-int8

[2] https://arxiv.org/abs/2208.07339
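
For anyone who wants to try this at home, here is a minimal sketch of the int8 path, assuming the Hugging Face transformers + accelerate + bitsandbytes stack rather than the llama-int8 loader itself; the model name, prompt, and generation settings are only illustrative:

    # Minimal sketch: load a causal LM with 8-bit weights via bitsandbytes.
    # Assumes transformers, accelerate, and bitsandbytes are installed and a CUDA GPU is available.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-6.7b"  # illustrative checkpoint released in fp16

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",   # shard layers across available GPUs/CPU
        load_in_8bit=True,   # LLM.int8() quantization from bitsandbytes
    )

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Since the weights are held as int8 instead of fp16, a checkpoint that needs roughly 13 GB in fp16 fits in about half that.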

swyx:
why is it that these models tend to be released as float16, with the conversion to int8 left to the reader? is there something special about training that defaults you to float16?
sillysaurusx:
They were trained in fp16, and researchers tend to release whatever format they trained in. It’s hard enough to do a large release that it’s best not to take on too many goals, for the same reason most software projects try not to do too much lest their schedule slip.
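
The conversion itself is mechanically simple for the non-outlier part of the weights: per-row (vector-wise) absmax quantization of the fp16 matrices. A rough sketch in plain PyTorch, ignoring the outlier-dimension handling that the LLM.int8() paper adds to keep large models lossless:

    import torch

    def absmax_quantize_int8(w_fp16: torch.Tensor):
        # One scale per output row: map the largest |value| in the row to 127.
        scale = w_fp16.abs().amax(dim=1, keepdim=True).float() / 127.0
        w_int8 = torch.clamp((w_fp16.float() / scale).round(), -127, 127).to(torch.int8)
        return w_int8, scale.to(torch.float16)

    def dequantize(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return (w_int8.float() * scale.float()).to(torch.float16)

    w = torch.randn(4096, 4096, dtype=torch.float16)
    w_q, s = absmax_quantize_int8(w)
    print(w.nelement() * w.element_size() / 2**20, "MiB as fp16")      # ~32 MiB
    print(w_q.nelement() * w_q.element_size() / 2**20, "MiB as int8")  # ~16 MiB
    print((dequantize(w_q, s) - w).abs().max())  # small round-trip error

On its own this is lossy; the paper’s contribution is keeping a handful of outlier feature dimensions in fp16 so the quantized model matches fp16 performance.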

Still, I’m a little sad they didn’t release the optimizer weights. It would’ve given us so much valuable info about the dataset, among other benefits.