My guess would be that the model is faulted into memory lazily, page by page (4K or 16K chunks), as the model is used, so only the parts that are actually needed get loaded.
The kernel also evicts old pages from the page cache to make room for new ones, especially when the machine is using most of its RAM. As with all performance things, this approach trades inference speed for memory usage, but it is likely faster overall because you don't have to read the entire file from disk at the start. Each input takes a different path through the model and will require loading more of it.
The cool part is that this memory architecture should work just fine with hardware acceleration, too, as long as the computer has unified memory (anything with an integrated GPU). This approach likely won't be possible with dedicated GPUs/VRAM.
This approach _does_ still work to run a dense model with limited memory, but the time/memory savings would just be less. The GPU doesn't multiply every matrix in the file literally simultaneously, so the page cache doesn't need to contain the entire model at once.
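That lazy faulting is easy to see for yourself. Here's a minimal Python sketch of the idea (my own toy example, not llama.cpp's actual code; the filename is a placeholder) that maps a weights file read-only and touches only a slice of it, so the kernel pages in just what gets accessed:

    # Toy sketch of mmap-based lazy loading (not llama.cpp's actual code).
    # "model.bin" is a placeholder for whatever weights file you have.
    import mmap
    import os

    fd = os.open("model.bin", os.O_RDONLY)
    size = os.fstat(fd).st_size

    # Read-only, file-backed mapping: the kernel can evict these pages at any
    # time and re-read them from the file, so nothing ever goes to swap.
    weights = mmap.mmap(fd, size, prot=mmap.PROT_READ, flags=mmap.MAP_PRIVATE)

    # No bulk read has happened yet. Touching a slice faults in only the pages
    # (4K/16K chunks) that back those bytes, plus some kernel readahead.
    first_tensor_bytes = weights[0:1 << 20]

    # Optional hint if access really is scattered (available on most Unixes).
    if hasattr(mmap, "MADV_RANDOM"):
        weights.madvise(mmap.MADV_RANDOM)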
The only reason I posted it is because Facebook had been DMCAing a few repos, and I wanted to reassure everyone that they can hack freely without worry. That’s all.
I’m really sorry if I overshadowed your moment on HN, and I feel terrible about that. I’ll try to read the room a little better before posting from now on.
Please have a wonderful weekend, and thanks so much for your hard work on LLaMA!
EDIT: The mods have mercifully downweighted my comment, which is a relief. Thank you for speaking up about that, and sorry again.
If you'd like to discuss any of the topics you originally posted about, you had some great points.
Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions.
Perhaps SWE is dead after all, but LLMs didn't kill it...
Edit: looks like there's now confirmation that running it on a 10GB VM slows inference down massively, so it looks like the only strange thing is the memory usage reading on some systems.
Did that metric meaningfully change when the amount of required memory dropped?
If the amount of diversity is lowered, I would expect that to lower the number of patterns to be modeled from the text. If that is the case, then the resulting model size itself would be lower, during and after training.
30B quantized requires 19.5 GB, not 6 GB; otherwise you get severe swapping to disk.
  model   original size   quantized size (4-bit)
  7B      13 GB           3.9 GB
  13B     24 GB           7.8 GB
  30B     60 GB           19.5 GB
  65B     120 GB          38.5 GB
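As a rough sanity check on those quantized sizes (my own back-of-the-envelope numbers, assuming roughly 5 bits per weight for 4-bit values plus per-block scales, and approximate parameter counts):

    # Rough back-of-the-envelope check of the 4-bit sizes above.
    # Assumes ~5 bits/weight (4-bit values + per-block scale factors);
    # parameter counts are approximate.
    params = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}
    for name, n in params.items():
        gb = n * 5 / 8 / 1e9
        print(f"{name}: ~{gb:.1f} GB")  # e.g. 30B -> ~20 GB, close to 19.5 GB above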
Probably not all that much. All of the Python numeric computing frameworks (Numpy, PyTorch, TensorFlow, etc.) are basically just wrappers for lower level C++/C/Fortran code. Unless you’re doing something boneheaded and converting framework-native tensors to Python objects, passing tensors around within a framework essentially just passes a pointer around, which has marginal overhead even when encapsulated in a bloated Python object.
Indeed, a huge number of PyTorch operations are explicitly zero copy: https://pytorch.org/docs/stable/tensor_view.html
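For example (a tiny self-contained illustration, not taken from the linked docs), reshapes, transposes, and basic slices in PyTorch are views that share storage with the original tensor:

    # These operations return views; no tensor data is copied.
    import torch

    x = torch.arange(16, dtype=torch.float32)
    a = x.view(4, 4)   # reshape -> view
    b = a.t()          # transpose -> view
    c = a[:, 1:3]      # basic slicing -> view

    # Same underlying storage, so the pointer to the first element matches.
    assert a.data_ptr() == x.data_ptr()

    # Mutating through one view is visible through the others.
    a[0, 0] = 42.0
    assert x[0] == 42.0 and b[0, 0] == 42.0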
Why waste developer hours (meaning effort) if you can just scale the infra for a little cash? Do it in small enough increments and the cost increases only outweigh FTEs if you add up all the scaling events over a long enough time scale.
Suddenly it takes way too much for way too little, but it costs half as many overpaid developers who can’t be arsed to care about performance.
Edit: in case that sounds like the opposite of what I intended: ggerganov and jart are the outliers, the exception.
Thank you for the amazing work. It’s so appreciated by so many on HN like me, I’m sure.
H_s(X) := -\sum_{x \in X_s} p(x) \log p(x)
where X_s := the set of all s-grams from the training set? That seems like it would eventually become hard to impossible to actually compute. Even if you could, what would it tell you? Or, wait... are you referring to running such an analysis on the output of the model? Yeah, that might prove interesting...
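For what it's worth, here's a toy sketch (mine, with a made-up corpus) of what that computation looks like; the reason it becomes intractable is that the table of distinct s-grams explodes combinatorially as s grows:

    # Toy s-gram entropy: H_s = -sum over distinct s-grams of p(x) * log2 p(x).
    import math
    from collections import Counter

    def sgram_entropy(tokens, s):
        grams = [tuple(tokens[i:i + s]) for i in range(len(tokens) - s + 1)]
        counts = Counter(grams)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    tokens = "the cat sat on the mat and the cat ran".split()
    for s in (1, 2, 3):
        print(s, round(sgram_entropy(tokens, s), 3))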
How is that possible? Is the model being compressed even more (even after converting to 4 bit) somehow? Or is most of the model unused?
It also shows the number of impostors in this thread and the inflated titles of self-proclaimed 'seniors' who can't optimize ML code well enough to even be in the same league as Tunney (jart) and Gerganov (ggerganov).
Not even ChatGPT or Copilot could submit a change like this, let alone completely rewrite and optimize the code the way they have.
I think that's just an accounting thing. Many UNIX variants will not "charge" read only memory mapped pages to a process, because they could be shared among many processes and evicted at will.
It doesn't have the easiest syntax or the best compiler support, and performance and threading are a joke. The entire language is riding on hype from the time when the only two mainstream languages were C++ and Java.
This leads to all sorts of bugs and breaking changes that are cool in an academic or hobbyist setting but a total headache on a large production system.
Long story short, in the future the AI can just convert all our code to FORTH or HolyC or some "creative" combination of languages chosen by prophecy (read: hallucination), perhaps even Python, as a show of strength.
zero_buffer (CFFI, 2013) https://github.com/alex/zero_buffer/blob/master/zero_buffer....
"Buffers on the edge: Python and Rust" (2022) https://alexgaynor.net/2022/oct/23/buffers-on-the-edge/ :
> If you have a Python object and want to obtain its buffer, you can do so with memoryview in Python or PyObject_GetBuffer in C. If you’re defining a class and want to expose a buffer, you can do so in Python by… actually you can’t, only classes implemented in C can implement the buffer protocol. To implement the buffer protocol in C, you provide the bf_getbuffer and bf_releasebuffer functions which are called to obtain a buffer from an object and when that buffer is being released, respectively.
iocursor (CPython C API, ~Rust std::io::Cursor) https://github.com/althonos/iocursor
Arrow Python (C++) > On disk and MemoryMappedFile s: https://arrow.apache.org/docs/python/memory.html#on-disk-and...
"Apache Arrow: Read DataFrame With Zero Memory" (2020) https://towardsdatascience.com/apache-arrow-read-dataframe-w...
pyarrow.Tensor: https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...
ONNX is built on protocolbuffers/protobufs (google/protobufs), while Arrow is built on google/flatbuffers.
FlatBuffers https://en.wikipedia.org/wiki/FlatBuffers :
> It supports “zero-copy” deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats however, the handling of FlatBuffers requires usually more code, and some operations are not possible (like some mutation operations).
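Tying the buffer-protocol links above together, here's a tiny sketch (my own) of the consumer side in pure Python, where memoryview gives zero-copy access to anything that exposes a buffer (bytes, bytearray, array.array, NumPy arrays, mmap objects, ...):

    # memoryview: zero-copy access to an object exposing the buffer protocol.
    buf = bytearray(b"hello world")

    view = memoryview(buf)   # no copy
    sub = view[6:]           # slicing a memoryview copies nothing either

    sub[0:5] = b"WORLD"      # writes go straight through to the original buffer
    assert buf == bytearray(b"hello WORLD")

    # A copy only happens when you explicitly ask for one, e.g. bytes(sub).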
It's the easiest among the most popular languages. It uses the fewest symbols: parentheses and braces only for values.
Some people don't like the significant whitespace, but it helps readability.
Pull requests and stars on GitHub? That might be a start.
https://madnight.github.io/githut/#/pull_requests/2022/4 https://madnight.github.io/githut/#/stars/2022/4
Though you may say, "but what about all the private repos!" Then I challenge you to back up what you mean by relevance and prove that Python was somehow more relevant 15+ years ago than it is now.
It's not like the zero copy buzzword is going to help you during training, all your weights have to stay on GPU, you are going to sample your training data randomly and your data is on a networked storage anyway, so mmap HURTS. You'd better just O_DIRECT.
Similarly, as long as you run your inference on GPU it's not like you can mmap... And I have indeed worked on inference runtimes for mobile devices, and in the rare cases where we needed to use CPU only (hey, your phone has had a GPU forever), at $PREVIOUS_JOB we did have an mmap-able model format; it also helps in TEE/SGX/whatever enclave tech. Oh, and there is no Python at all.
The recent development of ggml is interesting in that it catches a moment the "big ML shop infra" guys don't care about: running models on Apple Silicon. M1/M2s are expensive enough that we don't consider deploying them in production instead of those 1000000000 bizarre accelerators, yet everyone on HN seems to have one, and hey, it's fast enough for LMs. They are rather unique: CPU + high-bandwidth RAM + accelerators that fully share that RAM with the CPU, instead of some GPU shit.
tl;dr it's not like the "big ML shop infra" guys are stupid and leave performance on the table. They just don't run their production workloads on MacBooks. That's where the community shines, right?
In fact I'd love to see Transformers really dominate. We can then start to converge on software. And compute-wise, transformers are really simple, too!
It doesn’t excel at anything, but anything software can do can be done in Python somehow.
So it’s a great pick when you’ve got no idea where you’re going, when you’re prototyping, or when you don’t care about performance or perfection.
I agree that for large scale systems when you already know what you’re doing, Python shows its limits quite soon (and we should add the problems with missing/slow type checking that slows down large scale systems development).
Huh? Why?
- You can barely deploy it to the web.
- It doesn't scale performance-wise.
- You can't build robust abstractions.
- The REPL is merely OK.
- You can barely ship working code without containers.
- The syntax is hard to manipulate programmatically.
Python has inertia, but it's holding us back.
Is there any evidence that this makes it easier?
People learn Python as beginners because it has a reputation for being easy for beginners. I don't see anything about the syntax that makes it inherently easier.
It sounds unnecessarily weird to me that people would share Python code that simply doesn't work at all out of the box.
Compared to what? Unindented or badly indented code in other languages?
In other languages you can move code around and it still works - and nobody prevents you from adding whitespace for readability (it may even be done automatically for you).
If there was a superior alternative that covers the breadth of the Python ecosystem I’m pretty sure no one would have any scruples in using it. A programming language and its syntax is the least interesting or complex part when it comes to solving problems. Just rattling off some amazing libraries I've used over the last few years:
https://scikit-image.org - Image processing
https://imgaug.readthedocs.io - Image augmentation
https://scikit-learn.org/stable - ML
https://pymoo.org - Multi objective optimization
https://simpy.readthedocs.io/ - Discrete event simulation
https://lifelines.readthedocs.io - Survival analysis
https://bambinos.github.io/bambi - Bayesian modeling
https://unit8co.github.io/darts/ - Time series forecasting
https://abydos.readthedocs.io/en/latest/abydos.distance.html - Basically any string distance metric you can think of
The list just goes on and on.. oh yeah, some Deep Learning libraries too, which some people find useful.
Having said that, I've deployed two large Django projects on the web with tons of customers, and they run and scale just fine; they're a DREAM to maintain and develop compared to, for example, Java. I would go so far as to say the opposite: if you haven't used Python for web deployment you've been missing out! (You lose some efficiency I'm sure, but you gain other things.)
I liked the one way of doing most things philosophy, coming off working on a large C++ code base.
Though in practice, in many cases, mmap won't be faster; it can even be slower than open+read.
Just wanna say that mmap() is used cleverly in this context, but it should be acknowledged as a widely accepted, industry-standard practice for getting higher performance, particularly in embedded applications but also in performance-oriented apps such as digital audio workstations, video editing systems, and so on.
Tragedy of folks forgetting how to program.
This mmap() "trick" isn't a trick, it's standard practice for anyone who has cut their teeth on POSIX or embedded work. See also mlock()/munlock()...
The trope about it being the 2nd best language for everything isn't correct. It's taught in universities because it has a very short time to gratification, and the basic syntax is quite intuitive. Academics latched onto it for ML because of some excellent libraries, and it became established as a vital part of the ecosystem from there.
But it's a nightmare to support a moderate to large codebase in production, packaging continues to be a mess, and it's full of weird quirks. Great for weekend projects, but for pete's sake take a minute and port them into something more reliable before going to production with them.
Sure, but that is the thing, especially (as reflected in your examples) for machine learning. The best frameworks (PyTorch, TensorFlow, JAX) are all Python, with support for other languages being an afterthought at best.
The use of scripting languages (Python, Lua in the original Torch) for ML seems to have started partly because the original users were non-developers, more from a math/stats background, and partly because an interactive REPL loop is good for a field like this that is very experimental/empirical.
Does it make sense that we're now building AGI using a scripting language? Not really, but that's where we are!
Another suggestion is that not all of the word/token embedding table might be used, which would be a function of the input used to test, but that would be easy enough to disprove as there would then be different memory usage for different inputs.
It seems possible the reported memory usage is lower than reality if that's how mmap/top work. In any case, a good use of mmap it seems, especially since for a multi-layer model layer weights will be used sequentially so paged load-on-demand will work relatively well even in a low memory situation.
Never understood why people think that indented languages are any simpler when in fact they bring all kinds of trouble for getting things done.
But to your point: until technology itself actually replaces us, deeply skilled computer people are always going to be able to squeeze more performance out of software implemented in high-level languages by those who have not studied computers extensively.
Python is more readable than C. Way better than C++. Far simpler to reason about than Java. Maybe Typescript is on a similar level, but throwing a beginner into the JS ecosystem can be daunting. Perhaps Ruby could be argued as equally simple, but it feels like that's a dead end language these days. Golang is great, but probably not as easy to get rolling with as Python.
What else? Are you going to recommend some niche language no one hires for?
The square brackets alone make it a winner. Array, list, and string indexing. Dictionary lookups. Slices and substrings. List comprehensions. The notational convenience of this alone is immense.
Built-in lists, strings, and dicts. For the 90% of code that is not performance-critical, this is a godsend. Just looking at the C++ syntax for this makes me never want to use an STL data structure for anything trivial.
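A quick toy illustration of how much mileage that one bracket notation gets you:

    # Indexing, slicing, dict lookups, and comprehensions all use one notation.
    words = "the quick brown fox".split()

    first = words[0]                  # list indexing
    tail = words[1:]                  # list slice
    sub = "quick brown"[:5]           # substring
    ages = {"ada": 36, "alan": 41}    # built-in dict literal
    ada = ages["ada"]                 # dict lookup
    lengths = [len(w) for w in words if w != "the"]   # list comprehension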
What people sometimes fail to understand is that code is a means to an end, not an end in itself.
If you want to write code for its own sake, work on an open-source and/or personal project. If you are paid to work on something, you're paid to get that something out, not for it to feature the best code ever.
The peculiarity here is that tools like htop were reporting an 8x improvement, which is interesting, because RAM use is only 2x better due to my change. The rusage.com page fault reporting was also interesting. This is not due to sparseness; it's because htop was subtracting MAP_SHARED memory. The htop docs on my computer say that the color purple is used to display shared memory and yellow is used to display kernel file caches, but it turned out it just uses yellow for both, even though it shouldn't, because mincore() reported that the shared memory had been loaded into the resident set size.
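If you want to check that accounting yourself, here's a Linux-only sketch (my own, not jart's tooling; the filename is a placeholder) that maps a file, faults its pages in, and prints the kernel's own Rss/Shared_Clean numbers for the mapping from /proc/self/smaps, which is exactly the distinction htop is trying to display:

    # Linux-only sketch: compare Rss vs Shared_Clean for a file-backed mapping.
    # "model.bin" is a placeholder for any large file you have lying around.
    import mmap
    import os

    path = os.path.abspath("model.bin")
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, prot=mmap.PROT_READ, flags=mmap.MAP_SHARED)

    # Touch one byte per page so every page is faulted into the resident set.
    for off in range(0, size, mmap.PAGESIZE):
        buf[off]

    with open("/proc/self/smaps") as f:
        smaps = f.read()

    # Print the kernel's accounting lines for our mapping's entry.
    entry = smaps[smaps.index(path):]
    for line in entry.splitlines()[1:25]:
        if line.startswith(("Size:", "Rss:", "Shared_Clean:", "Private_Clean:")):
            print(line)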
Or hiring useless business people to install around the periphery of engineering. Which is funny because now tech is letting all those folks go.
OCaml is very niche; I feel it’s a hard sell as a general-purpose language. Haskell, 3x that.
JS and TS, could be. But are they so much better than Python, if better at all?
If you just write straight C++ (without C++xx features or anything like it) you can compile the code on machines from decades ago if you want.
What's happened to the popularity of all of these languages since 2010? Outside of JS/TS, absolutely nothing. If anything, they've lost mindshare.
You could run notebooks entirely client side https://jupyterlite.readthedocs.io/en/latest/
The startup is slow but otherwise it is pretty functional.
.NET has P/Invoke which is much nicer.
JVM is getting Panama+jextract, which is the nicest yet. You can go straight from header files to pure Java bindings which don't need any extra native code at all. But it's not shipped yet :(
Strong disagreement. Explicit types make reasoning about Java much easier, especially when you are in an unfamiliar codebase.
Python is not quite the 'write-only' language of Perl, but it is a lot easier to write it than it is to read it.
This isn't just a matter of making the 30B model run in 6GB or whatever. You can now run the largest model, without heavy quantization, and let the OS figure it out. It won't be as fast as having "enough" memory, but it will run.
In theory you could always have done this with swap, but swap is even slower because evictions have to be written back to swap (and wear out your SSD if your swap isn't on glacially slow spinning rust) instead of just discarded because the OS knows where to read it back from the filesystem.
This should also make it much more efficient to run multiple instances at once because they can share the block cache.
(I wonder if anybody has done this with Stable Diffusion etc.)
Even if it doesn't have the best syntax now (which I doubt), the tooling and libraries make it a better choice than any language that has an edge over Python's syntax.
Had you coded your model and training loop in C++, you could easily build a standalone binary (well, it would be GiB+ if you use CUDA... but that's the cost of statically linking cu*).
It then happily runs everywhere as long as a NVIDIA GPU driver is available (don't need to install CUDA).
Protip: Your AI research team REALLY DON'T WANT TO DO THIS BECAUSE THEY LOVE PYTHON. Having Python, even with the dependency management shit, is a feature, not a bug.
(If you want Rust / Go and don't want to wrap libtorch/tf then you have a lot of work to do, but yeah, it's possible. Also there are the model-compiler guys [1], where the promise is model.py in, model.o out; you just link it with your code.)
[1] https://mlc.ai
You’d do better complaining about async support that is still nascent compared to alternatives, or the lack of a JIT in the official implementation.
The forced use of spacing to delineate blocks means you will never see a bunch of brackets eating up screen space and the common error where someone adds another line to an if statement but doesn't add braces.
Semicolons not being conventional means less screen noise and fewer code-golf one-liners.
The focus on imperative over functional style means you rarely ever see something like a(b(c(d(e(f(g)))))).
PHP suffers greatly from poorly named standard functions on top of all of that.
Don't get me started on Ruby metaprogramming.
These are just the things I could think of off the top of my head; I do not want to spend my afternoon on this. This is just my experience from looking at code for over 20 years; you either believe it or you don't. There are no scientific studies proving that one syntax feature is superior.
I highly doubt that everyone chose python just because Google did. Python was a giant step in syntax compared to the competition back then, and now even if there is a new language out there right now that has a better syntax, it's not going to be better by much, and it is not going to have the tooling, libraries, or the community.
Suppose you have a big piece of compute hardware (e.g. at a university) which is shared by multiple users. They all want to come in and play with these models. Each one is tens to hundreds of gigabytes. Is each user supposed to have their own copy in their home directory?
But now we encounter this broken nonsense because solved problems get unsolved by bad software.
Maybe, not sure? My point was that both the syntax and Google using it was more relevant 15 years ago than now.
(I don't have much of an opinion on the 15+ years ago thing.)
Python's concrete syntax is harder to manipulate programmatically than JavaScript's concrete syntax.
For instance, to insert one statement into another block, we need to traverse the lines of that syntax and add the right amount of indentation. We can't just plant the syntax into the desired spot and be done with it.
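A toy example of what that means in practice (my own; the function names are made up):

    # Splicing a statement into a Python block means re-indenting it to match;
    # in a brace-delimited language you could paste the text verbatim.
    import textwrap

    block = "if ready:\n    launch()\n"
    new_stmt = "log('about to launch')"

    # Match the indentation of the insertion point before planting the code.
    indented = textwrap.indent(new_stmt, "    ")
    spliced = block.replace("    launch()", indented + "\n    launch()")
    print(spliced)
    # if ready:
    #     log('about to launch')
    #     launch()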
Is python syntax worse than any brand new languages like rust or go? Absolutely not. It's still better.
Did Google stop using it? I don't think so, but I also don't think people picked it just because Google did.
Python is not the cause of dependency hell. Deep dependency trees are. The only way to deal with this is to use separate environments and to carefully specify the exact requirements.
Those who claim some language would be a magical fix clearly lack experience in multiple languages.
Repeat, ad infinitum. In the cracks you'll find people re-learning things they should have known, if only they weren't slagging off the grey beards... or, even worse, grey beards not paying attention to the discoveries of youth.
> Most people don't know about it. The ones who do, are reluctant to use it.
Not so sure about this. The reluctance is emotional, it's not technical. Nobody is killing POSIX under all of this - it is deployed. Therefore, learn it.
> so any thought devs may have had previously about using mmap() would have certainly triggered fears w.r.t. WIN32
Does not compute. Own up, you're an AI.
Because the text we write is not evenly distributed random noise, what we encode into it (by writing) is entropy.
Because LLMs model text with inference, they model all of the entropy that is present.
That would mean that the resulting size would be a measure of entropy (the sum of patterns) divided by repetition (recurring patterns). In this count, I would consider each unique token on its own to be an instance of the identity pattern.
So to answer both questions: yes.
The impression about Haskell’s nicheness compared with OCaml prevails. But Haskell has a larger userbase and a larger library ecosystem than OCaml.
Btw, I wish they would take some inspiration from Haskell's syntax.
Haskell also has significant whitespace, but it's defined as syntactic sugar for a more traditional syntax with curly braces and semicolons.
Approximately no-one uses that curly-brace syntax, but it's good for two things:
- silences the naysayers
- more importantly: allows you to copy-paste code even into forms that mess up your indentation.
Assuming you have the model file downloaded (you can use wget to download it), these are the instructions to install and run:
pkg install git
pkg install cmake
pkg install build-essential
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
./main -m /path/to/your/model.bin -p "your prompt here"
I've been very _careful_ too (using pyenv/virtualenvs etc.) with dependency management, but between Nvidia driver dependencies and "missing sqlite3/bz2" issues related to the underlying interpreter (not to mention issues with different Python 3.x versions), I'm lucky to be able to even run a 'hello world' ML sample after an afternoon of fighting with it.
My Ubuntu install w/ Nvidia card only seems to recognize the GPU in some circumstances even when using the same `conda` env. Often this is remedied by rebooting the machine(?).
No idea how companies manage this stuff in production. Absolute minefield that seems to catastrophically break if you sneeze at it.
I'll admit I am not an expert in managing ML envs, but I've dealt with a lot of python environments for typical CRUD stuff, and while rough at times, it was never this bad.
At the time I evaluated other languages to learn, narrowed it down to Ruby and Python, and picked Python as I felt it had a nicer syntax than Ruby. And the "one way to do things" philosophy. This was back in 2005 or so.
What other languages of that period would you say had a nicer syntax than Python?
16 cores would be about 4x faster than the default 4 cores. Eventually you hit memory bottlenecks, so 32 cores is not twice as fast as 16 cores, unfortunately.
It's like Python 2 vs Python 3 except even worse.