
1311 points by msoad
jart ◴[] No.35393615[source]
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
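
To see the idea in miniature, here is a rough sketch in Python/NumPy rather than the actual C++ change in llama.cpp; the file name, dtype, and sizes are made up. The point is only that mmap defers I/O to page faults, so only the pages actually touched are read, and the OS page cache can back (and share) the weights across runs and processes:

    import numpy as np

    # Hypothetical weight file; the float32 dtype and slice size are assumptions.
    # np.memmap maps the file instead of read()-ing it, so no bytes are copied here.
    weights = np.memmap("model.bin", dtype=np.float32, mode="r")

    # Pages are faulted in lazily: only the region actually touched gets read from
    # disk, and with a warm page cache the "load" is close to free.
    first_block = weights[:1024]
    print(first_block.mean())
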
replies(24): >>35393868 #>>35393942 #>>35394089 #>>35394097 #>>35394107 #>>35394203 #>>35394208 #>>35394244 #>>35394259 #>>35394288 #>>35394408 #>>35394881 #>>35395091 #>>35395249 #>>35395858 #>>35395995 #>>35397318 #>>35397499 #>>35398037 #>>35398083 #>>35398427 #>>35402974 #>>35403334 #>>35468946 #
intelVISA ◴[] No.35394288[source]
Didn't expect to see two titans today: ggerganov AND jart. Can y'all slow down? You make us mortals look bad :')

Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions.

Perhaps SWE is dead after all, but LLMs didn't kill it...

replies(11): >>35395112 #>>35395145 #>>35395165 #>>35395404 #>>35396298 #>>35397484 #>>35398972 #>>35399367 #>>35400001 #>>35400090 #>>35456064 #
MontyCarloHall ◴[] No.35395145[source]
>how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions

Probably not all that much. All of the Python numeric computing frameworks (Numpy, PyTorch, TensorFlow, etc.) are basically just wrappers for lower level C++/C/Fortran code. Unless you’re doing something boneheaded and converting framework-native tensors to Python objects, passing tensors around within a framework essentially just passes a pointer around, which has marginal overhead even when encapsulated in a bloated Python object.
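
A small sketch of that distinction (PyTorch; the sizes and names are made up, and this is not anyone's production code): passing a tensor keeps the data in native memory, while tolist() boxes every element as a Python object:

    import torch

    x = torch.randn(1024, 1024)

    # Passing the tensor around just passes a reference to the underlying
    # C++ storage; no element-wise work or copying happens at the Python level.
    def total(t):
        return t.sum()

    total(x)

    # The "boneheaded" path: materialize ~1M boxed Python floats.
    # This is orders of magnitude slower and bloats memory.
    as_python_objects = x.tolist()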

Indeed, a huge number of PyTorch operations are explicitly zero copy: https://pytorch.org/docs/stable/tensor_view.html
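
For example (a quick sketch against the linked docs; the variable names are just for illustration), view operations return tensors that share the original storage:

    import torch

    x = torch.arange(12)

    v = x.view(3, 4)        # reshape as a view: same storage, new shape
    t = v.transpose(0, 1)   # transpose is also a view

    # All three tensors share one underlying buffer; nothing was copied.
    assert v.data_ptr() == x.data_ptr()
    assert t.data_ptr() == x.data_ptr()

    # Writes through the view are visible in the original.
    v[0, 0] = 100
    assert x[0].item() == 100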

replies(2): >>35396982 #>>35408170 #
oceanplexian ◴[] No.35396982[source]
It’s not that performance is the issue; it’s that the stack is unmaintainable and prone to break. Exceptions aren’t handled right, and the dependencies are a disaster (proprietary NVIDIA drivers + CUDA + PyTorch + the various versions of everything).

This leads to all sorts of bugs and breaking changes that are cool in an academic or hobbyist setting but a total headache on a large production system.

replies(3): >>35397515 #>>35397551 #>>35398182 #
rfoo ◴[] No.35397515[source]
The stack is very volatile and unmaintainable because it doesn't need to be maintainable. That's exactly why we have unmaintainable software in other domains, too. Over the last 10 years there has ALWAYS been a totally new model architecture with new operations (or, in the case of CV, new bizarre uses of Conv). By the time you get your performant, perfectly maintainable masterpiece ready, it's not needed anymore. The stack naturally optimizes for flexibility and iteration speed, for the same reason people use Rails.

In fact, I'd love to see the Transformer really dominate. Then we can start to converge on the software. And compute-wise, transformers are really simple, too!
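
To the last point, the core of a transformer block is mostly matrix multiplies plus a softmax. A toy sketch (shapes and names made up, no masking or multi-head splitting):

    import math
    import torch

    def attention(q, k, v):
        # Scaled dot-product attention: two matmuls and a softmax.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        return torch.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(2, 8, 16)   # (batch, seq, dim), arbitrary sizes
    out = attention(q, k, v)            # -> shape (2, 8, 16)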

replies(3): >>35399833 #>>35401019 #>>35401942 #
colinsane ◴[] No.35401942[source]
> The stack optimizes for flexibility and iteration speed naturally

“Unmaintainable” (as in “I’m spending an hour each day sorting out which dep update broke my project”) usually gets in the way of the former point.