Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions.
Perhaps SWE is dead after all, but LLMs didn't kill it...
Probably not all that much. All of the Python numeric computing frameworks (Numpy, PyTorch, TensorFlow, etc.) are basically just wrappers for lower level C++/C/Fortran code. Unless you’re doing something boneheaded and converting framework-native tensors to Python objects, passing tensors around within a framework essentially just passes a pointer around, which has marginal overhead even when encapsulated in a bloated Python object.
Indeed, a huge number of PyTorch operations are explicitly zero copy: https://pytorch.org/docs/stable/tensor_view.html
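For illustration, a minimal sketch of what those view ops do (the tensor names are just for the example):

```python
import torch

# view()/slicing return new tensor objects that share the same storage,
# so no element data is copied.
x = torch.arange(12)
y = x.view(3, 4)     # zero-copy reshape: same underlying buffer
z = y[1]             # slicing is also a view
z[0] = -1            # the write is visible through x as well
print(x[4])                           # tensor(-1)
print(x.data_ptr() == y.data_ptr())   # True: same memory address
```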
Why spend developer hours (i.e. effort) if you can just scale the infra for a little cash? Do it in small enough increments and the added costs only outweigh FTEs if you add up all the scaling events over a long enough time scale.
Suddenly it takes way too much for way too little, but it cost half as many overpaid developers who can't be arsed to care about performance.
Edit: in case that sounds like the opposite of intended, ggerganov and jart are the outliers, the exception.
It doesn't have the easiest syntax or the best compiler support, and performance and threading are a joke. The entire language is built on hype from the time when the only two mainstream languages were C++ and Java.
This leads to all sorts of bugs and breaking changes that are cool in an academic or hobbyist setting but a total headache on a large production system.
A long story short, in the future the AI can just convert all our code to FORTH or HolyC or some "creative" combination of languages chosen by prophecy (read: hallucination) perhaps even Python — as a show of strength.
zero_buffer (CFFI, 2013) https://github.com/alex/zero_buffer/blob/master/zero_buffer....
"Buffers on the edge: Python and Rust" (2022) https://alexgaynor.net/2022/oct/23/buffers-on-the-edge/ :
> If you have a Python object and want to obtain its buffer, you can do so with memoryview in Python or PyObject_GetBuffer in C. If you’re defining a class and want to expose a buffer, you can do so in Python by… actually you can’t, only classes implemented in C can implement the buffer protocol. To implement the buffer protocol in C, you provide the bf_getbuffer and bf_releasebuffer functions which are called to obtain a buffer from an object and when that buffer is being released, respectively.
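For the Python side of that, a small sketch of what memoryview gives you (the bytearray here is just a stand-in for any buffer-protocol object):

```python
# memoryview exposes the buffer protocol from Python: slicing a view
# copies nothing, and writes go straight through to the underlying object.
buf = bytearray(b"hello world")
view = memoryview(buf)[6:]   # zero-copy slice of the underlying buffer
print(bytes(view))           # b'world'
view[0] = ord("W")           # writes through to buf
print(buf)                   # bytearray(b'hello World')
```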
iocursor (CPython C API, ~Rust std::io::Cursor) https://github.com/althonos/iocursor
Arrow Python (C++) > On disk and MemoryMappedFiles: https://arrow.apache.org/docs/python/memory.html#on-disk-and...
"Apache Arrow: Read DataFrame With Zero Memory" (2020) https://towardsdatascience.com/apache-arrow-read-dataframe-w...
pyarrow.Tensor: https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...
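A minimal sketch of the memory-mapped read path, assuming a file "data.arrow" written in the Arrow IPC file format:

```python
import pyarrow as pa

with pa.memory_map("data.arrow", "r") as source:
    # The resulting table's buffers reference the mapped pages directly;
    # nothing is copied onto the Python/C++ heap.
    table = pa.ipc.open_file(source).read_all()
    print(table.num_rows)
```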
ONNX is built on protocolbuffers/protobuf (formerly google/protobuf), while Arrow is built on google/flatbuffers.
FlatBuffers https://en.wikipedia.org/wiki/FlatBuffers :
> It supports “zero-copy” deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats however, the handling of FlatBuffers requires usually more code, and some operations are not possible (like some mutation operations).
It's the easiest among the most popular languages. It uses the fewest symbols, with parentheses and braces only for values.
Some people don't like the significant whitespace, but that helps readability.
Pull requests and stars on github? That might be a start.
https://madnight.github.io/githut/#/pull_requests/2022/4 https://madnight.github.io/githut/#/stars/2022/4
Though you may say, "but, but, all the private repos!" Then I challenge you to back up what you mean by relevance and show that Python belongs only in the "relevant 15+ years ago" category.
It's not like the zero-copy buzzword is going to help you during training: all your weights have to stay on the GPU, you're going to sample your training data randomly, and your data is on networked storage anyway, so mmap HURTS. You'd better just use O_DIRECT.
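For what that looks like in practice, a rough Linux-only sketch (the shard filename and the 1 MiB chunk size are made up; O_DIRECT needs block-aligned buffers, offsets, and lengths):

```python
import mmap
import os

BLOCK = 1 << 20  # 1 MiB, a multiple of the typical 512 B / 4 KiB block size

# Bypass the page cache entirely instead of letting mmap fault pages in.
fd = os.open("shard-0001.bin", os.O_RDONLY | os.O_DIRECT)
try:
    buf = mmap.mmap(-1, BLOCK)        # anonymous map: page-aligned, as O_DIRECT requires
    nread = os.preadv(fd, [buf], 0)   # read one aligned chunk at offset 0
finally:
    os.close(fd)
```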
Similarly, as long as you run your inference on a GPU it's not like you can mmap... And I have indeed worked on inference runtimes for mobile devices, and in the rare cases where we needed CPU only (hey, your phone has had GPUs since forever), at $PREVIOUS_JOB we did have an mmap-able model format; it also helps in TEE/SGX/whatever enclave tech. Oh, and there is no Python at all.
The recent development of ggml is interesting in that it catches a moment the "big ML shop infra" guys don't care about: running models on Apple Silicon. M1/M2s are expensive enough that we don't consider deploying them instead of those 1000000000 bizarre accelerators in production, yet everyone on HN seems to have one, and hey, it's fast enough for LMs. They are rather unique in being CPU + high-bandwidth RAM + accelerators with the RAM fully shared with the CPU, instead of some GPU shit.
tldr it's not like the "big ML shop infra" guys are stupid and leave performance on the table. They just don't run their production workloads on MacBooks. That's where the community shines, right?
In fact I'd love to see Transformers really dominate. We can then start to converge on software. And compute-wise, Transformers are really simple, too!
It doesn’t excel at anything, but anything software can do can be done in Python somehow.
So, a great pick when you’ve got no idea where you’re going to, when you’re prototyping, when you don’t care about performance or perfection.
I agree that for large scale systems when you already know what you’re doing, Python shows its limits quite soon (and we should add the problems with missing/slow type checking that slows down large scale systems development).
Huh? Why?
You can barely deploy it to the Web.
it doesn't scale performance-wise
you can't build robust abstractions
The REPL is merely OK
You can barely ship working code without containers
the syntax is hard to manipulate programmatically
Python has inertia but it's holding us back
is there any evidence that this makes it easier?
people learn python as beginners because it has a reputation for being easy for beginners
I don't see anything about the syntax that makes it inherently easier
It sounds unnecessarily weird to me that people would share Python code that simply doesn't work at all out of the box.
Compared to what? Unindented or badly indented code in other languages?
In other languages you can move code around and it still works - and nobody prevents you from adding whitespace for readability (it may even be done automatically for you).
If there was a superior alternative that covers the breadth of the Python ecosystem I’m pretty sure no one would have any scruples in using it. A programming language and its syntax is the least interesting or complex part when it comes to solving problems. Just rattling off some amazing libraries I've used over the last few years:
https://scikit-image.org - Image processing
https://imgaug.readthedocs.io - Image augmentation
https://scikit-learn.org/stable - ML
https://pymoo.org - Multi objective optimization
https://simpy.readthedocs.io/ - Discrete event simulation
https://lifelines.readthedocs.io - Survival analysis
https://bambinos.github.io/bambi - Bayesian modeling
https://unit8co.github.io/darts/ - Time series forecasting
https://abydos.readthedocs.io/en/latest/abydos.distance.html - Basically any string distance metric you can think of
The list just goes on and on.. oh yeah, some Deep Learning libraries too, which some people find useful.
Having said that, I've deployed two large Django projects on the web with tons of customers, and they run and scale just fine; they're a DREAM to maintain and develop compared to, for example, Java. I would go so far as to say the opposite: if you haven't used Python for web deployment, you've been missing out! (You lose some efficiency, I'm sure, but you gain other things.)
I liked the one way of doing most things philosophy, coming off working on a large C++ code base.
Though in practice, in many cases, mmap won't be faster; it can even be slower than open+read.
Just wanna say that this use of mmap() is clever in this context, but it should be acknowledged as a widely accepted, industry-standard practice for getting higher performance, particularly in embedded applications but also in performance-oriented apps such as digital audio workstations, video editing systems, and so on.
Tragedy of folks forgetting how to program.
This mmap() "trick" isn't a trick, its a standard practice for anyone who has cut their teeth on POSIX or embedded. See also mlock()/munlock() ..
The trope about it being the 2nd best language for everything isn't correct. It's taught in universities because it has a very short time to gratification, and the basic syntax is quite intuitive. Academics latched onto it for ML because of some excellent libraries, and it became established as a vital part of the ecosystem from there.
But it's a nightmare to support a moderate to large codebase in production, packaging continues to be a mess, and it's full of weird quirks. Great for weekend projects, but for pete's sake take a minute and port them into something more reliable before going to production with them.
Sure, but that's the thing, especially (as reflected in your examples) for machine learning. The best frameworks (PyTorch, TensorFlow, JAX) are all Python-first, with support for other languages being an afterthought at best.
The use of scripting languages (Python, and Lua for the original Torch) for ML seems to have started partly because the original users were non-developers, more from a math/stats background, and partly because an interactive REPL loop is good for a field like this that is very experimental/empirical.
Does it make sense that we're now building AGI using a scripting language? Not really, but that's where we are!
Never understood why people think that indented languages are any simpler when in fact they bring all kinds of trouble for getting things done.
but to your point, until technology itself actually replaces us, deeply skilled computer people are always going to be able to squeeze more performance out of software implemented in high level languages by those who have not studied computers extensively.
Python is more readable than C. Way better than C++. Far simpler to reason about than Java. Maybe Typescript is on a similar level, but throwing a beginner into the JS ecosystem can be daunting. Perhaps Ruby could be argued as equally simple, but it feels like that's a dead end language these days. Golang is great, but probably not as easy to get rolling with as Python.
What else? Are you going to recommend some niche language no one hires for?
The square brackets alone make it a winner: array, list, and string indexing; dictionary lookups; slices and substrings; list comprehensions. The notational convenience of this alone is immense.
Built-in lists, strings, and dicts. For the 90% of code that is not performance critical, this is a godsend. Just looking at the C++ syntax for this makes me never want to use an STL data structure for anything trivial.
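A tiny illustration of the notation being praised (the names are arbitrary):

```python
words = ["alpha", "beta", "gamma", "delta"]

print(words[1])        # list indexing -> 'beta'
print(words[1:3])      # slicing       -> ['beta', 'gamma']
print(words[0][:3])    # substring     -> 'alp'

lengths = {w: len(w) for w in words}             # dict comprehension
print(lengths["gamma"])                          # dict lookup -> 5
print([w.upper() for w in words if len(w) > 4])  # list comprehension
```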
What people sometimes fail to understand is that code is a means to an end, not an end in itself.
If you want to make code for itself, work on an opensource and/or personal project. If you are paid to work on something, you're paid for the something to get out, not for it to feature the best code ever.
Or hiring useless business people to install around the periphery of engineering. Which is funny because now tech is letting all those folks go.
OCaml is very niche; I feel it's a hard sell for a general purpose language. Haskell, 3x that.
JS and TS, could be. But are they so much better than Python, if better at all?
If you just write straight C++ (without C++xx, or anything like it) you can compile the code on machines from decades ago if you want.
What's happened to the popularity of all of these languages since 2010? Outside of JS/TS, absolutely nothing. If anything, they've lost mindshare.
You could run notebooks entirely client side https://jupyterlite.readthedocs.io/en/latest/
The startup is slow but otherwise it is pretty functional.
.NET has P/Invoke which is much nicer.
JVM is getting Panama+jextract, which is the nicest yet. You can go straight from header files to pure Java bindings which don't need any extra native code at all. But it's not shipped yet :(
Strong disagreement. Explicit types make reasoning about Java much easier, especially when you are in an unfamiliar codebase.
Python is not quite the 'write-only' language of Perl, but it is a lot easier to write it than it is to read it.
Even if it doesn't have the best syntax now (which I doubt), the tooling and libraries make it a better choice than any language that has an edge over Python's syntax.
You can easily build a standalone binary (well, it would be GiB+ if you use CUDA... but that's the cost of statically linking cu*), had you coded your model and training loop in C++.
It then happily runs everywhere as long as a NVIDIA GPU driver is available (don't need to install CUDA).
Protip: Your AI research team REALLY DOESN'T WANT TO DO THIS BECAUSE THEY LOVE PYTHON. Having Python, even with the dependency management shit, is a feature, not a bug.
(if you want Rust / Go and don't want to wrap libtorch/tf then you have a lot of work to do, but yeah, it's possible. also there are model compiler guys [1] where the promise is model.py in, model.o out, and you just link it with your code)
[1] https://mlc.ai
You’d do better complaining about Python's async support, still nascent compared to alternatives, or the lack of a JIT in the official implementation.
The forced use of spacing to delineate blocks means you will never see a bunch of brackets eating up screen space, nor the common error where someone adds another line to an if statement but forgets to add braces.
Semicolons not being conventional means less screen noise and fewer code-golf one-liners.
The focus on imperative vs. functional means you rarely ever see something like a(b(c(d(e(f(g))))))
PHP suffers greatly from poorly named standard functions on top of all of that.
Don't get me started on Ruby metaprogramming.
These are just the things I could think of off the top of my head. I do not want to spend my afternoon on this. This is just my experience from looking at code for over 20 years; you either believe it or you don't. There are no scientific studies to prove that one syntax feature is superior.
I highly doubt that everyone chose Python just because Google did. Python was a giant step forward in syntax compared to the competition back then, and even if there is a new language out there right now with better syntax, it's not going to be better by much, and it's not going to have the tooling, libraries, or community.
Suppose you have a big piece of compute hardware (e.g. at a university) which is shared by multiple users. They all want to come in and play with these models. Each one is tens to hundreds of gigabytes. Is each user supposed to have their own copy in their home directory?
But now we encounter this broken nonsense because solved problems get unsolved by bad software.
Maybe, not sure? My point was that both the syntax and Google using it was more relevant 15 years ago than now.
(I don't have much of an opinion on the 15+ years ago thing.)
Python concrete syntax is harder to manipulate programmatically compared to Javascript concrete syntax.
For instance, to insert one statement into another, we need to traverse the lines of that syntax and add the right amount of indentation. We can't just plant the syntax into the desired spot and be done with it.
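A small sketch of the chore being described (the snippet and the target function are made up):

```python
import textwrap

body = "x = compute()\nprint(x)"

# To splice the snippet into a function we must re-indent every line to match
# the insertion point; with brace-delimited syntax the text could be pasted as-is.
wrapped = "def handler():\n" + textwrap.indent(body, "    ")
print(wrapped)
```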
Is python syntax worse than any brand new languages like rust or go? Absolutely not. It's still better.
Did Google stop using it? I don't think so, but I also don't think people picked it just because Google did.
Python is not the cause of dependency hell. Deep dependency trees are. The only way to deal with this is to use separate environments and to carefully specify the exact requirements.
Those who claim some language would be a magical fix clearly lack experience in multiple languages.
Repeat, ad infinitum. In the cracks you'll find people re-learning things they should've known, if only they weren't slagging off the grey beards .. or, even worse .. as grey beards not paying attention to the discoveries of youth.
>Most people don't know about it. The ones who do, are reluctant to use it.
Not so sure about this. The reluctance is emotional, it's not technical. Nobody is killing POSIX under all of this - it is deployed. Therefore, learn it.
>so any thought devs may have had previously about using mmap() would have certainly triggered fears w.r.t. WIN32
Does not compute. Own up, you're an AI.
The impression about Haskell’s nicheness compared with OCaml prevails. But Haskell has a larger userbase and a larger library ecosystem than OCaml.
Btw, I wish they would take some inspiration from Haskell's syntax.
Haskell also has significant whitespace, but it's defined as syntactic sugar for a more traditional syntax with curly braces and semicolons.
Approximately no-one uses that curly-brace syntax, but it's good for two things:
- silences the naysayers
- more importantly: allows you to copy-paste code even into forms that mess up your indentation.
I've been very _careful_ too (using pyenv/virtualenvs etc.) with dependency management, but between Nvidia driver dependencies and "missing sqlite3/bz2" issues in the underlying interpreter (not to mention issues with different Python 3.x versions), I'm lucky to be able to even run a 'hello world' ML sample after an afternoon of fighting with it.
My Ubuntu install w/ Nvidia card only seems to recognize the GPU in some circumstances even when using the same `conda` env. Often this is remedied by rebooting the machine(?).
No idea how companies manage this stuff in production. Absolute minefield that seems to catastrophically break if you sneeze at it.
I'll admit I am not an expert in managing ML envs, but I've dealt with a lot of python environments for typical CRUD stuff, and while rough at times, it was never this bad.
At the time I evaluated other languages to learn, narrowed it down to Ruby and Python, and picked Python as I felt it had a nicer syntax than Ruby. And the "one way to do things" philosophy. This was back in 2005 or so.
What other languages of that period would you say had a nicer syntax than Python?
It's like Python 2 vs Python 3 except even worse.