291 points rbanffy | 40 comments
sgarland ◴[] No.44004897[source]
> Instead, many reach for multiprocessing, but spawning processes is expensive

Agreed.

> and communicating across processes often requires making expensive copies of data

SharedMemory [0] exists. Never understood why this isn’t used more frequently. There’s even a ShareableList which does exactly what it sounds like, and is awesome.

[0]: https://docs.python.org/3/library/multiprocessing.shared_mem...
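For readers who have not used it, here is a minimal sketch of the ShareableList idea, assuming a single child process that attaches by name. The function names `bump_first` and `main` are made up for illustration; only `ShareableList`, its `shm` attribute, and `Process` come from the standard library.

```python
from multiprocessing import Process
from multiprocessing.shared_memory import ShareableList


def bump_first(name: str) -> None:
    """Attach to an existing ShareableList by name and increment item 0."""
    shared = ShareableList(name=name)  # attaches to the block; no copy of the data
    try:
        shared[0] += 1
    finally:
        shared.shm.close()  # detach only; the creator is responsible for unlink()


def main() -> None:
    counters = ShareableList([0, 10, 20])  # creates the shared block
    try:
        child = Process(target=bump_first, args=(counters.shm.name,))
        child.start()
        child.join()
        print(list(counters))  # the child's write is visible here
    finally:
        counters.shm.close()
        counters.shm.unlink()  # release the block once nobody needs it


if __name__ == "__main__":
    main()
```

Only the block's name crosses the process boundary; the list contents stay in place.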

replies(8): >>44004956 #>>44005006 #>>44006103 #>>44006145 #>>44006664 #>>44006670 #>>44007267 #>>44013159 #
1. chubot ◴[] No.44006145[source]
Spawning processes generally takes much less than 1 ms on Unix

Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports

It's 1 to 2 orders of magnitude difference, so it's worth being precise
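A rough way to check the interpreter figure on your own machine is to time a child interpreter from Python. This is only a sketch: `startup_ms` is a hypothetical helper, the measurement includes process creation as well as interpreter startup, and `asyncio` is just an arbitrary example of a heavier import.

```python
import subprocess
import sys
import time


def startup_ms(args: list[str]) -> float:
    """Wall-clock milliseconds to start a fresh interpreter, run args, and exit."""
    start = time.perf_counter()
    subprocess.run([sys.executable, *args], check=True)
    return (time.perf_counter() - start) * 1e3


if __name__ == "__main__":
    print(f"bare interpreter: {startup_ms(['-c', 'pass']):.1f} ms")
    print(f"with one import:  {startup_ms(['-c', 'import asyncio']):.1f} ms")
```

Running `python -X importtime -c "import asyncio"` breaks the import cost down per module.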

This is a fallacy with say CGI. A CGI in C, Rust, or Go works perfectly well.

e.g. sqlite.org runs with a process PER REQUEST - https://news.ycombinator.com/item?id=3036124

replies(9): >>44006287 #>>44007950 #>>44008877 #>>44009754 #>>44009755 #>>44009805 #>>44010011 #>>44012318 #>>44013651 #
2. Sharlin ◴[] No.44006287[source]
Unix is not the only platform though (and is process creation fast on all Unices or just Linux?) The point about interpreter init overhead is, of course, apt.
replies(1): >>44007828 #
3. btilly ◴[] No.44007828[source]
Process creation should be fast on all Unices. If it isn't, then the lowly shell script (heavily used in Unix) is going to perform very poorly.
replies(4): >>44009682 #>>44010066 #>>44012374 #>>44013686 #
4. charleshn ◴[] No.44007950[source]
> Spawning processes generally takes much less than 1 ms on Unix

It depends on whether one uses clone, fork, posix_spawn etc.

Fork can take a while depending on the size of the address space, number of VMAs etc.

replies(2): >>44009524 #>>44009676 #
5. LPisGood ◴[] No.44008877[source]
My understanding is that spawning a thread takes just a few microseconds, so whether you’re talking about a process or a Python interpreter process there are still orders of magnitude to be gained.
6. crackez ◴[] No.44009524[source]
Fork on Linux should use copy-on-write vmpages now, so if you fork inside python it should be cheap. If you launch a new Python process from let's say the shell, and it's already in the buffer cache, then you should only have to pay the startup CPU cost of the interpreter, since the IO should be satisfied from buffer cache...
replies(1): >>44010504 #
7. knome ◴[] No.44009676[source]
for glibc and linux, fork just calls clone. as does posix_spawn, using the flag CLONE_VFORK.
8. kragen ◴[] No.44009682{3}[source]
While I think you've been using Unix longer than I have, shell scripts are known for performing very poorly, and on PDP-11 Unix (where perhaps shell scripts were most heavily used, since Perl didn't exist yet) fork() couldn't even do copy-on-write; it had to literally copy the process's entire data segment, which in most cases also contained a copy of its code. Moving to paged machines like the VAX and especially the 68000 family made it possible to use copy-on-write, but historically speaking, Linux has often been an order of magnitude faster than most other Unices at fork(). However, I think people mostly don't use those Unices anymore. I imagine the BSDs have pretty much caught up by now.

https://news.ycombinator.com/item?id=44009754 gives some concrete details on fork() speed on current Linux: 50μs for a small process, 700μs for a regular process, 1300μs for a venti Python interpreter process, 30000–50000μs for Python interpreter creation. This is on a CPU of about 10 billion instructions per second per core, so forking costs on the order of ½–10 million instructions.
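The conversion from time to instruction count is plain multiplication. A small sketch, taking the 10 billion instructions per second quoted above as given (the helper name `instructions` is made up):

```python
# 10 billion instructions per second is 10,000 instructions per microsecond.
INSTRUCTIONS_PER_MICROSECOND = 10_000


def instructions(microseconds: int) -> int:
    """Approximate instructions executed in the given number of microseconds."""
    return microseconds * INSTRUCTIONS_PER_MICROSECOND


if __name__ == "__main__":
    timings_us = {
        "fork, small process": 50,
        "fork, regular process": 700,
        "fork, Python process": 1300,
        "new interpreter (low end)": 30_000,
        "new interpreter (high end)": 50_000,
    }
    for label, us in timings_us.items():
        print(f"{label:>27}: {us:>6} µs = {instructions(us):>11,} instructions")
```

On those figures a fork costs between half a million and 13 million instructions, and a fresh interpreter 300 to 500 million.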

9. kragen ◴[] No.44009754[source]
To be concrete about this, http://canonical.org/~kragen/sw/dev3/forkovh.c took 670μs to fork, exit, and wait on the first laptop I tried it on, but only 130μs compiled with dietlibc instead of glibc. On a 2.3 GHz E5-2697 Xeon, it took 130μs compiled with glibc.

httpdito http://canonical.org/~kragen/sw/dev3/server.s (which launches a process per request) seems to take only about 50μs because it's not linked with any C library and therefore only maps 5 pages. Also, that doesn't include the time for exit() because it runs multiple concurrent child processes.

On this laptop, a Ryzen 5 3500U running at 2.9GHz, forkovh takes about 330μs built with glibc and about 130–140μs built with dietlibc, and `time python3 -c True` takes about 30000–50000μs. I wrote a Python version of forkovh http://canonical.org/~kragen/sw/dev3/forkovh.py and it takes about 1200μs to fork(), _exit(), and wait().

If anyone else wants to clone that repo and test their own machines, I'm interested to hear the results, especially if they aren't in Linux. `make forkovh` will compile the C version.

1200μs is pretty expensive in some contexts but not others. Certainly it's cheaper than spawning a new Python interpreter by more than an order of magnitude.
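For anyone who wants a number without cloning the repo, a comparable measurement can be sketched in a few lines. This is not the linked forkovh.py, just an assumed equivalent of the same fork(), _exit(), wait() cycle; it is Unix-only and `time_fork_exit_wait` is a hypothetical name.

```python
import os
import time


def time_fork_exit_wait(rounds: int = 100) -> float:
    """Mean wall-clock microseconds per fork()/_exit()/wait() cycle."""
    start = time.perf_counter()
    for _ in range(rounds):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: leave immediately, skipping interpreter teardown
        os.waitpid(pid, 0)  # parent: reap the child before the next round
    elapsed = time.perf_counter() - start
    return elapsed / rounds * 1e6


if __name__ == "__main__":
    print(f"{time_fork_exit_wait():.0f} µs per fork/_exit/wait")
```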

replies(1): >>44011765 #
10. jaoane ◴[] No.44009755[source]
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports

That's lucky. On constrained systems launching a new interpreter can very well take 10 seconds. Python is ssssslllloooowwwww.

11. morningsam ◴[] No.44009805[source]
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms

Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter, which takes low-single-digit ms as well.

replies(2): >>44010280 #>>44011993 #
12. ori_b ◴[] No.44010011[source]
As another example: I run https://shithub.us with shell scripts, serving a terabyte or so of data monthly (mostly due to AI crawlers that I can't be arsed to block).

I'm launching between 15 and 3000 processes per request. While Plan 9 is about 10x faster at spawning processes than Linux, it's telling that 3000 C processes launching in a shell is about as fast as one python interpreter.

replies(1): >>44010564 #
13. fredoralive ◴[] No.44010066{3}[source]
Python runs on other operating systems, like NT, where AIUI processes are rather more heavyweight.

Not all use cases of Python and Windows intersect (how much web server stuff is a Windows / IIS / SQL Server / Python stack? Probably not many, although WISP is a nice acronym), but you’ve still got to bear it in mind for people doing heavy numpy stuff on their work laptop or whatever.

14. zahlman ◴[] No.44010280[source]
Even when the 'spawn' strategy is used (default on Windows, and can be chosen explicitly on Linux), the overhead can largely be avoided. (Why choose it on Linux? Apparently forking can cause problems if you also use threads.) Python imports can be deferred (`import` is a statement, not a compiler or pre-processor directive), and child processes (regardless of the creation strategy) name the main module as `__mp_main__` rather than `__main__`, allowing the programmer to distinguish. (Being able to distinguish is of course necessary here, to avoid making a fork bomb - since the top-level code runs automatically and `if __name__ == '__main__':` is normally top-level code.)

But also keep in mind that cleanup for a Python process also takes time, which is harder to trace.

Refs:

https://docs.python.org/3/library/multiprocessing.html#conte... https://stackoverflow.com/questions/72497140
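A minimal sketch of both points, assuming the 'spawn' strategy is selected explicitly. The names `describe`, `child_task`, and `main` are illustrative, and `platform` merely stands in for an expensive import.

```python
import multiprocessing as mp


def describe() -> str:
    # Deferred import: the cost is paid only by a process that calls describe().
    import platform
    return f"{__name__} on {platform.system()}"


def child_task() -> None:
    print("child: ", describe())


def main() -> None:
    ctx = mp.get_context("spawn")  # choose the start strategy explicitly
    proc = ctx.Process(target=child_task)
    proc.start()
    proc.join()
    print("parent:", describe())


# Spawned children re-import this file under a different module name, so this
# guard stops them from calling main() again and spawning recursively.
if __name__ == "__main__":
    main()
```

Run as a script, the two printed lines show the module name each process sees.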

replies(1): >>44010625 #
15. charleshn ◴[] No.44010504{3}[source]
> Fork on Linux should use copy-on-write vmpages now, so if you fork inside python it should be cheap.

No, that's exactly the point I'm making: copying PTEs is not cheap on a large address space with many VMAs.

You can run a simple python script allocating a large list and see how it affects fork time.
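Such a script might look like the sketch below. It is Unix-only, the names `fork_latency_us` and `compare` are made up, and the default list size is an arbitrary choice; raise it to make the effect easier to see.

```python
import os
import time


def fork_latency_us() -> float:
    """Microseconds for os.fork() to return in the parent."""
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits at once
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)  # reap the child
    return elapsed * 1e6


def compare(n_items: int = 2_000_000) -> tuple[float, float]:
    """Fork latency before and after allocating a large list."""
    before = fork_latency_us()
    ballast = list(range(n_items))  # more pages whose table entries fork() must copy
    after = fork_latency_us()
    del ballast
    return before, after


if __name__ == "__main__":
    small, large = compare()
    print(f"small heap: {small:.0f} µs, large heap: {large:.0f} µs")
```

Single measurements are noisy, so repeat the run a few times before drawing conclusions.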

replies(1): >>44011430 #
16. kstrauser ◴[] No.44010564[source]
The interpreter itself is pretty quick:

  ᐅ time echo "print('hi'); exit()" | python
  hi
  
  ________________________________________________________
  Executed in   21.48 millis    fish           external
     usr time   16.35 millis  146.00 micros   16.20 millis
     sys time    4.49 millis  593.00 micros    3.89 millis
replies(1): >>44011700 #
17. kstrauser ◴[] No.44010625{3}[source]
I really wish Python had a way to annotate things you don't care about cleaning up. I don't know what the API would look like, but I imagine something like:

  l = list(cleanup=False)
  for i in range(1_000_000_000): l.append(i)
telling the runtime that we don't need to individually GC each of those tiny objects and just let the OS's process model free the whole thing at once.

Sure, close TCP connections before you kill the whole thing. I couldn't care less about most objects, though.

replies(4): >>44010738 #>>44010861 #>>44010911 #>>44012256 #
18. zahlman ◴[] No.44010738{4}[source]
You'd presumably need to do something involving weakrefs, since it would be really bad if you told Python that the elements can be GCd at all (never mind whether it can be done all at once) but someone else had a reference.

Or completely rearchitect the language to have a model of automatic (in the C sense) allocation. I can't see that ever happening.

replies(1): >>44010924 #
19. duped ◴[] No.44010861{4}[source]
Tbh if you're optimizing python code you've already lost
replies(2): >>44010925 #>>44011731 #
20. Izkata ◴[] No.44010911{4}[source]
There's already a global:

  import gc
  gc.disable()
So I imagine putting more in there to remove objects from the tracking.
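Worth noting when reading the snippet above: `gc.disable()` only switches off the cyclic collector, and reference counting still frees objects one at a time. The closest existing knob to removing objects from tracking is probably `gc.freeze()`, which moves everything currently tracked into a permanent generation that later collections ignore. A sketch, with `load_rows` as a hypothetical loader:

```python
import gc


def load_rows(n: int) -> list[tuple[int, str]]:
    """Build a large structure without cyclic-GC passes, then freeze it."""
    gc.disable()  # no automatic cyclic collections while allocating
    try:
        rows = [(i, str(i)) for i in range(n)]
    finally:
        gc.enable()
    gc.freeze()  # objects tracked so far are ignored by future collections
    return rows


if __name__ == "__main__":
    rows = load_rows(1_000_000)
    print(len(rows), "rows;", gc.get_freeze_count(), "objects frozen")
```

Neither call changes how long reference-counted teardown takes at exit.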
replies(1): >>44010998 #
21. kstrauser ◴[] No.44010924{5}[source]
I don't think either of those are true. I'm not arguing against cleaning up objects during the normal runtime. What I'd like is something that would avoid GC'ing objects one-at-a-time at program shutdown.

I've had cases where it took Python like 30 seconds to exit after I'd slurped a large CSV with a zillion rows into RAM. At that time, I'd dreamed of a way to tell Python not to bother free()ing any of that, just exit() and let Linux unmap RAM all at once. If you think about it, there probably aren't that many resources you actually care about individually freeing on exit. I'm certain someone will prove me wrong, but at a first pass, objects that don't define __del__ or __exit__ probably don't care how you destroy them.

replies(1): >>44011521 #
22. kstrauser ◴[] No.44010925{5}[source]
Run along.
23. kstrauser ◴[] No.44010998{5}[source]
That can go a long way, so long as you remember to manually GC the handful of things you do care about.
replies(2): >>44013519 #>>44016081 #
24. charleshn ◴[] No.44011430{4}[source]
See e.g. https://www.alibabacloud.com/blog/async-fork-mitigating-quer...
25. zahlman ◴[] No.44011521{6}[source]
Ah.

I imagine the problem is that `__del__` could be monkeypatched, so Python doesn't strictly know what needs custom finalization until that moment.

But if you have a concrete proposal, it's likely worth shopping around at https://discuss.python.org/c/ideas/6 or https://github.com/python/cpython/issues/ .

replies(1): >>44011632 #
26. kstrauser ◴[] No.44011632{7}[source]
I might do that. It’s nothing I’ve thought about in depth, just an occasionally recurring idea that bugs me every now and then.
27. ori_b ◴[] No.44011700{3}[source]
My machine is marginally faster for that; I get about 17ms doing that with python, without the print:

    time echo "exit()" | python3

    real    0m0.017s
    user    0m0.014s
    sys     0m0.003s
That's... still pretty slow. Here's a C program, run 100 times:

    range=`seq 100`
    time for i in $range; do ./a.out; done

    real    0m0.038s
    user    0m0.024s
    sys     0m0.016s
And finally, for comparison on Plan 9:

   range=`{seq 2000}
   time rc -c 'for(s in $range){ ./6.nop }'

   0.01u 0.09s 0.16r   rc -c for(s in $range){ ./6.nop }
the C program used was simply:

   int main(void) { return 0; }
Of course, the more real work you do in the program, the less it matters -- but there's a hell of a lot you can do in the time it takes Python to launch.
28. kragen ◴[] No.44011731{5}[source]
On a 64-core machine, Python code that uses all the cores will be modestly faster than single-threaded C, even if all the inner loops are in Python. If you can move the inner loops to C, for example with Numpy, you can do much better still. (Python is still harder to get right than something like C or OCaml, of course, especially for larger programs, but often the smaller amount of code and quicker feedback loop can compensate for that.)
replies(1): >>44030303 #
29. kragen ◴[] No.44011765[source]
On my cellphone forkovh is 700μs and forkovh.py is 3700μs. Qualcomm SDM630. All the forkovh numbers are with 102400 bytes of data.
30. codethief ◴[] No.44011993[source]
> Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter

…which can also be a great source of subtle bugs if you're writing a cross-platform application.

31. Too ◴[] No.44012256{4}[source]
Never experienced this. If this is truly a problem, here is a sledgehammer, just beware it will not close your tcp connections gracefully: os.kill(os.getpid(), signal.SIGKILL).
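A slightly gentler variant of the same sledgehammer is `os._exit()`, which ends the process without running atexit handlers or interpreter finalization. The sketch below tries it in a forked child so the calling process survives; both function names are hypothetical, and the same caveat applies: anything you care about, such as sockets or unflushed files, has to be handled by hand first.

```python
import os
import sys


def exit_skipping_teardown(status: int = 0) -> None:
    """Flush stdio, then leave without running interpreter finalization."""
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(status)  # no atexit handlers, no finally blocks, no per-object frees


def run_in_child(status: int = 0, rows: int = 1_000_000) -> int:
    """Fork, build a large list in the child, exit abruptly; return its exit code."""
    sys.stdout.flush()  # avoid duplicating buffered output in the child
    pid = os.fork()
    if pid == 0:
        data = [str(i) for i in range(rows)]  # stand-in for a big in-memory dataset
        exit_skipping_teardown(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)


if __name__ == "__main__":
    print("child exit code:", run_in_child(3))
```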
32. Too ◴[] No.44012318[source]
In Python, if you are spawning processes or even threads in a tight loop you have already lost. Use ThreadPoolExecutor or ProcessPoolExecutor from concurrent.futures instead. Then startup time becomes no factor.
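A minimal sketch of that pattern, assuming a CPU-bound function that can be pickled. `square` and `run_all` are placeholder names; the pool classes and `map` are from `concurrent.futures`.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square(n: int) -> int:
    return n * n


def run_all(executor_cls, items, workers: int = 4) -> list[int]:
    """Start the workers once, then reuse them for every item."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(square, items))


if __name__ == "__main__":
    # Worker startup is paid once per pool, not once per task.
    print(run_all(ProcessPoolExecutor, range(10)))
    print(run_all(ThreadPoolExecutor, range(10)))
```

For many tiny tasks, the `chunksize` argument of `ProcessPoolExecutor.map` also cuts per-item communication overhead.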
33. saagarjha ◴[] No.44012374{3}[source]
Yes, this is why using shell scripts on macOS is miserable
34. MonkeyClub ◴[] No.44013519{6}[source]
And then we're back to manual memory management.

At least the objects get instantiated automatically, and you don't need to malloc() them into existence yourself; I guess that's still something.

35. seunosewa ◴[] No.44013651[source]
You can use a pool of interpreter processes. You don't have to spawn one for each request.
36. bobmcnamara ◴[] No.44013686{3}[source]
Before we had MMUs and CoW, it was awful slow
37. westurner ◴[] No.44016081{6}[source]
Is there a good way to add __del__() methods or to wrap Context Manager __enter__()/__exit__() methods around objects that never needed them because of the gc?

Hadn't seen this:

  import gc
  gc.disable()
Cython has __dealloc__() instead of __del__()?
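On the first question, two standard-library options come to mind: `contextlib.ExitStack` for scoped cleanup and `weakref.finalize` for cleanup tied to an object's lifetime. Neither requires touching the class. A sketch, where `Buffer` is a hypothetical plain class and the log list exists only to make the ordering visible:

```python
import contextlib
import weakref


class Buffer:
    """Hypothetical plain object with no __del__, __enter__, or __exit__."""

    def __init__(self, name: str) -> None:
        self.name = name


def scoped(log: list[str]) -> None:
    # ExitStack gives with-statement cleanup to an object that is not a context manager.
    with contextlib.ExitStack() as stack:
        buf = Buffer("scoped")
        stack.callback(log.append, f"released {buf.name}")
        log.append(f"using {buf.name}")


def finalized(log: list[str]) -> None:
    # weakref.finalize runs the callback when the object is collected (or at exit)
    # without defining __del__. The callback must not reference the object itself.
    buf = Buffer("finalized")
    weakref.finalize(buf, log.append, f"released {buf.name}")
    del buf  # on CPython the refcount reaches zero here, so the callback runs now


if __name__ == "__main__":
    events: list[str] = []
    scoped(events)
    finalized(events)
    print(events)
```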
replies(1): >>44016148 #
38. westurner ◴[] No.44016148{7}[source]
Also, there's a recent proposal to add explicit resource management to JS: "JavaScript's New Superpower: Explicit Resource Management" https://news.ycombinator.com/item?id=44012227
39. duped ◴[] No.44030303{6}[source]
I strongly doubt this claim. Python is more than 64x slower than C without synchronization overhead in most numeric tasks; with synchronization overhead on those processes it should be much worse.

Python is so much slower than any native or JIT compiled language that it begets things like numpy in the first place.

replies(1): >>44075783 #
40. kragen ◴[] No.44075783{7}[source]
My typical experience is about 40×.