253 points rbanffy | 30 comments
sgarland ◴[] No.44004897[source]
> Instead, many reach for multiprocessing, but spawning processes is expensive

Agreed.

> and communicating across processes often requires making expensive copies of data

SharedMemory [0] exists. Never understood why this isn’t used more frequently. There’s even a ShareableList which does exactly what it sounds like, and is awesome.

[0]: https://docs.python.org/3/library/multiprocessing.shared_mem...

replies(7): >>44004956 #>>44005006 #>>44006103 #>>44006145 #>>44006664 #>>44006670 #>>44007267 #
1. chubot ◴[] No.44006145[source]
Spawning processes generally takes much less than 1 ms on Unix

Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports

It's 1 to 2 orders of magnitude difference, so it's worth being precise

This is a common fallacy with, say, CGI. A CGI program in C, Rust, or Go works perfectly well.

e.g. sqlite.org runs with a process PER REQUEST - https://news.ycombinator.com/item?id=3036124

replies(7): >>44006287 #>>44007950 #>>44008877 #>>44009754 #>>44009755 #>>44009805 #>>44010011 #
2. Sharlin ◴[] No.44006287[source]
Unix is not the only platform, though (and is process creation fast on all Unices, or just Linux?). The point about interpreter init overhead is, of course, apt.
replies(1): >>44007828 #
3. btilly ◴[] No.44007828[source]
Process creation should be fast on all Unices. If it isn't, then the lowly shell script (heavily used in Unix) is going to perform very poorly.
replies(2): >>44009682 #>>44010066 #
4. charleshn ◴[] No.44007950[source]
> Spawning processes generally takes much less than 1 ms on Unix

It depends on whether one uses clone, fork, posix_spawn etc.

Fork can take a while depending on the size of the address space, number of VMAs etc.

replies(2): >>44009524 #>>44009676 #
5. LPisGood ◴[] No.44008877[source]
My understanding is that spawning a thread takes just a few microseconds, so whether you're comparing against a plain process or a Python interpreter process, there are still orders of magnitude to be gained.
6. crackez ◴[] No.44009524[source]
Fork on Linux should use copy-on-write VM pages now, so if you fork inside Python it should be cheap. If you launch a new Python process from, let's say, the shell, and it's already in the buffer cache, then you should only have to pay the startup CPU cost of the interpreter, since the IO should be satisfied from the buffer cache...
replies(1): >>44010504 #
7. knome ◴[] No.44009676[source]
For glibc on Linux, fork just calls clone, as does posix_spawn (using the flag CLONE_VFORK).
8. kragen ◴[] No.44009682{3}[source]
While I think you've been using Unix longer than I have, shell scripts are known for performing very poorly, and on PDP-11 Unix (where perhaps shell scripts were most heavily used, since Perl didn't exist yet) fork() couldn't even do copy-on-write; it had to literally copy the process's entire data segment, which in most cases also contained a copy of its code. Moving to paged machines like the VAX and especially the 68000 family made it possible to use copy-on-write, but historically speaking, Linux has often been an order of magnitude faster than most other Unices at fork(). However, I think people mostly don't use those Unices anymore. I imagine the BSDs have pretty much caught up by now.

https://news.ycombinator.com/item?id=44009754 gives some concrete details on fork() speed on current Linux: 50μs for a small process, 700μs for a regular process, 1300μs for a venti Python interpreter process, 30000–50000μs for Python interpreter creation. This is on a CPU of about 10 billion instructions per second per core, so forking costs on the order of ½–10 million instructions.

9. kragen ◴[] No.44009754[source]
To be concrete about this, http://canonical.org/~kragen/sw/dev3/forkovh.c took 670μs to fork, exit, and wait on the first laptop I tried it on, but only 130μs compiled with dietlibc instead of glibc; on a 2.3 GHz E5-2697 Xeon it took 130μs even compiled with glibc.

httpdito http://canonical.org/~kragen/sw/dev3/server.s (which launches a process per request) seems to take only about 50μs because it's not linked with any C library and therefore only maps 5 pages. Also, that doesn't include the time for exit() because it runs multiple concurrent child processes.

On this laptop, a Ryzen 5 3500U running at 2.9GHz, forkovh takes about 330μs built with glibc and about 130–140μs built with dietlibc, and `time python3 -c True` takes about 30000–50000μs. I wrote a Python version of forkovh http://canonical.org/~kragen/sw/dev3/forkovh.py and it takes about 1200μs to fork(), _exit(), and wait().

If anyone else wants to clone that repo and test their own machines, I'm interested to hear the results, especially if they aren't in Linux. `make forkovh` will compile the C version.

1200μs is pretty expensive in some contexts but not others. Certainly it's cheaper than spawning a new Python interpreter by more than an order of magnitude.

replies(1): >>44011765 #
10. jaoane ◴[] No.44009755[source]
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports

That's lucky. On constrained systems launching a new interpreter can very well take 10 seconds. Python is ssssslllloooowwwww.

11. morningsam ◴[] No.44009805[source]
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms

Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter, which takes low-single-digit ms as well.

replies(2): >>44010280 #>>44011993 #
12. ori_b ◴[] No.44010011[source]
As another example: I run https://shithub.us with shell scripts, serving a terabyte or so of data monthly (mostly due to AI crawlers that I can't be arsed to block).

I'm launching between 15 and 3000 processes per request. While Plan 9 is about 10x faster at spawning processes than Linux, it's telling that 3000 C processes launching in a shell is about as fast as one python interpreter.

replies(1): >>44010564 #
13. fredoralive ◴[] No.44010066{3}[source]
Python runs on other operating systems, like NT, where AIUI processes are rather more heavyweight.

Not all use cases of Python and Windows intersect (how much web server stuff is a Windows / IIS / SQL Server / Python stack? Probably not many, although WISP is a nice acronym), but you’ve still got to bear it in mind for people doing heavy numpy stuff on their work laptop or whatever.

14. zahlman ◴[] No.44010280[source]
Even when the 'spawn' strategy is used (default on Windows, and can be chosen explicitly on Linux), the overhead can largely be avoided. (Why choose it on Linux? Apparently forking can cause problems if you also use threads.) Python imports can be deferred (`import` is a statement, not a compiler or pre-processor directive), and child processes (regardless of the creation strategy) name the main module as `__mp_main__` rather than `__main__`, allowing the programmer to distinguish. (Being able to distinguish is of course necessary here, to avoid making a fork bomb - since the top-level code runs automatically and `if __name__ == '__main__':` is normally top-level code.)

But also keep in mind that cleanup for a Python process also takes time, which is harder to trace.

Refs:

https://docs.python.org/3/library/multiprocessing.html#conte... https://stackoverflow.com/questions/72497140

replies(1): >>44010625 #
15. charleshn ◴[] No.44010504{3}[source]
> Fork on Linux should use copy-on-write vmpages now, so if you fork inside python it should be cheap.

No, that's exactly the point I'm making: copying PTEs is not cheap on a large address space with many VMAs.

You can run a simple python script allocating a large list and see how it affects fork time.

replies(1): >>44011430 #
16. kstrauser ◴[] No.44010564[source]
The interpreter itself is pretty quick:

  ᐅ time echo "print('hi'); exit()" | python
  hi
  
  ________________________________________________________
  Executed in   21.48 millis    fish           external
     usr time   16.35 millis  146.00 micros   16.20 millis
     sys time    4.49 millis  593.00 micros    3.89 millis
replies(1): >>44011700 #
17. kstrauser ◴[] No.44010625{3}[source]
I really wish Python had a way to annotate things you don't care about cleaning up. I don't know what the API would look like, but I imagine something like:

  l = list(cleanup=False)
  for i in range(1_000_000_000): l.append(i)
telling the runtime that we don't need to individually GC each of those tiny objects and just let the OS's process model free the whole thing at once.

Sure, close TCP connections before you kill the whole thing. I couldn't care less about most objects, though.

replies(3): >>44010738 #>>44010861 #>>44010911 #
18. zahlman ◴[] No.44010738{4}[source]
You'd presumably need to do something involving weakrefs, since it would be really bad if you told Python that the elements can be GCd at all (never mind whether it can be done all at once) but someone else had a reference.

Or completely rearchitect the language to have a model of automatic (in the C sense) allocation. I can't see that ever happening.

replies(1): >>44010924 #
19. duped ◴[] No.44010861{4}[source]
Tbh if you're optimizing python code you've already lost
replies(2): >>44010925 #>>44011731 #
20. Izkata ◴[] No.44010911{4}[source]
There's already a global:

  import gc
  gc.disable()
So I imagine putting more in there to remove objects from the tracking.
replies(1): >>44010998 #
21. kstrauser ◴[] No.44010924{5}[source]
I don't think either of those are true. I'm not arguing against cleaning up objects during the normal runtime. What I'd like is something that would avoid GC'ing objects one-at-a-time at program shutdown.

I've had cases where it took Python like 30 seconds to exit after I'd slurped a large CSV with a zillion rows into RAM. At that time, I'd dreamed of a way to tell Python not to bother free()ing any of that, just exit() and let Linux unmap RAM all at once. If you think about it, there probably aren't that many resources you actually care about individually freeing on exit. I'm certain someone will prove me wrong, but at a first pass, objects that don't define __del__ or __exit__ probably don't care how you destroy them.
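The blunt instrument for exactly that case already exists: os._exit() skips interpreter finalization entirely, so nothing is freed object by object (a sketch; flush anything you care about first, since buffers and atexit handlers are skipped too):

```python
# Slurp a big structure, then leave via os._exit(): the kernel unmaps
# everything at once instead of Python free()ing object by object.
import os, sys

data = list(range(1_000_000))  # stand-in for the zillion-row CSV
sys.stdout.write("done\n")
sys.stdout.flush()             # os._exit() won't flush for you
os._exit(0)                    # no GC sweep, no atexit, no __del__
```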

replies(1): >>44011521 #
22. kstrauser ◴[] No.44010925{5}[source]
Run along.
23. kstrauser ◴[] No.44010998{5}[source]
That can go a long way, so long as you remember to manually GC the handful of things you do care about.
24. charleshn ◴[] No.44011430{4}[source]
See e.g. https://www.alibabacloud.com/blog/async-fork-mitigating-quer...
25. zahlman ◴[] No.44011521{6}[source]
Ah.

I imagine the problem is that `__del__` could be monkeypatched, so Python doesn't strictly know what needs custom finalization until that moment.

But if you have a concrete proposal, it's likely worth shopping around at https://discuss.python.org/c/ideas/6 or https://github.com/python/cpython/issues/ .

replies(1): >>44011632 #
26. kstrauser ◴[] No.44011632{7}[source]
I might do that. It’s nothing I’ve thought about in depth, just an occasionally recurring idea that bugs me every now and then.
27. ori_b ◴[] No.44011700{3}[source]
My machine is marginally faster for that; I get about 17ms doing that with python, without the print:

    time echo "exit()" | python3

    real    0m0.017s
    user    0m0.014s
    sys     0m0.003s
That's... still pretty slow. Here's a C program, run 100 times:

    range=`seq 100`
    time for i in $range; do ./a.out; done

    real    0m0.038s
    user    0m0.024s
    sys     0m0.016s
And finally, for comparison on Plan 9:

   range=`{seq 2000}
   time rc -c 'for(s in $range){ ./6.nop }'

   0.01u 0.09s 0.16r   rc -c for(s in $range){ ./6.nop }
the C program used was simply:

   int main(void) { return 0; }
Of course, the more real work you do in the program, the less it matters -- but there's a hell of a lot you can do in the time it takes Python to launch.
28. kragen ◴[] No.44011731{5}[source]
On a 64-core machine, Python code that uses all the cores will be modestly faster than single-threaded C, even if all the inner loops are in Python. If you can move the inner loops to C, for example with Numpy, you can do much better still. (Python is still harder to get right than something like C or OCaml, of course, especially for larger programs, but often the smaller amount of code and quicker feedback loop can compensate for that.)
29. kragen ◴[] No.44011765[source]
On my cellphone forkovh is 700μs and forkovh.py is 3700μs. Qualcomm SDM630. All the forkovh numbers are with 102400 bytes of data.
30. codethief ◴[] No.44011993[source]
> Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter

…which can also be a great source of subtle bugs if you're writing a cross-platform application.