Most active commenters
  • kstrauser(6)
  • frollogaston(5)
  • tomrod(4)
  • kragen(4)
  • zahlman(4)
  • sgarland(3)
  • tinix(3)
  • charleshn(3)
  • notpushkin(3)


256 points by rbanffy | 67 comments
1. sgarland ◴[] No.44004897[source]
> Instead, many reach for multiprocessing, but spawning processes is expensive

Agreed.

> and communicating across processes often requires making expensive copies of data

SharedMemory [0] exists. Never understood why this isn’t used more frequently. There’s even a ShareableList which does exactly what it sounds like, and is awesome.

[0]: https://docs.python.org/3/library/multiprocessing.shared_mem...
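
For what it's worth, a minimal sketch of the pattern (illustrative only, not taken from the docs): a ShareableList is created in the parent and attached to by name in the child, so no copy of the data is made.

    from multiprocessing import Process
    from multiprocessing.shared_memory import ShareableList

    def worker(name):
        sl = ShareableList(name=name)   # attach to the existing segment; no copy
        sl[0] += 1                      # visible to the parent immediately
        sl.shm.close()

    if __name__ == "__main__":
        sl = ShareableList([0, "hello", 3.14])
        p = Process(target=worker, args=(sl.shm.name,))
        p.start(); p.join()
        print(sl[0])                    # 1
        sl.shm.close()
        sl.shm.unlink()                 # free the segment once everyone is done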

replies(7): >>44004956 #>>44005006 #>>44006103 #>>44006145 #>>44006664 #>>44006670 #>>44007267 #
2. ogrisel ◴[] No.44005006[source]
You cannot share arbitrarily structured objects in the `ShareableList`, only atomic scalars and bytes / strings.

If you want to share structured Python objects between processes, you have to pay the cost of `pickle.dumps`/`pickle.loads` (CPU overhead for interprocess communication) + the memory cost of replicated objects in the processes.
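
To make the cost concrete: anything handed to a worker via multiprocessing is pickled on one side and unpickled on the other, so every process ends up with its own copy. A rough sketch of what happens under the hood:

    import pickle
    from multiprocessing import Pool

    def touch(obj):
        return len(obj["rows"])        # `obj` is a rebuilt copy inside the worker

    if __name__ == "__main__":
        data = {"rows": list(range(1_000_000))}
        # roughly what multiprocessing does for each transfer:
        blob = pickle.dumps(data)      # CPU cost of serializing
        copy = pickle.loads(blob)      # plus a full second copy in memory
        with Pool(2) as pool:
            print(pool.map(touch, [data, data]))   # pickled once per task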

replies(3): >>44006004 #>>44008341 #>>44010473 #
3. tomrod ◴[] No.44006004[source]
I can fit a lot of json into bytes/strings though?
replies(4): >>44006041 #>>44006052 #>>44007146 #>>44008154 #
4. cjbgkagh ◴[] No.44006041{3}[source]
Perhaps flatbuffers would be better?
replies(2): >>44006072 #>>44007279 #
5. vlovich123 ◴[] No.44006052{3}[source]
That’s even worse than pickle.
replies(1): >>44006078 #
6. tomrod ◴[] No.44006072{4}[source]
I love learning from folks on HN -- thanks! Will check it out.
replies(1): >>44008294 #
7. tomrod ◴[] No.44006078{4}[source]
pickle pickles to pickle binary, yeah? So can stream that too with an io Buffer :D
8. modeless ◴[] No.44006103[source]
Yeah I've had great success sharing numpy arrays this way. Explicit sharing is not a huge burden, especially when compared with the difficulty of debugging problems that occur when you accidentally share things between threads. People vastly overstate the benefit of threads over multiprocessing and I don't look forward to all the random segfaults I'm going to have to debug after people start routinely disabling the GIL in a library ecosystem that isn't ready.

I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.
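
For reference, the explicit-sharing pattern being described looks roughly like this (a sketch assuming numpy is installed; names are illustrative):

    import numpy as np
    from multiprocessing import Process
    from multiprocessing.shared_memory import SharedMemory

    def worker(shm_name, shape, dtype):
        shm = SharedMemory(name=shm_name)
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)   # a view, not a copy
        arr *= 2                                               # the parent sees this
        shm.close()

    if __name__ == "__main__":
        shm = SharedMemory(create=True, size=8 * 1_000_000)
        a = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm.buf)
        a[:] = 1.0
        p = Process(target=worker, args=(shm.name, a.shape, a.dtype))
        p.start(); p.join()
        print(a[:3])             # [2. 2. 2.]
        shm.close(); shm.unlink()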

replies(5): >>44007514 #>>44007861 #>>44010315 #>>44010354 #>>44011843 #
9. chubot ◴[] No.44006145[source]
Spawning processes generally takes much less than 1 ms on Unix

Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports

It's 1 to 2 orders of magnitude difference, so it's worth being precise

This conflation is the fallacy behind, say, dismissing CGI. A CGI program in C, Rust, or Go works perfectly well.

e.g. sqlite.org runs with a process PER REQUEST - https://news.ycombinator.com/item?id=3036124
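
A rough way to see the two numbers side by side on a Unix machine (a sketch; absolute figures vary a lot by hardware and by how many imports the interpreter pulls in):

    import os, subprocess, sys, time

    def time_fork(n=100):
        start = time.perf_counter()
        for _ in range(n):
            pid = os.fork()
            if pid == 0:
                os._exit(0)             # child exits immediately
            os.waitpid(pid, 0)
        return (time.perf_counter() - start) / n

    def time_interpreter(n=10):
        start = time.perf_counter()
        for _ in range(n):
            subprocess.run([sys.executable, "-c", "pass"], check=True)
        return (time.perf_counter() - start) / n

    if __name__ == "__main__":
        print(f"fork+wait:         {time_fork() * 1e3:.2f} ms")
        print(f"fresh interpreter: {time_interpreter() * 1e3:.2f} ms")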

replies(8): >>44006287 #>>44007950 #>>44008877 #>>44009754 #>>44009755 #>>44009805 #>>44010011 #>>44012318 #
10. Sharlin ◴[] No.44006287[source]
Unix is not the only platform though (and is process creation fast on all Unices or just Linux?) The point about interpreter init overhead is, of course, apt.
replies(1): >>44007828 #
11. isignal ◴[] No.44006670[source]
Processes can die independently, so if a process dies while modifying a concurrent shared-memory data structure under a lock, the resulting state can be difficult to manage. Postgres, which uses shared-memory data structures, sometimes needs to kill all its backend processes because it cannot fully recover from such a state.

In contrast, no one thinks about what happens if a thread dies independently because the failure mode is joint.

replies(2): >>44007457 #>>44007632 #
12. frollogaston ◴[] No.44007146{3}[source]
If all your state is already json-serializable, yeah. But that's just as expensive as copying if not more, hence what cjbgkagh said about flatbuffers.
replies(1): >>44010306 #
13. tinix ◴[] No.44007267[source]
shared memory only works on dedicated hardware.

if you're running in something like AWS fargate, there is no shared memory. have to use the network and file system which adds a lot of latency, way more than spawning a process.

copying processes through fork is a whole different problem.

green threads and an actor model will get you much further in my experience.

replies(2): >>44008492 #>>44010425 #
14. tinix ◴[] No.44007279{4}[source]
let me introduce you to quickle.
15. wongarsu ◴[] No.44007457[source]
> In contrast, no one thinks about what happens if a thread dies independently because the failure mode is joint.

In Rust, if a thread holding a mutex dies, the mutex becomes poisoned, and trying to acquire it leads to an error that has to be handled. As a consequence, every Rust developer who touches a mutex has to think about that failure mode, even if in 95% of cases the best answer is "let's exit when that happens".

The operating system tends to treat your whole process as one unit and shut down everything or nothing. But a thread can still crash on its own due to an unhandled OOM, assertion failures, or any number of other issues.

replies(1): >>44008025 #
16. dhruvrajvanshi ◴[] No.44007514[source]
> I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.

This is a fair observation.

I think part of the problem is that the things that make GIL-less Python hard are also the things that make faster baseline performance hard, i.e. an over-reliance of the ecosystem on the shape of the CPython data structures.

What makes Python different is that a large percentage of Python code isn't Python, but C code targeting the CPython API. This isn't true for a lot of other interpreted languages.

17. jcalvinowens ◴[] No.44007632[source]
This is a solvable problem though, the literature is overflowing with lock-free implementations of common data structures. The real question is how much performance you have to sacrifice for the guarantee...
18. btilly ◴[] No.44007828{3}[source]
Process creation should be fast on all Unices. If it isn't, then the lowly shell script (heavily used in Unix) is going to perform very poorly.
replies(2): >>44009682 #>>44010066 #
19. com2kid ◴[] No.44007861[source]
> I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.

Nobody sane tries to do math in JS. Backend JS is recommended for situations where processing is minimal and it is mostly lots of tiny IO requests that need to be shunted around.

I'm a huge JS/Node proponent and if someone says they need to write a backend service that crunches a lot of numbers, I'll recommend choosing a different technology!

For some reason Python peeps keep trying to do actual computations in Python...

replies(1): >>44010362 #
20. charleshn ◴[] No.44007950[source]
> Spawning processes generally takes much less than 1 ms on Unix

It depends on whether one uses clone, fork, posix_spawn etc.

Fork can take a while depending on the size of the address space, number of VMAs etc.

replies(2): >>44009524 #>>44009676 #
21. jcalvinowens ◴[] No.44008025{3}[source]
> But a thread can still crash in its own due to unhandled oom, assertion failures or any number of other issues

That's not really true on POSIX. Unless you're doing nutty things with clone(), or you actually have explicit code that calls pthread_exit() or gettid()/pthread_kill(), the whole process is always going to die at the same time.

POSIX signal dispositions are process-wide, the only way e.g. SIGSEGV kills a single thread is if you write an explicit handler which actually does that by hand. Unhandled exceptions usually SIGABRT, which works the same way.

** Just to expand a bit: there is a subtlety in that, while dispositions are process-wide, one individual thread does indeed take the signal. If the signal is handled, only that thread sees -EINTR from a blocking syscall; but if the signal is not handled, the default disposition affects all threads in the process simultaneously no matter which thread is actually signalled.

replies(1): >>44009232 #
22. reliabilityguy ◴[] No.44008154{3}[source]
What’s the point? The whole idea is to share objects, not to serialize them, whether it’s JSON, pickle, or whatever.
replies(1): >>44008726 #
23. notpushkin ◴[] No.44008294{5}[source]
Take a look at https://capnproto.org/ as well, while at it.

Neither solve the copying problem, though.

replies(2): >>44010278 #>>44010304 #
24. notpushkin ◴[] No.44008341[source]
We need a dataclass-like interface on top of a ShareableList.
replies(1): >>44010745 #
25. bradleybuda ◴[] No.44008492[source]
Fargate is just a container runtime. You can fork processes and share memory like you can in any other Linux environment. You may not want to (because you are running many cheap / small containers) but if your Fargate containers are running 0.25 vCPUs then you probably don't want traditional multiprocessing or multithreading...
replies(1): >>44010396 #
26. tomrod ◴[] No.44008726{4}[source]
I mean, the answer to this is pretty straightforward -- because we can, not because we should :)
27. LPisGood ◴[] No.44008877[source]
My understanding is that spawning a thread takes just a few microseconds, so whether you're talking about a plain process or a Python interpreter process, there are still orders of magnitude to be gained.
28. wahern ◴[] No.44009232{4}[source]
It would be nice if someday we got per-thread signal handlers to complement per-thread signal masking and per-thread alternate signal stacks.
29. crackez ◴[] No.44009524{3}[source]
Fork on Linux should use copy-on-write VM pages now, so if you fork inside Python it should be cheap. If you launch a new Python process from, let's say, the shell, and it's already in the buffer cache, then you should only have to pay the startup CPU cost of the interpreter, since the IO should be satisfied from the buffer cache...
replies(1): >>44010504 #
30. knome ◴[] No.44009676{3}[source]
For glibc on Linux, fork just calls clone, as does posix_spawn (using the flag CLONE_VFORK).
31. kragen ◴[] No.44009682{4}[source]
While I think you've been using Unix longer than I have, shell scripts are known for performing very poorly, and on PDP-11 Unix (where perhaps shell scripts were most heavily used, since Perl didn't exist yet) fork() couldn't even do copy-on-write; it had to literally copy the process's entire data segment, which in most cases also contained a copy of its code. Moving to paged machines like the VAX and especially the 68000 family made it possible to use copy-on-write, but historically speaking, Linux has often been an order of magnitude faster than most other Unices at fork(). However, I think people mostly don't use those Unices anymore. I imagine the BSDs have pretty much caught up by now.

https://news.ycombinator.com/item?id=44009754 gives some concrete details on fork() speed on current Linux: 50μs for a small process, 700μs for a regular process, 1300μs for a venti Python interpreter process, 30000–50000μs for Python interpreter creation. This is on a CPU of about 10 billion instructions per second per core, so forking costs on the order of ½–10 million instructions.

32. kragen ◴[] No.44009754[source]
To be concrete about this, http://canonical.org/~kragen/sw/dev3/forkovh.c took 670μs to fork, exit, and wait on the first laptop I tried it on when compiled with glibc, but only 130μs compiled with dietlibc instead of glibc; on a 2.3 GHz E5-2697 Xeon it took 130μs even with glibc.

httpdito http://canonical.org/~kragen/sw/dev3/server.s (which launches a process per request) seems to take only about 50μs because it's not linked with any C library and therefore only maps 5 pages. Also, that doesn't include the time for exit() because it runs multiple concurrent child processes.

On this laptop, a Ryzen 5 3500U running at 2.9GHz, forkovh takes about 330μs built with glibc and about 130–140μs built with dietlibc, and `time python3 -c True` takes about 30000–50000μs. I wrote a Python version of forkovh http://canonical.org/~kragen/sw/dev3/forkovh.py and it takes about 1200μs to fork(), _exit(), and wait().

If anyone else wants to clone that repo and test their own machines, I'm interested to hear the results, especially if they aren't in Linux. `make forkovh` will compile the C version.

1200μs is pretty expensive in some contexts but not others. Certainly it's cheaper than spawning a new Python interpreter by more than an order of magnitude.

replies(1): >>44011765 #
33. jaoane ◴[] No.44009755[source]
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports

That's lucky. On constrained systems launching a new interpreter can very well take 10 seconds. Python is ssssslllloooowwwww.

34. morningsam ◴[] No.44009805[source]
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms

Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter, which takes low-single-digit ms as well.
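
The start method is selectable per program, so the difference is easy to see or force explicitly; a small sketch:

    import multiprocessing as mp

    def work(x):
        return x * x

    if __name__ == "__main__":
        # 'fork' copies the already-initialized interpreter (cheap; not on Windows);
        # 'spawn' starts a fresh interpreter and re-imports the main module.
        ctx = mp.get_context("fork")
        with ctx.Pool(4) as pool:
            print(pool.map(work, range(8)))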

replies(2): >>44010280 #>>44011993 #
35. ori_b ◴[] No.44010011[source]
As another example: I run https://shithub.us with shell scripts, serving a terabyte or so of data monthly (mostly due to AI crawlers that I can't be arsed to block).

I'm launching between 15 and 3000 processes per request. While Plan 9 is about 10x faster at spawning processes than Linux, it's telling that 3000 C processes launching in a shell is about as fast as one Python interpreter.

replies(1): >>44010564 #
36. fredoralive ◴[] No.44010066{4}[source]
Python runs on other operating systems, like NT, where AIUI processes are rather more heavyweight.

Not all use cases of Python and Windows intersect (how much web server stuff is a Windows / IIS / SQL Server / Python stack? Probably not many, although WISP is a nice acronym), but you’ve still got to bear it in mind for people doing heavy numpy stuff on their work laptop or whatever.

37. ◴[] No.44010278{6}[source]
38. zahlman ◴[] No.44010280{3}[source]
Even when the 'spawn' strategy is used (default on Windows, and can be chosen explicitly on Linux), the overhead can largely be avoided. (Why choose it on Linux? Apparently forking can cause problems if you also use threads.) Python imports can be deferred (`import` is a statement, not a compiler or pre-processor directive), and child processes (regardless of the creation strategy) name the main module as `__mp_main__` rather than `__main__`, allowing the programmer to distinguish. (Being able to distinguish is of course necessary here, to avoid making a fork bomb - since the top-level code runs automatically and `if __name__ == '__main__':` is normally top-level code.)

But also keep in mind that cleanup for a Python process also takes time, which is harder to trace.

Refs:

https://docs.python.org/3/library/multiprocessing.html#conte... https://stackoverflow.com/questions/72497140
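
Concretely, the spawn-safe pattern being described looks something like this (a sketch; the file paths are placeholders):

    import multiprocessing as mp

    def work(path):
        import json                     # deferred import: paid only where needed
        with open(path) as f:
            return len(json.load(f))

    if __name__ == "__main__":          # children re-import this module as '__mp_main__',
        mp.set_start_method("spawn")    # so this guarded block is not re-run (no fork bomb)
        with mp.Pool(2) as pool:
            print(pool.map(work, ["a.json", "b.json"]))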

replies(1): >>44010625 #
39. frollogaston ◴[] No.44010304{6}[source]
Ah, I forgot capnproto doesn't let you edit a serialized proto in-memory, it's read-only. In theory this should be possible as long as you're not changing the length of anything, but I'm not surprised such trickery is unsupported.

So this doesn't seem like a versatile solution for sharing data structs between two Python processes. You're gonna have to reserialize the whole thing if one side wants to edit, which is basically copying.

40. frollogaston ◴[] No.44010306{4}[source]
oh nvm, that doesn't solve this either
41. zahlman ◴[] No.44010315[source]
> I wish more effort was put into baseline performance for Python.

There has been. That's why the bytecode is incompatible between minor versions. It was a major selling(?) point for 3.11 and 3.12 in particular.

But the "Faster CPython" team at Microsoft was apparently just laid off (https://www.linkedin.com/posts/mdboom_its-been-a-tough-coupl...), and all of the optimization work has to my understanding been based around fairly traditional techniques. The C part of the codebase has decades of legacy to it, after all.

Alternative implementations like PyPy often post impressive results, and are worth checking out if you need to worry about native Python performance. Not to mention the benefits of shifting the work onto compiled code like NumPy, as you already do.

replies(1): >>44011991 #
42. frollogaston ◴[] No.44010354[source]
"I wonder why people never complained so much about JavaScript not having shared-everything threading"

Mainly cause Python is often used for data pipelines in ways that JS isn't, causing situations where you do want to use multiple CPU cores with some shared memory. If you want to use multiple CPU cores in NodeJS, usually it's just a load-balancing webserver without IPC and you just use throng, or maybe you've got microservices.

Also, JS parallelism simply excelled from the start at waiting on tons of IO, there was no confusion about it. Python later got asyncio for this, and by now regular threads have too much momentum. Threads are the worst of both worlds in Py, cause you get the overhead of an OS thread and the possibility of race conditions without the full parallelism it's supposed to buy you. And all this stuff is confusing to users.

43. frollogaston ◴[] No.44010362{3}[source]
Python peeps tend to do heavy numbers calc in numpy, but sometimes you're doing expensive things with dictionaries/lists.
44. tinix ◴[] No.44010396{3}[source]
Go try it and report back.

Fargate isn't just ECS and plain containers.

You cannot use shared memory in fargate, there is literally no /dev/shm.

See "sharedMemorySize" here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/...

> If you're using tasks that use the Fargate launch type, the sharedMemorySize parameter isn't supported.

45. sgarland ◴[] No.44010425[source]
Well don’t use Fargate, there’s your problem. Run programs on actual servers, not magical serverless bullshit.
46. sgarland ◴[] No.44010473[source]
So don’t do that? Send data to workers as primitives, and have a separate process that reads the results and serializes it into whatever form you want.
47. charleshn ◴[] No.44010504{4}[source]
> Fork on Linux should use copy-on-write vmpages now, so if you fork inside python it should be cheap.

No, that's exactly the point I'm making: copying PTEs is not cheap on a large address space with many VMAs.

You can run a simple python script allocating a large list and see how it affects fork time.

replies(1): >>44011430 #
48. kstrauser ◴[] No.44010564{3}[source]
The interpreter itself is pretty quick:

  ᐅ time echo "print('hi'); exit()" | python
  hi
  
  ________________________________________________________
  Executed in   21.48 millis    fish           external
     usr time   16.35 millis  146.00 micros   16.20 millis
     sys time    4.49 millis  593.00 micros    3.89 millis
replies(1): >>44011700 #
49. kstrauser ◴[] No.44010625{4}[source]
I really wish Python had a way to annotate things you don't care about cleaning up. I don't know what the API would look like, but I imagine something like:

  l = list(cleanup=False)
  for i in range(1_000_000_000): l.append(i)
telling the runtime that we don't need to individually GC each of those tiny objects and just let the OS's process model free the whole thing at once.

Sure, close TCP connections before you kill the whole thing. I couldn't care less about most objects, though.

replies(4): >>44010738 #>>44010861 #>>44010911 #>>44012256 #
50. zahlman ◴[] No.44010738{5}[source]
You'd presumably need to do something involving weakrefs, since it would be really bad if you told Python that the elements can be GCd at all (never mind whether it can be done all at once) but someone else had a reference.

Or completely rearchitect the language to have a model of automatic (in the C sense) allocation. I can't see that ever happening.

replies(1): >>44010924 #
51. notpushkin ◴[] No.44010745{3}[source]
Actually, ShareableList feels like a tuple really (as it’s impossible to change its length). If we could mix ShareableList and collections.namedtuple together, it would get us 90% there (99.9% if we use typing.NamedTuple). Unfortunately, I can’t decipher either one [1, 2] at first glance – maybe if I get some more sleep?

[1]: https://github.com/python/cpython/blob/3.13/Lib/collections/...

[2]: https://github.com/python/cpython/blob/3.13/Lib/typing.py#L2...
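
One rough shape such a wrapper could take (an illustrative sketch, not an existing API): map attribute names onto fixed slots of a ShareableList.

    from multiprocessing.shared_memory import ShareableList

    class SharedRecord:
        _fields = ("x", "y", "label")

        def __init__(self, values=None, *, name=None):
            # create a new segment from `values`, or attach to an existing one by name
            self._sl = ShareableList(values) if values is not None else ShareableList(name=name)

        def __getattr__(self, key):
            if key in type(self)._fields:
                return self._sl[type(self)._fields.index(key)]
            raise AttributeError(key)

        def __setattr__(self, key, value):
            if key in type(self)._fields:
                self._sl[type(self)._fields.index(key)] = value
            else:
                super().__setattr__(key, value)

    # parent: rec = SharedRecord([1.0, 2.0, "point A"]); pass rec._sl.shm.name to a child
    # child:  rec = SharedRecord(name=that_name); rec.x = 3.0   # parent sees the update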

52. duped ◴[] No.44010861{5}[source]
Tbh if you're optimizing python code you've already lost
replies(2): >>44010925 #>>44011731 #
53. Izkata ◴[] No.44010911{5}[source]
There's already a global:

  import gc
  gc.disable()
So I imagine putting more in there to remove objects from the tracking.
replies(1): >>44010998 #
54. kstrauser ◴[] No.44010924{6}[source]
I don't think either of those are true. I'm not arguing against cleaning up objects during the normal runtime. What I'd like is something that would avoid GC'ing objects one-at-a-time at program shutdown.

I've had cases where it took Python like 30 seconds to exit after I'd slurped a large CSV with a zillion rows into RAM. At that time, I'd dreamed of a way to tell Python not to bother free()ing any of that, just exit() and let Linux unmap RAM all at once. If you think about it, there probably aren't that many resources you actually care about individually freeing on exit. I'm certain someone will prove me wrong, but at a first pass, objects that don't define __del__ or __exit__ probably don't care how you destroy them.
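
The blunt tool that exists today is os._exit(), which skips interpreter teardown entirely and lets the kernel reclaim memory in one unmap, at the cost of bypassing atexit handlers and buffered-output flushing (a sketch):

    import os, sys

    def main():
        rows = [tuple(range(20)) for _ in range(10_000_000)]   # lots of tiny objects
        # ... do the real work ...
        sys.stdout.flush()      # flush anything you still care about
        os._exit(0)             # exit now; no per-object GC at shutdown

    if __name__ == "__main__":
        main()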

replies(1): >>44011521 #
55. kstrauser ◴[] No.44010925{6}[source]
Run along.
56. kstrauser ◴[] No.44010998{6}[source]
That can go a long way, so long as you remember to manually GC the handful of things you do care about.
57. charleshn ◴[] No.44011430{5}[source]
See e.g. https://www.alibabacloud.com/blog/async-fork-mitigating-quer...
58. zahlman ◴[] No.44011521{7}[source]
Ah.

I imagine the problem is that `__del__` could be monkeypatched, so Python doesn't strictly know what needs custom finalization until that moment.

But if you have a concrete proposal, it's likely worth shopping around at https://discuss.python.org/c/ideas/6 or https://github.com/python/cpython/issues/ .

replies(1): >>44011632 #
59. kstrauser ◴[] No.44011632{8}[source]
I might do that. It’s nothing I’ve thought about in depth, just an occasionally recurring idea that bugs me every now and then.
60. ori_b ◴[] No.44011700{4}[source]
My machine is marginally faster for that; I get about 17ms doing that with python, without the print:

    time echo "exit()" | python3

    real    0m0.017s
    user    0m0.014s
    sys     0m0.003s
That's... still pretty slow. Here's a C program, run 100 times:

    range=`seq 100`
    time for i in $range; do ./a.out; done

    real    0m0.038s
    user    0m0.024s
    sys     0m0.016s
And finally, for comparison on Plan 9:

   range=`{seq 2000}
   time rc -c 'for(s in $range){ ./6.nop }'

   0.01u 0.09s 0.16r   rc -c for(s in $range){ ./6.nop }
the C program used was simply:

   int main(void) { return 0; }
Of course, the more real work you do in the program, the less it matters -- but there's a hell of a lot you can do in the time it takes Python to launch.
61. kragen ◴[] No.44011731{6}[source]
On a 64-core machine, Python code that uses all the cores will be modestly faster than single-threaded C, even if all the inner loops are in Python. If you can move the inner loops to C, for example with Numpy, you can do much better still. (Python is still harder to get right than something like C or OCaml, of course, especially for larger programs, but often the smaller amount of code and quicker feedback loop can compensate for that.)
62. kragen ◴[] No.44011765{3}[source]
On my cellphone forkovh is 700μs and forkovh.py is 3700μs. Qualcomm SDM630. All the forkovh numbers are with 102400 bytes of data.
63. monkeyelite ◴[] No.44011843[source]
> I wonder why people never complained so much about JavaScript not having shared-everything threading

Because it greatly simplifies the language and gives you all kinds of invariants.

64. csense ◴[] No.44011991{3}[source]
Yeah, when I'm having Python performance issues, my first instinct is to reach for Pypy. My second instinct is to rewrite the "hot" part in C or Rust.
65. codethief ◴[] No.44011993{3}[source]
> Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter

…which can also be a great source of subtle bugs if you're writing a cross-platform application.

66. Too ◴[] No.44012256{5}[source]
Never experienced this. If this is truly a problem, here is a sledgehammer; just beware it will not close your TCP connections gracefully: os.kill(os.getpid(), signal.SIGKILL).
67. Too ◴[] No.44012318[source]
In Python, if you are spawning processes or even threads in a tight loop, you have already lost. Use ThreadPoolExecutor or ProcessPoolExecutor from concurrent.futures instead. Then startup time stops being a factor.
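
For instance, a pool created once and reused amortizes worker startup across all submitted tasks (sketch):

    from concurrent.futures import ProcessPoolExecutor

    def crunch(n):
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # workers start once; every map()/submit() call reuses them
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(crunch, [10_000] * 100))
        print(len(results), results[0])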