Most active commenters
  • kstrauser(6)
  • frollogaston(5)
  • tomrod(4)
  • kragen(4)
  • zahlman(4)
  • sgarland(3)
  • tinix(3)
  • charleshn(3)
  • notpushkin(3)


256 points by rbanffy | 67 comments
1. sgarland ◴[] No.44004897[source]
> Instead, many reach for multiprocessing, but spawning processes is expensive

Agreed.

> and communicating across processes often requires making expensive copies of data

SharedMemory [0] exists. Never understood why this isn’t used more frequently. There’s even a ShareableList which does exactly what it sounds like, and is awesome.

[0]: https://docs.python.org/3/library/multiprocessing.shared_mem...
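
For what it's worth, a minimal sketch of the pattern (illustrative only, not taken from the docs): a ShareableList is created in the parent and attached to by name in the child, so no copy of the data is made.

    from multiprocessing import Process
    from multiprocessing.shared_memory import ShareableList

    def worker(name):
        sl = ShareableList(name=name)   # attach to the existing segment; no copy
        sl[0] += 1                      # visible to the parent immediately
        sl.shm.close()

    if __name__ == "__main__":
        sl = ShareableList([0, "hello", 3.14])
        p = Process(target=worker, args=(sl.shm.name,))
        p.start(); p.join()
        print(sl[0])                    # 1
        sl.shm.close()
        sl.shm.unlink()                 # free the segment once everyone is done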

replies(7): >>44004956 #>>44005006 #>>44006103 #>>44006145 #>>44006664 #>>44006670 #>>44007267 #
2. ogrisel ◴[] No.44005006[source]
You cannot share arbitrarily structured objects in the `ShareableList`, only atomic scalars and bytes / strings.

If you want to share structured Python objects between processes, you have to pay the cost of `pickle.dumps`/`pickle.loads` (CPU overhead for interprocess communication) + the memory cost of replicated objects in the processes.
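
To make the cost concrete: anything handed to a worker via multiprocessing is pickled on one side and unpickled on the other, so every process ends up with its own copy. A rough sketch of what happens under the hood:

    import pickle
    from multiprocessing import Pool

    def touch(obj):
        return len(obj["rows"])        # `obj` is a rebuilt copy inside the worker

    if __name__ == "__main__":
        data = {"rows": list(range(1_000_000))}
        # roughly what multiprocessing does for each transfer:
        blob = pickle.dumps(data)      # CPU cost of serializing
        copy = pickle.loads(blob)      # plus a full second copy in memory
        with Pool(2) as pool:
            print(pool.map(touch, [data, data]))   # pickled once per task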

replies(3): >>44006004 #>>44008341 #>>44010473 #
3. tomrod ◴[] No.44006004[source]
I can fit a lot of json into bytes/strings though?
replies(4): >>44006041 #>>44006052 #>>44007146 #>>44008154 #
4. cjbgkagh ◴[] No.44006041{3}[source]
Perhaps flatbuffers would be better?
replies(2): >>44006072 #>>44007279 #
5. vlovich123 ◴[] No.44006052{3}[source]
That’s even worse than pickle.
replies(1): >>44006078 #
6. tomrod ◴[] No.44006072{4}[source]
I love learning from folks on HN -- thanks! Will check it out.
replies(1): >>44008294 #
7. tomrod ◴[] No.44006078{4}[source]
pickle pickles to pickle binary, yeah? So can stream that too with an io Buffer :D
8. modeless ◴[] No.44006103[source]
Yeah I've had great success sharing numpy arrays this way. Explicit sharing is not a huge burden, especially when compared with the difficulty of debugging problems that occur when you accidentally share things between threads. People vastly overstate the benefit of threads over multiprocessing and I don't look forward to all the random segfaults I'm going to have to debug after people start routinely disabling the GIL in a library ecosystem that isn't ready.

I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.
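
For reference, the explicit-sharing pattern being described looks roughly like this (a sketch assuming numpy is installed; names are illustrative):

    import numpy as np
    from multiprocessing import Process
    from multiprocessing.shared_memory import SharedMemory

    def worker(shm_name, shape, dtype):
        shm = SharedMemory(name=shm_name)
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)   # a view, not a copy
        arr *= 2                                               # the parent sees this
        shm.close()

    if __name__ == "__main__":
        shm = SharedMemory(create=True, size=8 * 1_000_000)
        a = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm.buf)
        a[:] = 1.0
        p = Process(target=worker, args=(shm.name, a.shape, a.dtype))
        p.start(); p.join()
        print(a[:3])             # [2. 2. 2.]
        shm.close(); shm.unlink()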

replies(5): >>44007514 #>>44007861 #>>44010315 #>>44010354 #>>44011843 #
9. chubot ◴[] No.44006145[source]
Spawning processes generally takes much less than 1 ms on Unix

Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports

It's 1 to 2 orders of magnitude difference, so it's worth being precise

This conflation is the fallacy behind, say, dismissing CGI. A CGI program in C, Rust, or Go works perfectly well.

e.g. sqlite.org runs with a process PER REQUEST - https://news.ycombinator.com/item?id=3036124
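
A rough way to see the two numbers side by side on a Unix machine (a sketch; absolute figures vary a lot by hardware and by how many imports the interpreter pulls in):

    import os, subprocess, sys, time

    def time_fork(n=100):
        start = time.perf_counter()
        for _ in range(n):
            pid = os.fork()
            if pid == 0:
                os._exit(0)             # child exits immediately
            os.waitpid(pid, 0)
        return (time.perf_counter() - start) / n

    def time_interpreter(n=10):
        start = time.perf_counter()
        for _ in range(n):
            subprocess.run([sys.executable, "-c", "pass"], check=True)
        return (time.perf_counter() - start) / n

    if __name__ == "__main__":
        print(f"fork+wait:         {time_fork() * 1e3:.2f} ms")
        print(f"fresh interpreter: {time_interpreter() * 1e3:.2f} ms")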

replies(8): >>44006287 #>>44007950 #>>44008877 #>>44009754 #>>44009755 #>>44009805 #>>44010011 #>>44012318 #
10. Sharlin ◴[] No.44006287[source]
Unix is not the only platform though (and is process creation fast on all Unices or just Linux?) The point about interpreter init overhead is, of course, apt.
replies(1): >>44007828 #
11. isignal ◴[] No.44006670[source]
Processes can die independently, so if a process dies while modifying a concurrent shared-memory data structure under a lock, the resulting state can be difficult to manage. Postgres, which uses shared-memory data structures, sometimes needs to kill all its backend processes because it cannot fully recover from such a state.

In contrast, no one thinks about what happens if a thread dies independently because the failure mode is joint.

replies(2): >>44007457 #>>44007632 #
12. frollogaston ◴[] No.44007146{3}[source]
If all your state is already json-serializable, yeah. But that's just as expensive as copying if not more, hence what cjbgkagh said about flatbuffers.
replies(1): >>44010306 #
13. tinix ◴[] No.44007267[source]
shared memory only works on dedicated hardware.

if you're running in something like AWS fargate, there is no shared memory. have to use the network and file system which adds a lot of latency, way more than spawning a process.

copying processes through fork is a whole different problem.

green threads and an actor model will get you much further in my experience.

replies(2): >>44008492 #>>44010425 #
14. tinix ◴[] No.44007279{4}[source]
let me introduce you to quickle.
15. wongarsu ◴[] No.44007457[source]
> In contrast, no one thinks about what happens if a thread dies independently because the failure mode is joint.

In Rust, if a thread holding a mutex dies, the mutex becomes poisoned, and trying to acquire it leads to an error that has to be handled. As a consequence, every Rust developer who touches a mutex has to think about that failure mode, even if in 95% of cases the best answer is "let's exit when that happens".

The operating system tends to treat your whole process as one unit and shut down everything or nothing. But a thread can still crash on its own due to an unhandled OOM, assertion failures, or any number of other issues.

replies(1): >>44008025 #
16. dhruvrajvanshi ◴[] No.44007514[source]
> I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.

This is a fair observation.

I think part of the problem is that the things that make GIL-less Python hard are also the things that make faster baseline performance hard, i.e. an over-reliance of the ecosystem on the shape of the CPython data structures.

What makes Python different is that a large percentage of Python code isn't Python, but C code targeting the CPython API. This isn't true for a lot of other interpreted languages.

17. jcalvinowens ◴[] No.44007632[source]
This is a solvable problem though, the literature is overflowing with lock-free implementations of common data structures. The real question is how much performance you have to sacrifice for the guarantee...
18. btilly ◴[] No.44007828{3}[source]
Process creation should be fast on all Unices. If it isn't, then the lowly shell script (heavily used in Unix) is going to perform very poorly.
replies(2): >>44009682 #>>44010066 #
19. com2kid ◴[] No.44007861[source]
> I wonder why people never complained so much about JavaScript not having shared-everything threading. Maybe because JavaScript is so much faster that you don't have to reach for it as much. I wish more effort was put into baseline performance for Python.

Nobody sane tries to do math in JS. Backend JS is recommended for situations where processing is minimal and it is mostly lots of tiny IO requests that need to be shunted around.

I'm a huge JS/Node proponent and if someone says they need to write a backend service that crunches a lot of numbers, I'll recommend choosing a different technology!

For some reason Python peeps keep trying to do actual computations in Python...

replies(1): >>44010362 #
20. charleshn ◴[] No.44007950[source]
> Spawning processes generally takes much less than 1 ms on Unix

It depends on whether one uses clone, fork, posix_spawn etc.

Fork can take a while depending on the size of the address space, number of VMAs etc.

replies(2): >>44009524 #>>44009676 #
21. jcalvinowens ◴[] No.44008025{3}[source]
> But a thread can still crash in its own due to unhandled oom, assertion failures or any number of other issues

That's not really true on POSIX. Unless you're doing nutty things with clone(), or you actually have explicit code that calls pthread_exit() or gettid()/pthread_kill(), the whole process is always going to die at the same time.

POSIX signal dispositions are process-wide, the only way e.g. SIGSEGV kills a single thread is if you write an explicit handler which actually does that by hand. Unhandled exceptions usually SIGABRT, which works the same way.

** Just to expand a bit: there is a subtlety in that, while dispositions are process-wide, one individual thread does indeed take the signal. If the signal is handled, only that thread sees -EINTR from a blocking syscall; but if the signal is not handled, the default disposition affects all threads in the process simultaneously no matter which thread is actually signalled.

replies(1): >>44009232 #
22. reliabilityguy ◴[] No.44008154{3}[source]
What’s the point? The whole idea is to share objects, not to serialize them, whether it’s JSON, pickle, or whatever.
replies(1): >>44008726 #
23. notpushkin ◴[] No.44008294{5}[source]
Take a look at https://capnproto.org/ as well, while at it.

Neither solve the copying problem, though.

replies(2): >>44010278 #>>44010304 #
24. notpushkin ◴[] No.44008341[source]
We need a dataclass-like interface on top of a ShareableList.
replies(1): >>44010745 #
25. bradleybuda ◴[] No.44008492[source]
Fargate is just a container runtime. You can fork processes and share memory like you can in any other Linux environment. You may not want to (because you are running many cheap / small containers) but if your Fargate containers are running 0.25 vCPUs then you probably don't want traditional multiprocessing or multithreading...
replies(1): >>44010396 #
26. tomrod ◴[] No.44008726{4}[source]
I mean, the answer to this is pretty straightforward -- because we can, not because we should :)
27. LPisGood ◴[] No.44008877[source]
My understanding is that spawning a thread takes just a few microseconds, so whether you're talking about a plain process or a Python interpreter process, there are still orders of magnitude to be gained.
28. wahern ◴[] No.44009232{4}[source]
It would be nice if someday we got per-thread signal handlers to complement per-thread signal masking and per-thread alternate signal stacks.
29. crackez ◴[] No.44009524{3}[source]
Fork on Linux should use copy-on-write VM pages now, so if you fork inside Python it should be cheap. If you launch a new Python process from, let's say, the shell, and it's already in the buffer cache, then you should only have to pay the startup CPU cost of the interpreter, since the IO should be satisfied from the buffer cache...
replies(1): >>44010504 #
30. knome ◴[] No.44009676{3}[source]
For glibc on Linux, fork just calls clone, as does posix_spawn (using the flag CLONE_VFORK).
31. kragen ◴[] No.44009682{4}[source]
While I think you've been using Unix longer than I have, shell scripts are known for performing very poorly, and on PDP-11 Unix (where perhaps shell scripts were most heavily used, since Perl didn't exist yet) fork() couldn't even do copy-on-write; it had to literally copy the process's entire data segment, which in most cases also contained a copy of its code. Moving to paged machines like the VAX and especially the 68000 family made it possible to use copy-on-write, but historically speaking, Linux has often been an order of magnitude faster than most other Unices at fork(). However, I think people mostly don't use those Unices anymore. I imagine the BSDs have pretty much caught up by now.

https://news.ycombinator.com/item?id=44009754 gives some concrete details on fork() speed on current Linux: 50μs for a small process, 700μs for a regular process, 1300μs for a venti Python interpreter process, 30000–50000μs for Python interpreter creation. This is on a CPU of about 10 billion instructions per second per core, so forking costs on the order of ½–10 million instructions.

32. kragen ◴[] No.44009754[source]
To be concrete about this, http://canonical.org/~kragen/sw/dev3/forkovh.c took 670μs to fork, exit, and wait on the first laptop I tried it on when compiled with glibc, but only 130μs compiled with dietlibc instead of glibc; on a 2.3 GHz E5-2697 Xeon it took 130μs even with glibc.

httpdito http://canonical.org/~kragen/sw/dev3/server.s (which launches a process per request) seems to take only about 50μs because it's not linked with any C library and therefore only maps 5 pages. Also, that doesn't include the time for exit() because it runs multiple concurrent child processes.

On this laptop, a Ryzen 5 3500U running at 2.9GHz, forkovh takes about 330μs built with glibc and about 130–140μs built with dietlibc, and `time python3 -c True` takes about 30000–50000μs. I wrote a Python version of forkovh http://canonical.org/~kragen/sw/dev3/forkovh.py and it takes about 1200μs to fork(), _exit(), and wait().

If anyone else wants to clone that repo and test their own machines, I'm interested to hear the results, especially if they aren't in Linux. `make forkovh` will compile the C version.

1200μs is pretty expensive in some contexts but not others. Certainly it's cheaper than spawning a new Python interpreter by more than an order of magnitude.

replies(1): >>44011765 #
33. jaoane ◴[] No.44009755[source]
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms before you get to main(), depending on the number of imports

That's lucky. On constrained systems launching a new interpreter can very well take 10 seconds. Python is ssssslllloooowwwww.

34. morningsam ◴[] No.44009805[source]
>Spawning a PYTHON interpreter process might take 30 ms to 300 ms

Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter, which takes low-single-digit ms as well.
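
The start method is selectable per program, so the difference is easy to see or force explicitly; a small sketch:

    import multiprocessing as mp

    def work(x):
        return x * x

    if __name__ == "__main__":
        # 'fork' copies the already-initialized interpreter (cheap; not on Windows);
        # 'spawn' starts a fresh interpreter and re-imports the main module.
        ctx = mp.get_context("fork")
        with ctx.Pool(4) as pool:
            print(pool.map(work, range(8)))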

replies(2): >>44010280 #>>44011993 #
35. ori_b ◴[] No.44010011[source]
As another example: I run https://shithub.us with shell scripts, serving a terabyte or so of data monthly (mostly due to AI crawlers that I can't be arsed to block).

I'm launching between 15 and 3000 processes per request. While Plan 9 is about 10x faster at spawning processes than Linux, it's telling that 3000 C processes launching in a shell is about as fast as one Python interpreter.

replies(1): >>44010564 #
36. fredoralive ◴[] No.44010066{4}[source]
Python runs on other operating systems, like NT, where AIUI processes are rather more heavyweight.

Not all use cases of Python and Windows intersect (how much web server stuff is a Windows / IIS / SQL Server / Python stack? Probably not many, although WISP is a nice acronym), but you’ve still got to bear it in mind for people doing heavy numpy stuff on their work laptop or whatever.

37. ◴[] No.44010278{6}[source]
38. zahlman ◴[] No.44010280{3}[source]
Even when the 'spawn' strategy is used (default on Windows, and can be chosen explicitly on Linux), the overhead can largely be avoided. (Why choose it on Linux? Apparently forking can cause problems if you also use threads.) Python imports can be deferred (`import` is a statement, not a compiler or pre-processor directive), and child processes (regardless of the creation strategy) name the main module as `__mp_main__` rather than `__main__`, allowing the programmer to distinguish. (Being able to distinguish is of course necessary here, to avoid making a fork bomb - since the top-level code runs automatically and `if __name__ == '__main__':` is normally top-level code.)

But also keep in mind that cleanup for a Python process also takes time, which is harder to trace.

Refs:

https://docs.python.org/3/library/multiprocessing.html#conte... https://stackoverflow.com/questions/72497140
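
Concretely, the spawn-safe pattern being described looks something like this (a sketch; the file paths are placeholders):

    import multiprocessing as mp

    def work(path):
        import json                     # deferred import: paid only where needed
        with open(path) as f:
            return len(json.load(f))

    if __name__ == "__main__":          # children re-import this module as '__mp_main__',
        mp.set_start_method("spawn")    # so this guarded block is not re-run (no fork bomb)
        with mp.Pool(2) as pool:
            print(pool.map(work, ["a.json", "b.json"]))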

replies(1): >>44010625 #
39. frollogaston ◴[] No.44010304{6}[source]
Ah, I forgot capnproto doesn't let you edit a serialized proto in-memory, it's read-only. In theory this should be possible as long as you're not changing the length of anything, but I'm not surprised such trickery is unsupported.

So this doesn't seem like a versatile solution for sharing data structs between two Python processes. You're gonna have to reserialize the whole thing if one side wants to edit, which is basically copying.

40. frollogaston ◴[] No.44010306{4}[source]
oh nvm, that doesn't solve this either
41. zahlman ◴[] No.44010315[source]
> I wish more effort was put into baseline performance for Python.

There has been. That's why the bytecode is incompatible between minor versions. It was a major selling(?) point for 3.11 and 3.12 in particular.

But the "Faster CPython" team at Microsoft was apparently just laid off (https://www.linkedin.com/posts/mdboom_its-been-a-tough-coupl...), and all of the optimization work has to my understanding been based around fairly traditional techniques. The C part of the codebase has decades of legacy to it, after all.

Alternative implementations like PyPy often post impressive results, and are worth checking out if you need to worry about native Python performance. Not to mention the benefits of shifting the work onto compiled code like NumPy, as you already do.

replies(1): >>44011991 #
42. frollogaston ◴[] No.44010354[source]
"I wonder why people never complained so much about JavaScript not having shared-everything threading"

Mainly cause Python is often used for data pipelines in ways that JS isn't, causing situations where you do want to use multiple CPU cores with some shared memory. If you want to use multiple CPU cores in NodeJS, usually it's just a load-balancing webserver without IPC and you just use throng, or maybe you've got microservices.

Also, JS parallelism simply excelled from the start at waiting on tons of IO, there was no confusion about it. Python later got asyncio for this, and by now regular threads have too much momentum. Threads are the worst of both worlds in Py, cause you get the overhead of an OS thread and the possibility of race conditions without the full parallelism it's supposed to buy you. And all this stuff is confusing to users.

43. frollogaston ◴[] No.44010362{3}[source]
Python peeps tend to do heavy numbers calc in numpy, but sometimes you're doing expensive things with dictionaries/lists.
44. tinix ◴[] No.44010396{3}[source]
Go try it and report back.

Fargate isn't just ECS and plain containers.

You cannot use shared memory in fargate, there is literally no /dev/shm.

See "sharedMemorySize" here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/...

> If you're using tasks that use the Fargate launch type, the sharedMemorySize parameter isn't supported.

45. sgarland ◴[] No.44010425[source]
Well don’t use Fargate, there’s your problem. Run programs on actual servers, not magical serverless bullshit.
46. sgarland ◴[] No.44010473[source]
So don’t do that? Send data to workers as primitives, and have a separate process that reads the results and serializes it into whatever form you want.
47. charleshn ◴[] No.44010504{4}[source]
> Fork on Linux should use copy-on-write vmpages now, so if you fork inside python it should be cheap.

No, that's exactly the point I'm making: copying PTEs is not cheap on a large address space with many VMAs.

You can run a simple python script allocating a large list and see how it affects fork time.

replies(1): >>44011430 #
48. kstrauser ◴[] No.44010564{3}[source]
The interpreter itself is pretty quick:

  ᐅ time echo "print('hi'); exit()" | python
  hi
  
  ________________________________________________________
  Executed in   21.48 millis    fish           external
     usr time   16.35 millis  146.00 micros   16.20 millis
     sys time    4.49 millis  593.00 micros    3.89 millis
replies(1): >>44011700 #
49. kstrauser ◴[] No.44010625{4}[source]
I really wish Python had a way to annotate things you don't care about cleaning up. I don't know what the API would look like, but I imagine something like:

  l = list(cleanup=False)
  for i in range(1_000_000_000): l.append(i)
telling the runtime that we don't need to individually GC each of those tiny objects and just let the OS's process model free the whole thing at once.

Sure, close TCP connections before you kill the whole thing. I couldn't care less about most objects, though.

replies(4): >>44010738 #>>44010861 #>>44010911 #>>44012256 #
50. zahlman ◴[] No.44010738{5}[source]
You'd presumably need to do something involving weakrefs, since it would be really bad if you told Python that the elements can be GCd at all (never mind whether it can be done all at once) but someone else had a reference.

Or completely rearchitect the language to have a model of automatic (in the C sense) allocation. I can't see that ever happening.

replies(1): >>44010924 #
51. notpushkin ◴[] No.44010745{3}[source]
Actually, ShareableList feels like a tuple really (as it’s impossible to change its length). If we could mix ShareableList and collections.namedtuple together, it would get us 90% there (99.9% if we use typing.NamedTuple). Unfortunately, I can’t decipher either one [1, 2] at first glance – maybe if I get some more sleep?

[1]: https://github.com/python/cpython/blob/3.13/Lib/collections/...

[2]: https://github.com/python/cpython/blob/3.13/Lib/typing.py#L2...
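
One rough shape such a wrapper could take (an illustrative sketch, not an existing API): map attribute names onto fixed slots of a ShareableList.

    from multiprocessing.shared_memory import ShareableList

    class SharedRecord:
        _fields = ("x", "y", "label")

        def __init__(self, values=None, *, name=None):
            # create a new segment from `values`, or attach to an existing one by name
            self._sl = ShareableList(values) if values is not None else ShareableList(name=name)

        def __getattr__(self, key):
            if key in type(self)._fields:
                return self._sl[type(self)._fields.index(key)]
            raise AttributeError(key)

        def __setattr__(self, key, value):
            if key in type(self)._fields:
                self._sl[type(self)._fields.index(key)] = value
            else:
                super().__setattr__(key, value)

    # parent: rec = SharedRecord([1.0, 2.0, "point A"]); pass rec._sl.shm.name to a child
    # child:  rec = SharedRecord(name=that_name); rec.x = 3.0   # parent sees the update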

52. duped ◴[] No.44010861{5}[source]
Tbh if you're optimizing python code you've already lost
replies(2): >>44010925 #>>44011731 #
53. Izkata ◴[] No.44010911{5}[source]
There's already a global:

  import gc
  gc.disable()
So I imagine putting more in there to remove objects from the tracking.
replies(1): >>44010998 #
54. kstrauser ◴[] No.44010924{6}[source]
I don't think either of those are true. I'm not arguing against cleaning up objects during the normal runtime. What I'd like is something that would avoid GC'ing objects one-at-a-time at program shutdown.

I've had cases where it took Python like 30 seconds to exit after I'd slurped a large CSV with a zillion rows into RAM. At that time, I'd dreamed of a way to tell Python not to bother free()ing any of that, just exit() and let Linux unmap RAM all at once. If you think about it, there probably aren't that many resources you actually care about individually freeing on exit. I'm certain someone will prove me wrong, but at a first pass, objects that don't define __del__ or __exit__ probably don't care how you destroy them.
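
The blunt tool that exists today is os._exit(), which skips interpreter teardown entirely and lets the kernel reclaim memory in one unmap, at the cost of bypassing atexit handlers and buffered-output flushing (a sketch):

    import os, sys

    def main():
        rows = [tuple(range(20)) for _ in range(10_000_000)]   # lots of tiny objects
        # ... do the real work ...
        sys.stdout.flush()      # flush anything you still care about
        os._exit(0)             # exit now; no per-object GC at shutdown

    if __name__ == "__main__":
        main()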

replies(1): >>44011521 #
55. kstrauser ◴[] No.44010925{6}[source]
Run along.
56. kstrauser ◴[] No.44010998{6}[source]
That can go a long way, so long as you remember to manually GC the handful of things you do care about.
57. charleshn ◴[] No.44011430{5}[source]
See e.g. https://www.alibabacloud.com/blog/async-fork-mitigating-quer...
58. zahlman ◴[] No.44011521{7}[source]
Ah.

I imagine the problem is that `__del__` could be monkeypatched, so Python doesn't strictly know what needs custom finalization until that moment.

But if you have a concrete proposal, it's likely worth shopping around at https://discuss.python.org/c/ideas/6 or https://github.com/python/cpython/issues/ .

replies(1): >>44011632 #
59. kstrauser ◴[] No.44011632{8}[source]
I might do that. It’s nothing I’ve thought about in depth, just an occasionally recurring idea that bugs me every now and then.
60. ori_b ◴[] No.44011700{4}[source]
My machine is marginally faster for that; I get about 17ms doing that with python, without the print:

    time echo "exit()" | python3

    real    0m0.017s
    user    0m0.014s
    sys     0m0.003s
That's... still pretty slow. Here's a C program, run 100 times:

    range=`seq 100`
    time for i in $range; do ./a.out; done

    real    0m0.038s
    user    0m0.024s
    sys     0m0.016s
And finally, for comparison on Plan 9:

   range=`{seq 2000}
   time rc -c 'for(s in $range){ ./6.nop }'

   0.01u 0.09s 0.16r   rc -c for(s in $range){ ./6.nop }
the C program used was simply:

   int main(void) { return 0; }
Of course, the more real work you do in the program, the less it matters -- but there's a hell of a lot you can do in the time it takes Python to launch.
61. kragen ◴[] No.44011731{6}[source]
On a 64-core machine, Python code that uses all the cores will be modestly faster than single-threaded C, even if all the inner loops are in Python. If you can move the inner loops to C, for example with Numpy, you can do much better still. (Python is still harder to get right than something like C or OCaml, of course, especially for larger programs, but often the smaller amount of code and quicker feedback loop can compensate for that.)
62. kragen ◴[] No.44011765{3}[source]
On my cellphone forkovh is 700μs and forkovh.py is 3700μs. Qualcomm SDM630. All the forkovh numbers are with 102400 bytes of data.
63. monkeyelite ◴[] No.44011843[source]
> I wonder why people never complained so much about JavaScript not having shared-everything threading

Because it greatly simplifies the language and gives you all kinds of invariants.

64. csense ◴[] No.44011991{3}[source]
Yeah, when I'm having Python performance issues, my first instinct is to reach for Pypy. My second instinct is to rewrite the "hot" part in C or Rust.
65. codethief ◴[] No.44011993{3}[source]
> Which is why, at least on Linux, Python's multiprocessing doesn't do that but fork()s the interpreter

…which can also be a great source of subtle bugs if you're writing a cross-platform application.

66. Too ◴[] No.44012256{5}[source]
Never experienced this. If this is truly a problem, here is a sledgehammer; just beware it will not close your TCP connections gracefully: os.kill(os.getpid(), signal.SIGKILL).
67. Too ◴[] No.44012318[source]
In Python, if you are spawning processes or even threads in a tight loop, you have already lost. Use ThreadPoolExecutor or ProcessPoolExecutor from concurrent.futures instead. Then startup time stops being a factor.
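
For instance, a pool created once and reused amortizes worker startup across all submitted tasks (sketch):

    from concurrent.futures import ProcessPoolExecutor

    def crunch(n):
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # workers start once; every map()/submit() call reuses them
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(crunch, [10_000] * 100))
        print(len(results), results[0])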