
62 points by eneuman
refactor_master No.43376912
It seems like this hack would be fine for notebooks, but not something I’d be interested in for production code.

Why not just something like this?

  import random
  import time
  from concurrent.futures import ThreadPoolExecutor

  import pandas as pd

  def f(n):
      time.sleep(random.uniform(0.1, 0.3))  # Simulate network delay
      return pd.DataFrame({"A": [n, n+1], "B": [n*2, (n+1)*2]})

  with ThreadPoolExecutor() as ex:
      df = pd.concat(ex.map(f, range(3)), ignore_index=True)
replies(2): >>43377514, >>43377647
isoprophlex No.43377514
Indeed... the longer I write Python, the more I just try to solve stuff with a simple ThreadPoolExecutor.

I think it's not the best choice for CPU-bound work, which is likely what you're running into with pandas, but nevertheless... I like how you can almost always slap a thread pool onto something and speed things up, with minimal cognitive overhead.

replies(3): >>43377574, >>43377761, >>43379109
hn8726 No.43377574
> not the best choice for cpu-bound work, which is likely what you're running into with pandas

I'm not a Python user; why is it not good for CPU-bound work? I see the defaults assume some I/O-bound work, but with `max_workers` set to roughly the CPU count it should match what typical dispatchers for CPU-bound work do in other languages.

replies(1): >>43377774
dkh No.43377774
Python "threads" aren't real threads in the traditional sense because Python's Global Interpreter Lock (GIL) exists, and this means no more than one thread is ever actually running in parallel. They are great for network IO since most IO is just spent waiting for stuff rather than computing anything, but you can't actually run CPU-heavy stuff on multiple Python threads and have the speed multiplier be equal to the number of thread workers. For this, you have to use process pools. (Though this is something that is in the process of finally being alleviated/fixed!)
replies(1): >>43378161
lyu07282 No.43378161
This all seems a bit misleading to beginners: if you have numerical CPU-bound work in Python, what you should be doing is vectorizing it, not parallelizing it.

https://www.geeksforgeeks.org/vectorized-operations-in-numpy...
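
A minimal sketch of what that means (my example, not taken from the linked article): the same arithmetic written as a Python loop and as a single NumPy expression, which pushes the loop down into compiled code.

  import numpy as np

  n = 1_000_000
  xs = np.arange(n, dtype=np.float64)

  # Loop version: every iteration is interpreted Python bytecode
  slow = np.empty(n)
  for i in range(n):
      slow[i] = xs[i] * 2.0 + 1.0

  # Vectorized version: one call, and the loop runs in compiled C
  fast = xs * 2.0 + 1.0

  assert np.allclose(slow, fast)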

replies(1): >>43379089
dkh No.43379089
The point is that the use case here is one where there is far more I/O-bound work than CPU-bound.