Show HN: Aiopandas – Async .apply() and .map() for Pandas, Faster API/LLMs Calls

(github.com)

1. gardnr ◴[15 Mar 25 21:41 UTC] No.43375347[source]▶

This is super useful. So much boilerplate code to run async calls and gather. I have been using tqdm.gather() so I am glad to see this library supports it.
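For anyone who hasn't written it by hand, the boilerplate in question looks roughly like the sketch below. It uses only the standard library; `fetch_label` is a made-up stand-in for a real API or LLM call, and none of this is aiopandas' own API.

```python
import asyncio

async def fetch_label(value: int) -> str:
    # Hypothetical stand-in for an API/LLM request: waits briefly, then answers.
    await asyncio.sleep(0.01)
    return f"label-{value}"

async def label_all(values: list[int]) -> list[str]:
    # One coroutine per value; gather runs them concurrently and
    # returns the results in the same order as the inputs.
    return await asyncio.gather(*(fetch_label(v) for v in values))

labels = asyncio.run(label_all([1, 2, 3]))
```

With a dataframe you would still have to pull the column out, build the coroutines, and assign the results back yourself.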

replies(1): >>43376358 #

2. eneuman ◴[16 Mar 25 01:33 UTC] No.43376358[source]▶

>>43375347 #

Thanks! I originally built this to scratch an itch I had, so I’m really glad you find it useful too. If you have any ideas for improvements or missing features, feel free to suggest them — or even open a PR!

3. westurner ◴[16 Mar 25 02:13 UTC] No.43376495[source]▶

>>43374505 (OP) #

Can this be merged into pandas?

Pandas does not currently install tqdm by default.

pandas-dev/pandas//pyproject.toml [project.optional-dependencies] https://github.com/pandas-dev/pandas/blob/8943c97c597677ae98...

Dask solves for various adjacent problems; IDK if pandas, dask, or dask-cudf would be faster with async?

Dask docs > Scheduling > Dask Distributed (local) https://docs.dask.org/en/stable/scheduling.html#dask-distrib... :

> Asynchronous Futures API

Dask docs > Deploy Dask Clusters; local multiprocessing pool, k8s (docker desktop, podman-desktop,), public and private clouds, dask-jobqueue (SLURM,), dask-mpi: https://docs.dask.org/en/stable/deploying.html#deploy-dask-c...

Dask docs > Dask DataFrame: https://docs.dask.org/en/stable/dataframe.html :

> Dask DataFrames are a collection of many pandas DataFrames.

> The API is the same. The execution is the same.

> [concurrent.futures and/or @dask.delayed]

tqdm.dask: https://tqdm.github.io/docs/dask/#tqdmdask .. tests/tests_pandas.py: https://github.com/tqdm/tqdm/blob/master/tests/tests_pandas.... , tests/tests_dask.py: https://github.com/tqdm/tqdm/blob/master/tests/tests_dask.py

tqdm with dask.distributed: https://github.com/tqdm/tqdm/issues/1230#issuecomment-222379... , not yet a PR: https://github.com/tqdm/tqdm/issues/278#issuecomment-5070062...

dask.diagnostics.progress: https://docs.dask.org/en/stable/diagnostics-local.html#progr...

dask.distributed.progress: https://docs.dask.org/en/stable/diagnostics-distributed.html...

dask-labextension runs in JupyterLab and has a parallel plot visualization of the dask task graph and progress through it: https://github.com/dask/dask-labextension

dask-jobqueue docs > Interactive Use > Viewing the Dask Dashboard: https://jobqueue.dask.org/en/latest/clusters-interactive.htm...

https://examples.dask.org/ > "Embarrassingly parallel Workloads" tutorial re: "three different ways of doing this with Dask: dask.delayed, concurrent.Futures, dask.bag": https://examples.dask.org/applications/embarrassingly-parall...

replies(1): >>43376723 #

4. napsternxg ◴[16 Mar 25 02:34 UTC] No.43376582[source]▶

>>43374505 (OP) #

This is a very clean API and I really like the way you implemented it directly in Pandas. I worked on something similar 2 years back but the API was not as clean as this one. Thanks a lot for making this.

replies(1): >>43376739 #

5. eneuman ◴[16 Mar 25 03:15 UTC] No.43376723[source]▶

>>43376495 #

Thank you for the input! To be honest, I don’t use Dask often, and as a regular Pandas user, I don’t feel the most qualified to comment—but here we go.

> Can this be merged into Pandas?

I’d be honored if something I built got incorporated into Pandas! That said, keeping aiopandas as a standalone package has the advantage of working with older Pandas versions, which is useful for workflows where upgrading isn’t feasible. I also can’t speak to the downstream implications of adding this directly into Pandas.

> Pandas does not install tqdm by default.

That makes sense, and aiopandas doesn’t require tqdm either. You can pass any class with __init__, update, and close methods as the tqdm argument, and it will work the same. Keeping dependencies minimal helps avoid unnecessary breakage.
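As a rough illustration of that duck-typed interface, here is a minimal progress object with only `__init__`, `update`, and `close`. The `CountingProgress` class and the `run_with_progress` driver are hypothetical names for this sketch, not aiopandas code, and which keyword arguments the library passes to the constructor is an assumption (hence the `**kwargs`).

```python
import asyncio

class CountingProgress:
    """Minimal tqdm-like object exposing only __init__, update and close."""

    def __init__(self, total=None, **kwargs):
        # Extra keyword arguments are accepted and ignored (assumption).
        self.total = total
        self.done = 0
        self.closed = False

    def update(self, n=1):
        self.done += n

    def close(self):
        self.closed = True

async def run_with_progress(coros, progress_cls):
    # Hypothetical driver: ticks the bar once as each coroutine finishes.
    bar = progress_cls(total=len(coros))

    async def tracked(coro):
        result = await coro
        bar.update(1)
        return result

    try:
        results = await asyncio.gather(*(tracked(c) for c in coros))
    finally:
        bar.close()
    return results, bar

async def double(x):
    await asyncio.sleep(0.01)  # stand-in for a slow request
    return x * 2

results, bar = asyncio.run(
    run_with_progress([double(i) for i in range(5)], CountingProgress)
)
```

Anything with those three methods can stand in for tqdm in a driver like this, which is why the dependency can stay optional.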

> What about Dask?

I’m not a regular Dask user, so I can’t comment much on its internals. Dask already supports async coroutines (Dask Async API), but for simple async API calls or LLM requests, aiopandas is meant to be a lightweight extension of Pandas rather than a full-scale parallelization framework. If you’re already using Dask, it probably covers most of what you need, but if you’re just looking to add async support to Pandas without additional complexity, aiopandas might be a more lightweight option.

replies(1): >>43419708 #

6. eneuman ◴[16 Mar 25 03:19 UTC] No.43376739[source]▶

>>43376582 #

You’re more than welcome! I really appreciate the kind words.

If you have any ideas for improvements, missing features, or run into any issues, don't hesitate to share!

7. refactor_master ◴[16 Mar 25 04:19 UTC] No.43376912[source]▶

>>43374505 (OP) #

It seems like this hack would be fine for notebooks, but not something I’d be interested in for production code.

Why not just something like this?

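  import random
  import time
  from concurrent.futures import ThreadPoolExecutor
  import pandas as pd

  def f(n):
      time.sleep(random.uniform(0.1, 0.3))  # Simulate network delay
      return pd.DataFrame({"A": [n, n+1], "B": [n*2, (n+1)*2]})

  # ex.map runs f in worker threads and yields results in input order
  with ThreadPoolExecutor() as ex:
      df = pd.concat(ex.map(f, range(3)), ignore_index=True)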

replies(2): >>43377514 #>>43377647 #

8. isoprophlex ◴[16 Mar 25 07:57 UTC] No.43377514[source]▶

>>43376912 #

indeed... the longer i write python, the more i just try to solve stuff with a simple ThreadPoolExecutor.

I think doing this is not the best choice for cpu-bound work, which is likely what you're running into with pandas, but nevertheless... I like how you can almost always slap a threadpool onto something and speed things up, with minimal cognitive overhead.

replies(3): >>43377574 #>>43377761 #>>43379109 #

9. hn8726 ◴[16 Mar 25 08:18 UTC] No.43377574{3}[source]▶

>>43377514 #

> not the best choice for cpu-bound work, which is likely what you're running into with pandas

I'm not a Python user, so why is it not good for CPU-bound work? I see the defaults assume some I/O work, but with `max_workers=~cpu_count` it should do what typical dispatchers for CPU-bound work do in other languages.

replies(1): >>43377774 #

10. dkh ◴[16 Mar 25 08:36 UTC] No.43377647[source]▶

>>43376912 #

These are two different paradigms. aiopandas is not trying to offload pandas work somewhere else to prevent it from blocking synchronous code, it's trying to let you apply asynchronous functions to pandas operations concurrently while running on the event loop inside of other async code.

That said, this is mostly just going to be helpful if you're running pandas operations that call an external API on each iteration or something, and the actual pandas part of the work is still going to be CPU-bound and block. I am also not a huge fan of the monkey-patching approach. But it's clever and will definitely be useful to folks doing a very specific kind of work
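To make the distinction concrete, here is a sketch of the general pattern: an async function applied to every value of a Series, concurrently, from inside other async code. `async_map` and `lookup` are hypothetical names written for this comment, not aiopandas' implementation, and the concurrency cap via a semaphore is my own assumption about what a sensible helper would do.

```python
import asyncio
import pandas as pd

async def async_map(series: pd.Series, func, max_parallel: int = 16) -> pd.Series:
    # Hypothetical helper: at most max_parallel calls are in flight at once.
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(value):
        async with sem:
            return await func(value)

    # gather keeps input order, so the results line up with the index.
    results = await asyncio.gather(*(bounded(v) for v in series))
    return pd.Series(results, index=series.index)

async def lookup(x):
    await asyncio.sleep(0.01)  # stand-in for an external API call
    return x * 10

async def main():
    df = pd.DataFrame({"x": [1, 2, 3]})
    df["y"] = await async_map(df["x"], lookup)
    return df

df = asyncio.run(main())
```

The waiting overlaps on the event loop, while building the Series and assigning the column still happens synchronously.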

11. dkh ◴[16 Mar 25 09:10 UTC] No.43377761{3}[source]▶

>>43377514 #

The intended use-case for this is actually very different from what you describe, and one where aiopandas would be much faster than a ThreadPoolExecutor.

Let's say that you have a pandas dataframe and you want to use `DataFrame.map` to run a function on every element of it where, for some reason, the new value is determined by making an API request with the current value. No matter whether you run that map in the main thread or in a threadpool, it's going to make these requests one at a time, and very slowly. You can make X number of requests at once inside a thread pool where X is the number of workers you set, but this number is not usually very high, and running HTTP requests asynchronously is going to absolutely wipe the floor with your thread pool. You can run hundreds to thousands of concurrent HTTP requests per second on asyncio.

So yes, the actual work that pandas has to do in terms of inserting/modifying the dataframe, that's all CPU-bound, and it's going to block. But 95%+ of the wait time you'd experience doing this synchronously would be just waiting for those http requests to finish. The pandas work is CPU-bound, but each iteration would probably be measured in milliseconds. In this use-case, this library (assuming it works as described) would be far superior, by many multiples if not an order of magnitude.

That said, I have absolutely no idea who is making http requests on each iteration of a pandas map, or what percentage of that group of people didn't solve it some other way.
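A toy version of that comparison, with sleeps standing in for network latency. The request count, delay, and worker count are arbitrary values picked for the sketch, and a simulation like this says nothing about real servers, rate limits, or connection pooling.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N_REQUESTS = 64
DELAY = 0.05  # simulated network latency in seconds

def blocking_request(i):
    time.sleep(DELAY)  # a worker thread is tied up for the whole wait
    return i

async def async_request(i):
    await asyncio.sleep(DELAY)  # yields to the event loop while waiting
    return i

def run_threads(workers=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(blocking_request, range(N_REQUESTS)))
    return results, time.perf_counter() - start

async def _gather_all():
    return await asyncio.gather(*(async_request(i) for i in range(N_REQUESTS)))

def run_asyncio():
    start = time.perf_counter()
    results = asyncio.run(_gather_all())
    return results, time.perf_counter() - start

thread_results, thread_secs = run_threads()
async_results, async_secs = run_asyncio()
print(f"threads: {thread_secs:.2f}s  asyncio: {async_secs:.2f}s")
```

The thread pool can only overlap as many waits as it has workers, so it works through the requests in batches, while the event loop overlaps all of them at once.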

replies(1): >>43379115 #

12. dkh ◴[16 Mar 25 09:17 UTC] No.43377774{4}[source]▶

>>43377574 #

Python "threads" don't behave like threads in most other languages because of Python's Global Interpreter Lock (GIL): only one thread executes Python bytecode at any given moment, so they never run Python code in parallel. They are great for network IO since most IO time is spent waiting for stuff rather than computing anything, but you can't run CPU-heavy stuff on multiple Python threads and expect a speedup proportional to the number of workers. For this, you have to use process pools. (Though this is something that is in the process of finally being alleviated/fixed!)
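A minimal sketch of the process-pool route for CPU-bound work. `count_primes` is a deliberately slow toy function chosen for illustration, and the limits and worker count are arbitrary.

```python
import math
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit: int) -> int:
    # Deliberately CPU-heavy pure-Python loop (trial division).
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count

def count_primes_parallel(limits, workers=4):
    # Each call runs in a separate process with its own interpreter and GIL,
    # so the calls can use multiple cores at the same time.
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(count_primes, limits))

if __name__ == "__main__":
    print(count_primes_parallel([20_000, 20_000, 20_000, 20_000]))
```

Swapping in `ThreadPoolExecutor` here would still give correct results, but the threads would take turns holding the GIL instead of running side by side.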

replies(1): >>43378161 #

13. lyu07282 ◴[16 Mar 25 11:21 UTC] No.43378161{5}[source]▶

>>43377774 #

This all seems a bit misleading to beginners. If you have numerical CPU-bound work in Python, what you should be doing is vectorizing it, not parallelizing it.

https://www.geeksforgeeks.org/vectorized-operations-in-numpy...
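A small sketch of what that means in practice, assuming NumPy is available. The function names and the arithmetic are made up for illustration.

```python
import numpy as np

def scale_loop(values, factor):
    # Element-by-element Python loop: every iteration runs in the interpreter.
    return [v * factor + 1.0 for v in values]

def scale_vectorized(values, factor):
    # Same arithmetic as a single array expression; the loop over elements
    # happens inside NumPy's compiled code instead of in Python.
    return np.asarray(values, dtype=float) * factor + 1.0

data = [0.5, 1.5, 2.5]
looped = scale_loop(data, 2.0)
vectorized = scale_vectorized(data, 2.0)
```

Both versions compute the same values; the vectorized one just avoids per-element interpreter overhead, which is usually where the time goes on large arrays.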

replies(1): >>43379089 #

14. dkh ◴[16 Mar 25 13:52 UTC] No.43379089{6}[source]▶

>>43378161 #

The point is that the use-case here is one where there is far more IO-bound work than CPU-bound.

replies(1): >>43379909 #

15. ◴[16 Mar 25 13:55 UTC] No.43379109{3}[source]▶

>>43377514 #

16. dkh ◴[16 Mar 25 13:55 UTC] No.43379115{4}[source]▶

>>43377761 #

As a very simple example, here's aiohttp making 10,000 http requests (HEAD requests to a list of different urls) in a single thread but asynchronously vs. ThreadPoolExecutor making them synchronously but across 32 workers (I had to drastically reduce the number of urls in order to make sitting through it bearable): https://asciinema.org/a/MkoOVQBSeBanRRZtsu3xe5FUk

17. ◴[16 Mar 25 15:45 UTC] No.43379909{7}[source]▶

>>43379089 #

18. westurner ◴[20 Mar 25 03:39 UTC] No.43419708{3}[source]▶

>>43376723 #

Fair benchmarks would justify merging aiopandas into pandas. Benchmark grid axes: aiopandas, dtype_backend="pyarrow", dask-cudf

pandas pyarrow docs: https://pandas.pydata.org/docs/dev/user_guide/pyarrow.html

/? async pyarrow: https://www.google.com/search?q=async+pyarrow

/? repo:apache/arrow async language:Python : https://github.com/search?q=repo%3Aapache%2Farrow+async+lang... :

test_flight_async.py https://github.com/apache/arrow/blob/main/python/pyarrow/tes...

pyarrow/src/arrow/python/async.h: https://github.com/apache/arrow/blob/main/python/pyarrow/src... : "Bind a Python callback to an arrow::Future."

dask-cudf: https://docs.rapids.ai/api/dask-cudf/stable/ :

> Neither Dask cuDF nor Dask DataFrame provide support for multi-GPU or multi-node execution on their own. You must also deploy a dask.distributed cluster to leverage multiple GPUs. We strongly recommend using Dask-CUDA to simplify the setup of the cluster, taking advantage of all features of the GPU and networking hardware.

cudf.pandas > FAQ > "When should I use cudf.pandas vs using the cuDF library directly?" https://docs.rapids.ai/api/cudf/stable/cudf_pandas/faq/#when... :

> cuDF implements a subset of the pandas API, while cudf.pandas will fall back automatically to pandas as needed.

> Can I use cudf.pandas with Dask or PySpark?

> [Not at this time, though you can change the dask df to e.g. cudf, which does not implement the full pandas dataframe API]

dask.distributed docs > Asynchronous Operation; re Tornado or asyncio: https://distributed.dask.org/en/latest/asynchronous.html#asy...

tqdm.dask, tqdm.notebook: https://github.com/tqdm/tqdm#ipythonjupyter-integration

  import time
  from tqdm.notebook import trange, tqdm
  for n in trange(10):
      time.sleep(1)

But then there are TPUs, instead of or in addition to async GPUs:

TensorFlow TPU docs: https://www.tensorflow.org/guide/tpu