
62 points eneuman | 2 comments
westurner ◴[] No.43376495[source]
Can this be merged into pandas?

Pandas does not currently install tqdm by default.

pandas-dev/pandas//pyproject.toml [project.optional-dependencies] https://github.com/pandas-dev/pandas/blob/8943c97c597677ae98...

Dask solves for various adjacent problems; IDK if pandas, dask, or dask-cudf would be faster with async?

Dask docs > Scheduling > Dask Distributed (local) https://docs.dask.org/en/stable/scheduling.html#dask-distrib... :

> Asynchronous Futures API

Dask docs > Deploy Dask Clusters; local multiprocessing pool, k8s (docker desktop, podman-desktop), public and private clouds, dask-jobqueue (SLURM), dask-mpi: https://docs.dask.org/en/stable/deploying.html#deploy-dask-c...

Dask docs > Dask DataFrame: https://docs.dask.org/en/stable/dataframe.html :

> Dask DataFrames are a collection of many pandas DataFrames.

> The API is the same. The execution is the same.

> [concurrent.futures and/or @dask.delayed]

tqdm.dask: https://tqdm.github.io/docs/dask/#tqdmdask .. tests/tests_pandas.py: https://github.com/tqdm/tqdm/blob/master/tests/tests_pandas.... , tests/tests_dask.py: https://github.com/tqdm/tqdm/blob/master/tests/tests_dask.py

tqdm with dask.distributed: https://github.com/tqdm/tqdm/issues/1230#issuecomment-222379... , not yet a PR: https://github.com/tqdm/tqdm/issues/278#issuecomment-5070062...

dask.diagnostics.progress: https://docs.dask.org/en/stable/diagnostics-local.html#progr...

dask.distributed.progress: https://docs.dask.org/en/stable/diagnostics-distributed.html...

dask-labextension runs in JupyterLab and has a parallel plot visualization of the dask task graph and progress through it: https://github.com/dask/dask-labextension

dask-jobqueue docs > Interactive Use > Viewing the Dask Dashboard: https://jobqueue.dask.org/en/latest/clusters-interactive.htm...

https://examples.dask.org/ > "Embarrassingly parallel Workloads" tutorial re: "three different ways of doing this with Dask: dask.delayed, concurrent.Futures, dask.bag": https://examples.dask.org/applications/embarrassingly-parall...

replies(1): >>43376723 #
1. eneuman ◴[] No.43376723[source]
Thank you for the input! To be honest, I don’t use Dask often, and as a regular Pandas user, I don’t feel the most qualified to comment—but here we go.

> Can this be merged into Pandas?

I’d be honored if something I built got incorporated into Pandas! That said, keeping aiopandas as a standalone package has the advantage of working with older Pandas versions, which is useful for workflows where upgrading isn’t feasible. I also can’t speak to the downstream implications of adding this directly into Pandas.

> Pandas does not install tqdm by default.

That makes sense, and aiopandas doesn’t require tqdm either. You can pass any class with __init__, update, and close methods as the tqdm argument, and it will work the same. Keeping dependencies minimal helps avoid unnecessary breakage.
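
As a rough illustration of that duck-typed interface, here is a minimal sketch of a progress class exposing only __init__, update, and close. The names CountingProgress and run_with_progress are hypothetical, and the driver below is plain asyncio, not aiopandas itself; how aiopandas actually constructs and calls the object (for example, which keyword arguments it passes) is an assumption here.

```python
import asyncio


class CountingProgress:
    """Hypothetical tqdm stand-in with only __init__, update, and close."""

    def __init__(self, total=None, **kwargs):
        # Accept and ignore extra keyword arguments (assumption: a caller
        # may pass tqdm-style options such as desc=...).
        self.total = total
        self.done = 0
        self.closed = False

    def update(self, n=1):
        # Record that n more items have finished.
        self.done += n

    def close(self):
        self.closed = True


async def run_with_progress(coros, progress_cls):
    """Await coroutines concurrently, reporting each completion to the bar."""
    bar = progress_cls(total=len(coros))

    async def tracked(coro):
        result = await coro
        bar.update(1)
        return result

    try:
        # gather preserves input order in its result list.
        results = await asyncio.gather(*(tracked(c) for c in coros))
    finally:
        bar.close()
    return results, bar


async def double(x):
    await asyncio.sleep(0)  # stand-in for real async I/O
    return 2 * x


results, bar = asyncio.run(
    run_with_progress([double(i) for i in range(3)], CountingProgress)
)
```

Swapping CountingProgress for tqdm.tqdm should work in the same driver, since tqdm also accepts total= and provides update and close.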

> What about Dask?

I’m not a regular Dask user, so I can’t comment much on its internals. Dask already supports async coroutines (Dask Async API), and if you’re already using it, it probably covers most of what you need. aiopandas is meant to be a lightweight extension of Pandas for simple async API calls or LLM requests, not a full-scale parallelization framework, so it may be the simpler option if you just want async support in Pandas without additional complexity.
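
To make the "simple async API calls" case concrete, here is a minimal sketch of the general pattern using only asyncio: map an async function over a column of values with a cap on concurrent requests. This is not the aiopandas API; async_map, fake_api_call, and max_concurrent are hypothetical names, and the API call is simulated with a short sleep.

```python
import asyncio


async def async_map(values, func, max_concurrent=5):
    """Apply an async function to each value, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(value):
        # The semaphore caps how many calls are in flight at once.
        async with semaphore:
            return await func(value)

    # gather returns results in the same order as the input values.
    return await asyncio.gather(*(bounded(v) for v in values))


async def fake_api_call(text):
    # Hypothetical stand-in for an HTTP or LLM request.
    await asyncio.sleep(0.01)
    return text.upper()


# A plain list is used here to keep the sketch dependency-free; iterating a
# pandas Series yields its values the same way, and the results could be
# wrapped back up with pd.Series(answers, index=series.index).
prompts = ["alpha", "beta", "gamma"]
answers = asyncio.run(async_map(prompts, fake_api_call, max_concurrent=2))
```

Because the calls overlap while waiting on I/O, total wall time is closer to the slowest batch than to the sum of all requests, which is the main benefit for API-bound workloads.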

replies(1): >>43419708 #
2. westurner ◴[] No.43419708[source]
Fair benchmarks would justify merging aiopandas into pandas. Benchmark grid axes: aiopandas, dtype_backend="pyarrow", dask-cudf

pandas pyarrow docs: https://pandas.pydata.org/docs/dev/user_guide/pyarrow.html

/? async pyarrow: https://www.google.com/search?q=async+pyarrow

/? repo:apache/arrow async language:Python : https://github.com/search?q=repo%3Aapache%2Farrow+async+lang... :

test_flight_async.py https://github.com/apache/arrow/blob/main/python/pyarrow/tes...

pyarrow/src/arrow/python/async.h: https://github.com/apache/arrow/blob/main/python/pyarrow/src... : "Bind a Python callback to an arrow::Future."

--

dask-cudf: https://docs.rapids.ai/api/dask-cudf/stable/ :

> Neither Dask cuDF nor Dask DataFrame provide support for multi-GPU or multi-node execution on their own. You must also deploy a dask.distributed cluster to leverage multiple GPUs. We strongly recommend using Dask-CUDA to simplify the setup of the cluster, taking advantage of all features of the GPU and networking hardware.

cudf.pandas > FAQ > "When should I use cudf.pandas vs using the cuDF library directly?" https://docs.rapids.ai/api/cudf/stable/cudf_pandas/faq/#when... :

> cuDF implements a subset of the pandas API, while cudf.pandas will fall back automatically to pandas as needed.

> Can I use cudf.pandas with Dask or PySpark?

> [Not at this time, though you can change the dask df to e.g. cudf, which does not implement the full pandas dataframe API]

--

dask.distributed docs > Asynchronous Operation; re Tornado or asyncio: https://distributed.dask.org/en/latest/asynchronous.html#asy...

--

tqdm.dask, tqdm.notebook: https://github.com/tqdm/tqdm#ipythonjupyter-integration

  import time
  from tqdm.notebook import trange, tqdm

  # Notebook progress bar: ten iterations, one second each.
  for n in trange(10):
      time.sleep(1)

--

But then there are TPUs, instead of or in addition to async GPUs:

TensorFlow TPU docs: https://www.tensorflow.org/guide/tpu