
62 points eneuman | 1 comment
refactor_master ◴[] No.43376912[source]
It seems like this hack would be fine for notebooks, but not something I’d be interested in for production code.

Why not just something like this?

  import random
  import time
  from concurrent.futures import ThreadPoolExecutor

  import pandas as pd

  def f(n):
      time.sleep(random.uniform(0.1, 0.3))  # Simulate network delay
      return pd.DataFrame({"A": [n, n+1], "B": [n*2, (n+1)*2]})

  with ThreadPoolExecutor() as ex:
      df = pd.concat(ex.map(f, range(3)), ignore_index=True)
replies(2): >>43377514 #>>43377647 #
isoprophlex ◴[] No.43377514[source]
Indeed... the longer I write Python, the more I just try to solve stuff with a simple ThreadPoolExecutor.

I think this is not the best choice for CPU-bound work (the GIL means only one thread executes Python bytecode at a time), which is likely what you're running into with pandas, but nevertheless... I like how you can almost always slap a thread pool onto something and speed things up, with minimal cognitive overhead.

replies(3): >>43377574 #>>43377761 #>>43379109 #
dkh ◴[] No.43377761[source]
The intended use-case for this is actually very different from what you describe, and one where aiopandas would be much faster than a ThreadPoolExecutor.

Let's say you have a pandas dataframe and you want to use `pandas.map` to run a function on every element of it where, for some reason, the new value is determined by making an API request with the current value. Whether you do this in the main thread or in a thread pool, it's going to run those requests one at a time, and very slowly. You can make X requests at once inside a thread pool, where X is the number of workers you set, but that number is usually not very high, and running HTTP requests asynchronously is going to absolutely wipe the floor with your thread pool: you can run hundreds to thousands of concurrent HTTP requests per second on asyncio.

So yes, the actual work pandas has to do in terms of inserting/modifying the dataframe is all CPU-bound, and it's going to block. But each of those iterations would probably be measured in milliseconds, and 95%+ of the wait time you'd experience doing this synchronously would just be waiting for those HTTP requests to finish. In this use case, this library (assuming it works as described) would be far superior, by many multiples if not an order of magnitude.
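
To make the pattern concrete, here's a minimal sketch of that idea using plain asyncio + aiohttp rather than aiopandas itself (the endpoint URL and the `amap` helper are hypothetical, purely for illustration):

  import asyncio

  import aiohttp
  import pandas as pd

  async def fetch_one(session, value):
      # Hypothetical endpoint: one API call per element.
      async with session.get(f"https://api.example.com/lookup/{value}") as resp:
          return (await resp.json())["result"]

  async def amap(series):
      async with aiohttp.ClientSession() as session:
          # All requests go out at once; gather preserves input order.
          results = await asyncio.gather(*(fetch_one(session, v) for v in series))
      return pd.Series(results, index=series.index)

  df = pd.DataFrame({"A": [1, 2, 3]})
  df["B"] = asyncio.run(amap(df["A"]))

The per-element pandas work still runs serially on one thread, but it's milliseconds stacked against the seconds of network wait that are now overlapped.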

That said, I have absolutely no idea who is making HTTP requests on each iteration of a pandas map, or what percentage of that group of people didn't solve it some other way.

replies(1): >>43379115 #
dkh ◴[] No.43379115[source]
As a very simple example, here's aiohttp making 10,000 HTTP requests (HEAD requests to a list of different URLs) in a single thread but asynchronously, vs. ThreadPoolExecutor making them synchronously but across 32 workers (I had to drastically reduce the number of URLs to make sitting through it bearable): https://asciinema.org/a/MkoOVQBSeBanRRZtsu3xe5FUk
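
For reference, a rough sketch of that kind of comparison (this is not the script from the recording; example.com stands in for a real URL list, and the counts are scaled down):

  import asyncio
  import time
  from concurrent.futures import ThreadPoolExecutor

  import aiohttp
  import requests

  URLS = ["https://example.com/"] * 500  # stand-in for a list of distinct URLs

  async def head_all(urls):
      async with aiohttp.ClientSession() as session:
          async def head(url):
              async with session.head(url) as resp:
                  return resp.status
          # Single thread, but every request is in flight concurrently.
          return await asyncio.gather(*(head(u) for u in urls))

  def head_sync(url):
      # One blocking request per call; concurrency is capped at the worker count.
      return requests.head(url).status_code

  t0 = time.perf_counter()
  asyncio.run(head_all(URLS))
  print(f"asyncio/aiohttp: {time.perf_counter() - t0:.1f}s")

  t0 = time.perf_counter()
  with ThreadPoolExecutor(max_workers=32) as ex:
      list(ex.map(head_sync, URLS))
  print(f"32-worker pool:  {time.perf_counter() - t0:.1f}s")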