
FireDucks: Pandas but Faster

(hwisnu.bearblog.dev)
398 points sebg | 1 comment
thecleaner ◴[] No.42192790[source]
Sure, but that's single-node performance. That makes it not very useful IMO, since quite a few data science folks work with Hadoop clusters or Snowflake clusters or Databricks, where data is distributed and querying is handled by Spark executors.
replies(2): >>42193202 #>>42193542 #
Kalanos ◴[] No.42193542[source]
Hadoop hasn't been relevant for a long time, which is telling.

Unless I had thousands of files to work with, I would be loath to use cluster computing. There's so much overhead, cost, waiting for nodes to spin up, and cloud architecture nonsense.

My "single node" computer is a refurbished tower server with 256GB RAM and 50 threads.

Most of these distributed computing solutions arose before data processing tools started taking multi-threading seriously.

replies(1): >>42216640 #
markhahn ◴[] No.42216640[source]
understood: big facilities get shared; sharing requires arbitration and queueing.

an interesting angle on 50 threads and 256G: your data is probably pretty cool (cache-friendly). if your threads are merely HT, that's only 25 real cores, and might be only a single socket, implying probably <100 GB/s memory bandwidth. so a best-case touch-all-memory operation would take several seconds. for non-sequential access patterns, effective rates would be much lower and keep the cores even less busy.
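The back-of-envelope arithmetic above can be checked directly. The numbers here are the comment's assumptions (256 GB resident, ~100 GB/s peak single-socket bandwidth), not measurements:

```python
# Best-case time for one pass that touches all of memory,
# using the comment's assumed figures.
ram_gb = 256            # assumed: all 256 GB of RAM holds data
bandwidth_gb_s = 100    # assumed: ~100 GB/s peak for a single socket

best_case_scan_s = ram_gb / bandwidth_gb_s
print(f"best-case touch-all-memory pass: {best_case_scan_s:.2f} s")
# Non-sequential access reaches only a fraction of peak bandwidth,
# so a real pass takes correspondingly longer.
```

At ~2.6 s per full sequential pass, any query plan that touches the whole dataset more than a couple of times is bandwidth-bound regardless of core count.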

so cache-friendliness is really the determining factor in this context. I wonder how much these packages are oriented toward cache tuning. it affects basic strategy, such as how filtering is implemented in an expression graph...