FireDucks: Pandas but Faster

(hwisnu.bearblog.dev)

398 points sebg | 5 comments | 14 Nov 24 11:48 UTC

rich_sasha ◴[20 Nov 24 11:56 UTC] No.42193043[source]▶

It's a bit sad for me. I find the biggest issue for me with pandas is the API, not the speed.

So many footguns, poorly thought-through functions, tens of keyword arguments instead of good abstractions, 1D and 2D structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great; it's just a shame for me that there seems to be no effort to make a Pandas-like thing with a better API. Maybe time to roll up my sleeves...

replies(22): >>42193093 #>>42193139 #>>42193143 #>>42193309 #>>42193374 #>>42193380 #>>42193693 #>>42193936 #>>42194067 #>>42194113 #>>42194302 #>>42194361 #>>42194490 #>>42194544 #>>42194670 #>>42195628 #>>42196720 #>>42197192 #>>42197489 #>>42198158 #>>42199832 #>>42200060 #

1. sega_sai ◴[20 Nov 24 12:26 UTC] No.42193309[source]▶

>>42193043 #

Great point that I completely share. I tend to avoid pandas at all costs except for very simple things, as I have been bitten by many issues related to indexing. For anything complicated I tend to switch to duckdb instead.

replies(1): >>42193551 #

2. bravura ◴[20 Nov 24 13:00 UTC] No.42193551[source]▶

>>42193309 (TP) #

Can you explain your use-case and why DuckDB is better?

I'm considering switching from pandas and want to understand what my best bet is. I am just processing feature vectors that are too large for memory, and need an initial simple JOIN to aggregate them.

replies(2): >>42193825 #>>42195551 #

3. sega_sai ◴[20 Nov 24 13:37 UTC] No.42193825[source]▶

>>42193551 #

I am not necessarily saying duckdb is better. I personally just found it easier and clearer to write a SQL query for any complicated set of joins and group-by processing than to try to do that in pandas.
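To illustrate the kind of query being described, here is a minimal sketch of a join followed by a group-by aggregation written as one SQL statement. It is not the commenter's code: the table and column names (`items`, `features`, `item_id`, `category`, `value`) are hypothetical, and Python's built-in `sqlite3` stands in for duckdb only so the example runs without extra installs. A query of this shape should carry over to duckdb largely unchanged.

```python
import sqlite3

# Stand-in engine: sqlite3 ships with Python. This is an assumption made for
# the sketch; the thread itself is about duckdb.
con = sqlite3.connect(":memory:")

# Hypothetical toy tables: item metadata plus several feature rows per item.
con.executescript("""
    CREATE TABLE items (item_id INTEGER, category TEXT);
    CREATE TABLE features (item_id INTEGER, value REAL);
    INSERT INTO items VALUES (1, 'a'), (2, 'a'), (3, 'b');
    INSERT INTO features VALUES (1, 1.0), (1, 3.0), (2, 2.0), (3, 10.0);
""")

# The join and the aggregation are stated together in one declarative query.
query = """
    SELECT i.category, COUNT(*) AS n, AVG(f.value) AS mean_value
    FROM features AS f
    JOIN items AS i ON i.item_id = f.item_id
    GROUP BY i.category
    ORDER BY i.category
"""
rows = con.execute(query).fetchall()
print(rows)
```

In pandas the same result would typically take a `merge` followed by a `groupby(...).agg(...)`, which is where the indexing issues mentioned earlier in the thread tend to show up.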

4. rapatel0 ◴[20 Nov 24 16:30 UTC] No.42195551[source]▶

>>42193551 #

Look into [Ibis](https://ibis-project.org/). It's a dataframe library built on duckdb. It supports lazy execution, larger-than-memory data structures, and remote S3 data, and it is insanely fast. It also works with basically any backend (postgres, mysql, parquet/csv files, etc.), though there are some implementation gaps in places.

I previously had a pandas+sklearn transformation stack that would take up to 8 hours. Converted it to ibis and it executes in about 4 minutes now and doesn't fill up RAM.

It's not a perfect apples-to-apples pandas replacement, but it is really a nice layer on top of SQL. After learning it, I'm almost as fast as I was on pandas with expressions.

replies(1): >>42196216 #

5. techwizrd ◴[20 Nov 24 17:27 UTC] No.42196216{3}[source]▶

>>42195551 #

I made the switch to Ibis a few months ago and have been really enjoying it. It works with all the plotting libraries, including seaborn and plotnine. And it makes switching from testing on a CSV to running on SQL/Spark a one-line change. It's just really handy for analysis (similar to the tidyverse).
