
FireDucks: Pandas but Faster

(hwisnu.bearblog.dev)
374 points sebg | 1 comments
rich_sasha ◴[] No.42193043[source]
It's a bit sad for me. I find the biggest issue for me with pandas is the API, not the speed.

So many footguns, poorly thought through functions, tens of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great, it's just a shame for me that there seems to be no effort to make a Pandas-like thing with a better API. Maybe time to roll up my sleeves...

replies(22): >>42193093 #>>42193139 #>>42193143 #>>42193309 #>>42193374 #>>42193380 #>>42193693 #>>42193936 #>>42194067 #>>42194113 #>>42194302 #>>42194361 #>>42194490 #>>42194544 #>>42194670 #>>42195628 #>>42196720 #>>42197192 #>>42197489 #>>42198158 #>>42199832 #>>42200060 #
paddy_m ◴[] No.42194361[source]
Have you tried polars? Its syntax is much more regular, and that regularity fits well with the lazy execution. It’s very composable for programmatically building queries. And on top of that, it’s super fast.
replies(1): >>42195149 #
bionhoward ◴[] No.42195149[source]
I found the biggest benefit of polars is ironically the loss of the thing I thought I would miss most, the index; with pandas there are columns, indices, and multi-indices, whereas with polars, everything is a column, it’s all the same so you can delete a lot of conditionals.
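To illustrate the kind of conditional described above, here is a minimal pandas sketch. The helper `get_field` and the sample data are hypothetical, made up only for this example: the same logical field has to be fetched differently depending on whether it currently lives in a column, a plain index, or a multi-index.

```python
import pandas as pd


def get_field(df: pd.DataFrame, name: str) -> pd.Series:
    """Return `name` as a Series whether it is a column or an index level.

    Hypothetical helper, written only to show the branching.
    """
    if name in df.columns:
        return df[name].reset_index(drop=True)
    if name in df.index.names:
        # Index levels are not columns, so they need their own branch.
        return pd.Series(df.index.get_level_values(name), name=name)
    raise KeyError(name)


# Made-up sample data.
flat = pd.DataFrame(
    {"city": ["Oslo", "Lima"], "year": [2023, 2024], "temp": [5.1, 19.4]}
)
indexed = flat.set_index("city")          # "city" moves into the index
multi = flat.set_index(["city", "year"])  # two-level MultiIndex

# The same question ("what are the cities?") across three layouts.
cities = [get_field(d, "city").tolist() for d in (flat, indexed, multi)]
```

When every field is a plain column, only the first branch is ever needed, which is the simplification the comment is pointing at.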

However, I still find myself using pandas for timestamps, timedeltas, and date offsets. Even then, I need a whole extra column just to hold time zones: since polars maps everything to a UTC storage zone, you lose the origin / local TZ, which screws up datasets with heterogeneous time zones. (I also learned that you really need to enforce careful, deliberate consideration of time zone replacement vs offsetting at the API level.)

Had to write a ton of code to deal with this. I wish polars had explicit separation of local vs storage zones on the Datetime data type.
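The difference between replacing a time zone and offsetting to one, plus the "extra column for the origin zone" workaround, can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the fixed-offset zones and the `rows` layout are made up for the example, and it does not use or mirror the polars API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for real zones to keep this dependency-free.
UTC = timezone.utc
TOKYO = timezone(timedelta(hours=9), "JST")
NEW_YORK = timezone(timedelta(hours=-5), "EST")

naive = datetime(2024, 1, 15, 12, 0)

# Replacement: keep the wall-clock reading and attach a zone.
# The resulting instant depends on which zone is attached.
noon_tokyo = naive.replace(tzinfo=TOKYO)
noon_new_york = naive.replace(tzinfo=NEW_YORK)

# Offsetting (conversion): keep the instant, change the wall-clock reading.
tokyo_in_utc = noon_tokyo.astimezone(UTC)
new_york_in_utc = noon_new_york.astimezone(UTC)

# Storing only the UTC instant drops the origin zone, so keep it in a
# separate field (the "whole extra column" workaround).
rows = [
    {"utc": tokyo_in_utc, "origin": TOKYO},
    {"utc": new_york_in_utc, "origin": NEW_YORK},
]
local_again = [r["utc"].astimezone(r["origin"]) for r in rows]
```

Both rows read 12:00 locally but are 14 hours apart as instants, so mixing up the two operations silently shifts the data.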

replies(1): >>42196968 #
1. paddy_m ◴[] No.42196968{3}[source]
I think pandas was so ambitious syntax-wise and concept-wise. But it got to be a bit of a jumble. The index idea in particular is so cool, particularly multi-indexes; watching people who really understand it do multi-index operations is very cool.

IMO Polars sets a different goal: what's the most pandas-like thing we can build that is fast (while leaving open the possibility of more optimization) and clean?

Polars feels like you are obviously manipulating an advanced query engine. Pandas feels like manipulating this squishy data structure that should be super useful and friendly, but sometimes it does something dumb and slow.