FireDucks: Pandas but Faster

(hwisnu.bearblog.dev)

398 points sebg | 3 comments | 14 Nov 24 11:48 UTC

rich_sasha ◴[20 Nov 24 11:56 UTC] No.42193043[source]▶

It's a bit sad for me. The biggest issue I have with pandas is the API, not the speed.

So many footguns, poorly thought-through functions, tens of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great, it's just a shame for me that there seems to be no effort to make a Pandas-like thing with a better API. Maybe time to roll up my sleeves...

1. epistasis ◴[20 Nov 24 18:22 UTC] No.42196720[source]▶

>>42193043 #

Have you examined siuba at all? It promises to be more similar to the R tidyverse, which IMHO has a much better API. And I personally prefer dplyr/tidyverse to Polars for exploratory analysis.

https://siuba.org

I have not yet used siuba, but would be interested in others' opinions. The activation energy to learn a new set of tools is so large that I rarely have the time to fully examine this space...

replies(2): >>42197228 #>>42198236 #

2. Bootvis ◴[20 Nov 24 19:23 UTC] No.42197228[source]▶

>>42196720 (TP) #

The lack of non-standard evaluation still forces you to write `_.`, so this might be a better Pandas but not a better tidyverse.

A pity their comparisons don’t include tidyverse or R’s data.table. I think R would look simpler, but as it stands that remains unclear.

3. otsaloma ◴[20 Nov 24 21:20 UTC] No.42198236[source]▶

>>42196720 (TP) #

I think the choice of using functions instead of classes + methods doesn't really fit well into Python. Either you need to do a huge amount of imports or use the awful `from siuba import *`. This feels like shoehorning the dplyr syntax into Python when method chaining would be more natural and would still retain the idea.
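To make the contrast concrete, here is a minimal sketch of the two styles in plain Python. Everything in it (`Verb`, `where`, `select`, `Table`) is a hypothetical name made up for illustration, not siuba's or pandas' API; rows are just a list of dicts.

```python
class Verb:
    """Hypothetical wrapper so that `rows >> verb` applies the wrapped function."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, rows):
        # list defines no __rshift__, so Python falls back to this method.
        return self.func(rows)


# Function style: each verb is a free function that must be imported by name.
def where(predicate):
    return Verb(lambda rows: [r for r in rows if predicate(r)])


def select(*names):
    return Verb(lambda rows: [{n: r[n] for n in names} for r in rows])


# Method style: the verbs live on the object, so nothing extra is imported.
class Table:
    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, predicate):
        return Table(r for r in self.rows if predicate(r))

    def select(self, *names):
        return Table({n: r[n] for n in names} for r in self.rows)


rows = [{"name": "a", "score": 3}, {"name": "b", "score": 7}]

piped = rows >> where(lambda r: r["score"] > 5) >> select("name")
chained = Table(rows).where(lambda r: r["score"] > 5).select("name").rows
```

Both forms produce the same rows. The difference is where the verbs live: in the function style every verb is a separate name in the module namespace, while in the method style they come along with the object.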

Also, having (already a while ago) looked at the implementation of the magic `_` object, it seemed like an awful hack that will serve only part of the use cases. Maybe someone can correct me if I'm wrong, but I get the impression you can do e.g. `summarize(x=_.x.mean())` but not `summarize(x=median(_.x))`. I'm guessing you don't get autocompletion in your editor or useful error messages, and it can then get painful using this kind of magic.
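For anyone wondering why the method form and the function form could behave differently, here is a generic sketch of the deferred-expression pattern. This is an assumption about how such placeholders typically work, not siuba's actual implementation; `Deferred`, `Column` and `Table` are hypothetical names.

```python
import statistics


class Deferred:
    """Hypothetical placeholder that records attribute access and calls for later replay."""

    def __init__(self, steps=()):
        self._steps = steps

    def __getattr__(self, name):
        # Reached only for names not found normally; refuse dunder lookups
        # so protocol probes are not recorded as steps.
        if name.startswith("__"):
            raise AttributeError(name)
        return Deferred(self._steps + (("attr", name),))

    def __call__(self, *args, **kwargs):
        return Deferred(self._steps + (("call", args, kwargs),))

    def resolve(self, target):
        """Replay the recorded steps against a concrete object."""
        value = target
        for kind, *payload in self._steps:
            if kind == "attr":
                value = getattr(value, payload[0])
            else:
                args, kwargs = payload
                value = value(*args, **kwargs)
        return value


class Column(list):
    def mean(self):
        return sum(self) / len(self)


class Table:
    def __init__(self, **columns):
        for name, values in columns.items():
            setattr(self, name, Column(values))


_ = Deferred()
table = Table(x=[1, 2, 3, 10])

method_form = _.x.mean()             # nothing is computed, only recorded
mean_x = method_form.resolve(table)  # replayed later against real data

try:
    statistics.median(_.x)           # an ordinary function runs immediately
    function_form_error = None
except TypeError as exc:
    function_form_error = exc        # the placeholder is not iterable
```

The method form works because every step goes through the placeholder, which can record it. An ordinary function receives the placeholder as if it were data and fails right away, unless the library ships wrapped versions of such functions that know how to defer. The sketch also hints at the tooling problem: an editor has nothing to autocomplete on `_.`, since any attribute name is accepted.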
