←back to thread

FireDucks: Pandas but Faster

(hwisnu.bearblog.dev)
374 points sebg | 2 comments | | HN request time: 0.42s | source
Show context
rich_sasha ◴[] No.42193043[source]
It's a bit sad for me. I find the biggest issue for me with pandas is the API, not the speed.

So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great, it's just a shame for me that there seems no effort to make a Pandas-like thing with better API. Maybe time to roll up my sleeves...

replies(22): >>42193093 #>>42193139 #>>42193143 #>>42193309 #>>42193374 #>>42193380 #>>42193693 #>>42193936 #>>42194067 #>>42194113 #>>42194302 #>>42194361 #>>42194490 #>>42194544 #>>42194670 #>>42195628 #>>42196720 #>>42197192 #>>42197489 #>>42198158 #>>42199832 #>>42200060 #
faizshah ◴[] No.42193693[source]
Pandas is a commonly known DSL at this point so lots of data scientists know pandas like the back of their hand and thats why a lot of pandas but for X libraries have become popular.

I agree that pandas does not have the best designed api in comparison to say dplyr but it also has a lot of functionality like pivot, melt, unstack that are often not implemented by other libraries. It’s also existed for more than a decade at this point so there’s a plethora of resources and stackoverflow questions.

On top of that, these days I just use ChatGPT to generate some of my pandas tasks. ChatGPT and other coding assistants know pandas really well so it’s super easy.

But I think if you get to know Pandas after a while you just learn all the weird quirks but gain huge benefits from all the things it can do and all the other libraries you can use with it.

replies(1): >>42193743 #
1. rich_sasha ◴[] No.42193743[source]
I've been living in the shadow of pandas for about a decade now, and the only thing I learned is to avoid using it.

I 100% agree that pandas addresses all the pain points of data analysis in the wild, and this is precisely why it is so popular. My point is, it doesn't address them well. It seems like a conglomerate of special cases, written for a specific problem it's author was facing, with little concern for consistency, generality or other use cases that might arise.

In my usage, any time saved by its (very useful) methods tends to be lost on fixing subtle bugs introduced by strange pandas behaviours.

In my use cases, I reindex the data using pandas and get it to numpy arrays as soon as I can, and work with those, with a small library of utilities I wrote over the years. I'd gladly use a "sane pandas" instead.

replies(1): >>42194460 #
2. specproc ◴[] No.42194460[source]
Aye, but we've learned it, we've got code bases written in it, many of us are much more data kids than "real devs".

I get it doesn't follow best practices, but it does do what it needs to. Speed has been an issue, and it's exciting seeing that problem being solved.

Interesting to see so many people recently saying "polars looks great, but no way I'll rewrite". This library seems to give a lot of people, myself included, exactly what we want. I look forward to trying it.