FireDucks: Pandas but Faster

(hwisnu.bearblog.dev)

398 points sebg | 4 comments | 14 Nov 24 11:48 UTC


rich_sasha ◴[20 Nov 24 11:56 UTC] No.42193043[source]▶

It's a bit sad for me. I find the biggest issue for me with pandas is the API, not the speed.

So many footguns, poorly thought-through functions, tens of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great, it's just a shame for me that there seems to be no effort to make a Pandas-like thing with a better API. Maybe time to roll up my sleeves...

replies(22): >>42193093 #>>42193139 #>>42193143 #>>42193309 #>>42193374 #>>42193380 #>>42193693 #>>42193936 #>>42194067 #>>42194113 #>>42194302 #>>42194361 #>>42194490 #>>42194544 #>>42194670 #>>42195628 #>>42196720 #>>42197192 #>>42197489 #>>42198158 #>>42199832 #>>42200060 #

stared ◴[20 Nov 24 15:03 UTC] No.42194490[source]▶

>>42193043 #

Yes, every time I write df[df.sth == val], a tiny part of me dies.

For comparison, dplyr offers a lot of elegant functionality, and the functional approach in Pandas often feels like an afterthought. If R is cleaner than Python, that says a lot (as a side note: the same story for ggplot2 and matplotlib).

Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write df.loc[:, "col1"] and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
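
To make the runtime-only checking concrete, here is a minimal sketch. The helper `require_columns` is hypothetical (it is not part of Pandas or Pydantic) and simply fails early if an expected column is missing or has an unexpected dtype; the column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

def require_columns(frame, expected):
    """Hypothetical helper: validate column names and dtypes up front.

    `expected` maps a column name to a dtype predicate such as
    pd.api.types.is_integer_dtype.
    """
    for name, dtype_check in expected.items():
        if name not in frame.columns:
            raise KeyError(f"missing column: {name!r}")
        if not dtype_check(frame[name].dtype):
            raise TypeError(f"column {name!r} has unexpected dtype {frame[name].dtype}")
    return frame

# Passes: the column exists and holds integers.
checked = require_columns(df, {"col1": pd.api.types.is_integer_dtype})

# A misspelled column name is only caught when the line actually runs.
try:
    df.loc[:, "col2"]
    missing_column_detected = False
except KeyError:
    missing_column_detected = True
```

A check like this still runs at runtime; it only moves the failure to one well-defined place instead of wherever the column is first used.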

I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.

replies(3): >>42195076 #>>42197375 #>>42202116 #

oreilles ◴[20 Nov 24 15:56 UTC] No.42195076[source]▶

>>42194490 #

> Yes, every time I write df[df.sth == val], a tiny part of me dies.

That's because it's a bad way to use Pandas, even though it is the most popular and most often recommended way. But the thing is, you can just write "safe" immutable Pandas code with method chaining and lambda expressions, resulting in very Polars-like code. For example:

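  import pandas as pd

  df = (
    pd
    .read_csv("./file.csv")
    .rename(columns={"value": "x"})    # returns a new frame
    .assign(y=lambda d: d["x"] * 2)    # d is the whole intermediate frame
    .loc[lambda d: d["y"] > 0.5]       # filter without naming a temporary
  )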

Plus nowadays with the latest Pandas versions supporting Arrow datatypes, Polars performance improvements over Pandas are considerably less impressive.

Column-level name checking would be awesome, but unfortunately no Python library supports that, and it will likely never be possible unless some big changes are made in the Python type hint system.

replies(4): >>42195381 #>>42195401 #>>42195717 #>>42198220 #

1. OutOfHere ◴[20 Nov 24 16:16 UTC] No.42195381[source]▶

>>42195076 #

Using `lambda` without care is dangerous because it risks being not vectorized at all. It risks being super slow, operating one row at a time. Is `d` a single row or the entire series or the entire dataframe?

replies(1): >>42195423 #

2. rogue7 ◴[20 Nov 24 16:20 UTC] No.42195423[source]▶

>>42195381 (TP) #

In this case `d` is the entire dataframe. It's just a way of "piping" the object without having to rename it.

You are probably thinking about `df.apply(lambda row: ..., axis=1)`, which operates on one row at a time and is indeed very slow since it's not vectorized. Here it is different: the lambda is called once with the whole dataframe, so the operation inside it stays vectorized.
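
A small sketch of the difference, using a made-up three-row frame and a `calls` counter that exists only for illustration. It records what each callable receives: `.assign` passes the whole DataFrame once, while `.apply(..., axis=1)` passes one row (a Series) per call.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Illustrative counter: how often each function is called, and with what.
calls = {"frame": 0, "row": 0}

def double_x(d):
    # Called by .assign with the entire DataFrame.
    if isinstance(d, pd.DataFrame):
        calls["frame"] += 1
    return d["x"] * 2  # one vectorized operation on the whole column

def double_row(row):
    # Called by .apply(axis=1) with a single row as a Series.
    if isinstance(row, pd.Series):
        calls["row"] += 1
    return row["x"] * 2  # scalar arithmetic, repeated for every row

chained = df.assign(y=double_x)          # vectorized
row_wise = df.apply(double_row, axis=1)  # row by row
```

Both produce the same values, but the row-wise version pays the Python function-call overhead once per row, which is what makes it slow on large frames.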

replies(2): >>42195757 #>>42196869 #

3. OutOfHere ◴[20 Nov 24 16:48 UTC] No.42195757[source]▶

>>42195423 #

That's excellent.

4. almostkorean ◴[20 Nov 24 18:41 UTC] No.42196869[source]▶

>>42195423 #

Appreciate the explanation, this is something I should know by now but don't
