
FireDucks: Pandas but Faster

(hwisnu.bearblog.dev)
374 points by sebg | 63 comments
1. rich_sasha ◴[] No.42193043[source]
It's a bit sad for me. The biggest issue I have with pandas is the API, not the speed.

So many footguns, poorly thought-through functions, tens of keyword arguments instead of good abstractions, and 1d and 2d structures being totally different objects (with no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great, it's just a shame for me that there seems no effort to make a Pandas-like thing with better API. Maybe time to roll up my sleeves...

replies(22): >>42193093 #>>42193139 #>>42193143 #>>42193309 #>>42193374 #>>42193380 #>>42193693 #>>42193936 #>>42194067 #>>42194113 #>>42194302 #>>42194361 #>>42194490 #>>42194544 #>>42194670 #>>42195628 #>>42196720 #>>42197192 #>>42197489 #>>42198158 #>>42199832 #>>42200060 #
2. martinsmit ◴[] No.42193093[source]
Check out redframes[1] which provides a dplyr-like syntax and is fully interoperable with pandas.

[1]: https://github.com/maxhumber/redframes

replies(1): >>42195287 #
3. ljosifov ◴[] No.42193139[source]
+1, seconding this. My limited experience with pandas had a non-trivial number of moments of "?? Is it really like this? Nah, I'm mistaken for sure, this cannot be, no one would do something insane like that". And yet, and yet... FWIW I've since found that numpy is a must (of course), but pandas is mostly optional. So I stick to numpy for my own writing, and keep pandas read-only (just executing someone else's code).
replies(1): >>42193622 #
4. omnicognate ◴[] No.42193143[source]
What about the polars API doesn't work well for your use case?
replies(1): >>42193572 #
5. sega_sai ◴[] No.42193309[source]
Great point, and one I completely share. I tend to avoid pandas at all costs except for very simple things, as I have been bitten by many issues related to indexing. For anything complicated I tend to switch to duckdb instead.
replies(1): >>42193551 #
6. amelius ◴[] No.42193374[source]
Yes. Pandas turns 10x developers into .1x developers.
replies(1): >>42193785 #
7. Kalanos ◴[] No.42193380[source]
The pandas API makes a lot more sense if you are familiar with numpy.

Writing pandas code is a bit redundant. So what?

Who is to say that fireducks won't make their own API?

8. bravura ◴[] No.42193551[source]
Can you explain your use-case and why DuckDB is better?

Considering switching from pandas and want to understand what is my best bet. I am just processing feature vectors that are too large for memory, and need an initial simple JOIN to aggregate them.

replies(2): >>42193825 #>>42195551 #
9. short_sells_poo ◴[] No.42193572[source]
Polars is missing a crucial feature for replacing pandas in finance: first-class timeseries handling. Pandas allows me to easily do algebra on timeseries. I can easily resample data with the resample(...) method, I can reason about the index frequency, I can do algebra between timeseries, etc.

You can do the same with Polars, but you have to start messing about with datetimes and convert the simple problem "I want to calculate a monthly sum anchored on the last business day of the month" to SQL-like operations.

Pandas grew a large and obtuse API because it provides specialized functions for 99% of the tasks one needs to do on timeseries. If I want to calculate an exponential weighted covariance between two time series, I can trivially do this with pandas: series1.ewm(...).cov(series2). I welcome people to try and do this with Polars. It'll be a horrible and barely readable contraption.
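
(For concreteness, a minimal sketch of that one-liner, using random data and a hypothetical halflife:)

    import numpy as np
    import pandas as pd

    idx = pd.date_range("2024-01-01", periods=250, freq="B")  # business days
    s1 = pd.Series(np.random.randn(250), index=idx)
    s2 = pd.Series(np.random.randn(250), index=idx)

    # exponentially weighted covariance between the two series, one line
    ewm_cov = s1.ewm(halflife=20).cov(s2)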

YC is mostly populated by technologists, and technologists are often completely ignorant about what makes pandas useful and popular. It was built by quants/scientists, for doing (interactive) research. In this respect it is similar to R, which is not a language well liked by technologists, but it is (surprise) deeply loved by many scientists.

replies(3): >>42193914 #>>42193996 #>>42194053 #
10. faizshah ◴[] No.42193693[source]
Pandas is a commonly known DSL at this point, so lots of data scientists know pandas like the back of their hand, and that's why a lot of "pandas but for X" libraries have become popular.

I agree that pandas does not have the best-designed API in comparison to, say, dplyr, but it also has a lot of functionality like pivot, melt, and unstack that is often not implemented by other libraries. It's also existed for more than a decade at this point, so there's a plethora of resources and Stack Overflow questions.

On top of that, these days I just use ChatGPT to generate some of my pandas tasks. ChatGPT and other coding assistants know pandas really well so it’s super easy.

But I think if you get to know pandas, after a while you just learn all the weird quirks and gain huge benefits from all the things it can do and all the other libraries you can use with it.

replies(1): >>42193743 #
11. rich_sasha ◴[] No.42193743[source]
I've been living in the shadow of pandas for about a decade now, and the only thing I learned is to avoid using it.

I 100% agree that pandas addresses all the pain points of data analysis in the wild, and this is precisely why it is so popular. My point is, it doesn't address them well. It seems like a conglomerate of special cases, written for a specific problem its author was facing, with little concern for consistency, generality, or other use cases that might arise.

In my usage, any time saved by its (very useful) methods tends to be lost on fixing subtle bugs introduced by strange pandas behaviours.

In my use cases, I reindex the data using pandas and get it to numpy arrays as soon as I can, and work with those, with a small library of utilities I wrote over the years. I'd gladly use a "sane pandas" instead.

replies(1): >>42194460 #
12. berkes ◴[] No.42193785[source]
It does to me. Well, a 1x developer into a .01x dev in my case.

My conclusion was that pandas is not for developers, but for one-offs by managers, data scientists, scientists, and so on. And maybe for "hackers" who kludge together stuff 'till it works and then hopefully never touch it.

Which made me realize such thoughts can come across as smug, patronizing, or belittling. But they do show how software can be optimized for different use cases.

The danger then lies in not recognizing these use cases when you pull in something like pandas. "Maybe using pandas to map and reduce the CSVs that our users upload to insert batches isn't a good idea at all".

This is often worsened by the tool/platform/lib devs or communities not advertising these sweet spots and limitations. Not in the case of pandas though: it's really clear about this not being a lib or framework for devs, but a tool(kit) to do data analysis with. Kudos for that.

replies(2): >>42194002 #>>42201883 #
13. sega_sai ◴[] No.42193825{3}[source]
I am not necessarily saying duckdb is better. I personally just found it easier and clearer to write a SQL query for any complicated set of join/group-by processing than to try to do that in pandas.
14. dkga ◴[] No.42193914{3}[source]
Exactly the single reason why I use pandas when I need to use Python. But coming from R, it still feels like "second best".
15. te_chris ◴[] No.42193936[source]
Pandas' best feature for me is the df format being readable by duckdb. The filtering API is a nightmare.
16. n8henrie ◴[] No.42193996{3}[source]
I don't know what exponential weighted covariance is, but I've had pretty good luck converting time series-based analyses from pandas to polars (for patient presentations to my emergency department -- patients per hour, per day, per shift, etc.). Resample has a direct (and easier IMO) replacement in polars, and there is group_by_dynamic.
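
(A rough sketch of what that looks like, with made-up hourly data and hypothetical column names:)

    import polars as pl
    from datetime import datetime

    df = pl.DataFrame({
        "ts": [datetime(2024, 1, 1, h) for h in range(24)],
        "n": [1] * 24,
    })

    # patients per 8-hour shift, roughly what df.resample("8h").sum() does in pandas
    per_shift = df.sort("ts").group_by_dynamic("ts", every="8h").agg(pl.col("n").sum())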

I've had trouble determining whether one timestamp falls between two others across tens of thousands of rows (with the polars team suggesting I use a massive cross product and filter -- which worked but explodes the memory requirement), whereas in pandas I was able to sort the timestamps and thereby only needed to compare against the preceding/following few based on the index of the last match.
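
(The sorted-timestamps trick described above is close to what pandas' merge_asof does; a rough sketch with hypothetical frames:)

    import pandas as pd

    events = pd.DataFrame({"t": pd.to_datetime(["2024-01-01 03:30", "2024-01-01 09:15"])})
    shifts = pd.DataFrame({
        "start": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 08:00"]),
        "shift": ["night", "day"],
    })

    # both sides must be sorted; each event is matched to the last shift start <= t
    out = pd.merge_asof(events.sort_values("t"), shifts.sort_values("start"),
                        left_on="t", right_on="start")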

The other issue I've had with resampling is with polars automatically dropping time periods with zero events, giving me a null instead of zero for the count of events in certain time periods (which then gets dropped from aggregations). This has caught me a few times.

But other than that I've had good luck.

replies(2): >>42194736 #>>42198446 #
17. analog31 ◴[] No.42194002{3}[source]
I'm one of those people myself, and have whittled my Pandas use down to displaying pretty tables in Jupyter. Everything else I do in straight Numpy.
replies(1): >>42196508 #
18. marcogorelli ◴[] No.42194053{3}[source]
Could you show how you write "calculate a monthly sum anchored on the last business day of the month" in pandas please?
replies(3): >>42194368 #>>42195644 #>>42198127 #
19. egecant ◴[] No.42194067[source]
Completely agree. From the perspective of someone who primarily uses R/tidyverse for data wrangling, there is this great article on why the pandas API feels clunky: https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-...
20. movpasd ◴[] No.42194113[source]
I started using Polars for the "rapid iteration" usecase you describe, in notebooks and such, and haven't looked back — there are a few ergonomic wrinkles that I mostly attribute to the newness of the library, but I found that polars forces me to structure my thought process and ask myself "what am I actually trying to do here?".

I find I basically never write myself into a corner with initially expedient but ultimately awkward data structures like I often did with pandas, the expression API makes the semantics a lot clearer, and I don't have to "guess" the API nearly as much.

So even for this usecase, I would recommend trying out polars for anyone reading this and seeing how it feels after the initial learning phase is over.

21. h14h ◴[] No.42194302[source]
If you wanna try a different API, take a look at Elixir Explorer:

https://hexdocs.pm/explorer/exploring_explorer.html

It runs on top of Polars so you get those speed gains, but uses the Elixir programming language. This gives the benefit of a simple functional syntax with pipelines and whatnot.

It also benefits from the excellent Livebook (a Jupyter alternative specific to Elixir) ecosystem, which provides all kinds of benefits.

22. paddy_m ◴[] No.42194361[source]
Have you tried polars? It's a much more regular syntax. The regular syntax fits well with the lazy execution. It's very composable for programmatically building queries. And then it's super fast.
replies(1): >>42195149 #
23. sebg ◴[] No.42194368{4}[source]
Not OP.

But I'm guessing it's something like this:

    import pandas as pd

    def calculate_monthly_business_sum(df, date_column, value_column):
        """
        Calculate monthly sums anchored to the last business day of each month

        Parameters:
        df: DataFrame with dates and values
        date_column: name of date column
        value_column: name of value column to sum

        Returns:
        DataFrame with sums anchored to last business day
        """
        # Ensure date column is datetime
        df[date_column] = pd.to_datetime(df[date_column])

        # Group by end of business month and sum
        monthly_sum = df.groupby(pd.Grouper(
            key=date_column,
            freq='BME'  # Business Month End frequency
        ))[value_column].sum().reset_index()

        return monthly_sum

    # Example usage:
    df = pd.DataFrame({
        'date': ['2024-01-01', '2024-01-31', '2024-02-29'],
        'amount': [100, 200, 300]
    })

    result = calculate_monthly_business_sum(df, 'date', 'amount')
    print(result)

Which you can run here => https://python-fiddle.com/examples/pandas?checkpoint=1732114...

replies(1): >>42194715 #
24. specproc ◴[] No.42194460{3}[source]
Aye, but we've learned it, we've got code bases written in it, many of us are much more data kids than "real devs".

I get it doesn't follow best practices, but it does do what it needs to. Speed has been an issue, and it's exciting seeing that problem being solved.

Interesting to see so many people recently saying "polars looks great, but no way I'll rewrite". This library seems to give a lot of people, myself included, exactly what we want. I look forward to trying it.

25. stared ◴[] No.42194490[source]
Yes, every time I write df[df.sth == val], a tiny part of me dies.

For comparison, dplyr offers a lot of elegant functionality, and the functional approach in pandas often feels like an afterthought. If R is cleaner than Python, that says a lot (as a side note: the same story holds for ggplot2 and matplotlib).

Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write df.loc[:, "col1"] and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
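
A tiny illustration of the point (the column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"col1": [1.0, 2.0]})

    df.loc[:, "col1"]    # works
    # df.loc[:, "col_1"] # KeyError, but only at runtime; no type checker catches it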

I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.

replies(3): >>42195076 #>>42197375 #>>42202116 #
26. otsaloma ◴[] No.42194544[source]
Agreed, never had a problem with the speed of anything NumPy or Arrow based.

Here's my alternative: https://github.com/otsaloma/dataiter https://dataiter.readthedocs.io/en/latest/_static/comparison...

Planning to switch to NumPy 2.0 strings soon. Other than that I feel all the basic operations are fine and solid.

Note for anyone else rolling up their sleeves: You can get quite far with pure Python when building on top of NumPy (or maybe Arrow). The only thing I found needing more performance was group-by-aggregate, where Numba seems to work OK, although a bit difficult as a dependency.
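
(For the curious, the kind of Numba kernel meant here might look like this minimal sketch, assuming integer group codes such as those produced by np.unique or factorize:)

    import numba
    import numpy as np

    @numba.njit
    def group_sum(codes, values, n_groups):
        # single pass over the data, accumulating a sum per group
        out = np.zeros(n_groups)
        for i in range(codes.size):
            out[codes[i]] += values[i]
        return out

    codes = np.array([0, 1, 0, 1])
    values = np.array([1.0, 2.0, 3.0, 4.0])
    group_sum(codes, values, 2)  # array([4., 6.])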

27. adolph ◴[] No.42194670[source]
> So many footguns, poorly thought-through functions, tens of keyword arguments instead of good abstractions

Yeah, Pandas has that early PHP feel to it, probably out of being a successful first mover.

28. short_sells_poo ◴[] No.42194715{5}[source]
It's actually much simpler than that. Assuming the index of the dataframe df is composed of timestamps (which is normal for timeseries):

df.resample("BME").sum()

Done. One line of code and it is quite obvious what it is doing - with perhaps the small exception of BME, but if you want max readability you could do:

df.resample(pd.offsets.BusinessMonthEnd()).sum()

This is why people use pandas.

29. short_sells_poo ◴[] No.42194736{4}[source]
I'm curious how polars' group_by_dynamic is easier than resample in pandas. In pandas, if I want to resample to a monthly frequency anchored to the last business day of the month, I'd write:

> my_df.resample("BME").apply(...)

Done. I don't think it gets any easier than this. Every time I tried something similar with polars, I got bogged down in calendar-treatment hell and large, obscure SQL-like contraptions.

Edit: original tone was unintentionally combative - apologies.

30. oreilles ◴[] No.42195076[source]
> Yes, every time I write df[df.sth == val], a tiny part of me dies.

That's because it's a bad way to use Pandas, even though it is the most popular and oftentimes recommended way. But the thing is, you can just write "safe", immutable Pandas code with method chaining and lambda expressions, resulting in very Polars-like code. For example:

  df = (
    pd
    .read_csv("./file.csv")
    .rename(columns={"value":"x"})
    .assign(y=lambda d: d["x"] * 2)
    .loc[lambda d: d["y"] > 0.5]
  )
Plus, nowadays, with the latest Pandas versions supporting Arrow datatypes, Polars' performance improvements over Pandas are considerably less impressive.

Column-level name checking would be awesome, but unfortunately no Python library supports that, and it will likely never be possible unless some big changes are made to the Python type-hint system.

replies(4): >>42195381 #>>42195401 #>>42195717 #>>42198220 #
31. bionhoward ◴[] No.42195149[source]
I found the biggest benefit of polars is, ironically, the loss of the thing I thought I would miss most: the index. With pandas there are columns, indices, and multi-indices, whereas with polars everything is a column; it's all the same, so you can delete a lot of conditionals.

However, I still find myself using pandas for the timestamps, timedeltas, and date offsets. And even then, I need a whole extra column just to hold time zones: since polars maps everything to a UTC storage zone, you lose the original/local TZ, which screws up heterogeneous-time-zone datasets. (And I learned you really need to enforce careful, thoughtful manual consideration of time-zone replacement vs. offsetting at the API level.)

Had to write a ton of code to deal with this. I wish polars had explicit separation of local vs. storage zones on the Datetime data type.
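
(The replacement-vs-offsetting distinction mentioned above, as a minimal polars sketch:)

    import polars as pl
    from datetime import datetime

    df = pl.DataFrame({"ts": [datetime(2024, 1, 1, 12)]}).with_columns(
        pl.col("ts").dt.replace_time_zone("UTC")
    )

    # convert_time_zone: same instant, different wall-clock reading
    df.with_columns(pl.col("ts").dt.convert_time_zone("America/New_York"))

    # replace_time_zone: same wall clock reinterpreted in a new zone, different instant
    df.with_columns(pl.col("ts").dt.replace_time_zone("America/New_York"))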

replies(1): >>42196968 #
32. otsaloma ◴[] No.42195287[source]
Building on top of Pandas feels like you're only escaping part of the problems. In addition to the API, the datatypes in Pandas are a mess, with multiple confusing (and none of them good) options for e.g. dates/datetimes. Does redframes do anything there?
33. OutOfHere ◴[] No.42195381{3}[source]
Using `lambda` without care is dangerous because it risks not being vectorized at all. It risks being super slow, operating one row at a time. Is `d` a single row, the entire series, or the entire dataframe?
replies(1): >>42195423 #
34. rogue7 ◴[] No.42195401{3}[source]
Agreed 100%. I am using this method-chaining style all the time and it works like a charm.
35. rogue7 ◴[] No.42195423{4}[source]
In this case `d` is the entire dataframe. It's just a way of "piping" the object without having to rename it.

You are probably thinking of `df.apply(lambda row: ..., axis=1)`, which operates on one row at a time and is indeed very slow since it's not vectorized. This here is different and vectorized.
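
A minimal sketch of the difference:

    import pandas as pd

    df = pd.DataFrame({"x": range(5)})

    # `d` is the whole DataFrame, so the arithmetic is vectorized:
    fast = df.assign(y=lambda d: d["x"] * 2)

    # here the lambda runs once per row in Python; same result, much slower at scale:
    slow = df.assign(y=df.apply(lambda row: row["x"] * 2, axis=1))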

replies(2): >>42195757 #>>42196869 #
36. rapatel0 ◴[] No.42195551{3}[source]
Look into [Ibis](https://ibis-project.org/). It's a dataframe library built on duckdb. It supports lazy execution, larger-than-memory data structures, and remote S3 data, and it is insanely fast. It also works with basically any backend (postgres, mysql, parquet/csv files, etc.), though there are some implementation gaps in places.

I previously had a pandas+sklearn transformation stack that would take up to 8 hours. Converted it to ibis and it executes in about 4 minutes now and doesn't fill up RAM.

It's not a perfect apples-to-apples pandas replacement, but it is really a nice layer on top of SQL. After learning it, I'm almost as fast with expressions as I was in pandas.
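
(From memory, a minimal sketch of what that looks like; the file and column names are made up, and exact method names may vary slightly between Ibis versions:)

    import ibis

    con = ibis.duckdb.connect()               # in-memory DuckDB backend
    t = con.read_parquet("features.parquet")  # lazy table, not loaded into RAM

    result = (
        t.group_by("user_id")
         .agg(total=t.amount.sum())
         .to_pandas()                         # execution happens only here
    )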

replies(1): >>42196216 #
37. wodenokoto ◴[] No.42195628[source]
In that case I'd recommend dplyr in R. It also integrates with a better plotting library, ggplot2, which not only gives you a better API than matplotlib but also prettier plots (unless you really put work into your matplotlib code).
38. short_sells_poo ◴[] No.42195644{4}[source]
Answered the child comment but let me copy paste here too. It's literally one (short) line:

> df.resample("BME").sum()

Assuming `df` is a dataframe (i.e. a table) indexed by a timestamp index, which is usual for timeseries analysis.

"BME" stands for BusinessMonthEnd, which you can type out if you want the code to be easier to read by someone not familiar with pandas.

replies(1): >>42199391 #
39. wodenokoto ◴[] No.42195717{3}[source]
I’m not really sure why you think

    .loc[lambda d: d["y"] > 0.5]
is stylistically superior to

    [df.y > 0.5]
I agree it comes in handy quite often, but that still doesn't make it great to write compared to what SQL or dplyr offers in terms of choosing columns to filter on (`where y > 0.5` for SQL and `filter(y > 0.5)` for dplyr).
replies(3): >>42195824 #>>42196070 #>>42197641 #
40. OutOfHere ◴[] No.42195757{5}[source]
That's excellent.
41. oreilles ◴[] No.42195824{4}[source]
It is superior because you don't need to assign your dataframe to a variable ('df') and then update that variable or create a new one every time you need to do that operation. That means it is both safer (you're guaranteed to filter on the current version of the dataframe) and more concise.

For the rest of your comment: it's the best you can do in Python. Sure, you could write SQL, but then you're mixing text queries with Python data manipulation, and I would dread that. And SQL-only scripting is really out of the question.

replies(1): >>42196789 #
42. ◴[] No.42196070{4}[source]
43. techwizrd ◴[] No.42196216{4}[source]
I made the switch to Ibis a few months ago and have been really enjoying it. It works with all the plotting libraries, including seaborn and plotnine. And it makes switching from testing on a CSV to running on SQL/Spark a one-line change. It's just really handy for analysis (similar to the tidyverse).
44. theLiminator ◴[] No.42196508{4}[source]
Imo numpy is not better than pandas for the things you'd use pandas for, though polars is far superior.
45. epistasis ◴[] No.42196720[source]
Have you examined siuba at all? It promises to be more similar to the R tidyverse, which IMHO has a much better API. And I personally prefer dplyr/tidyverse to Polars for exploratory analysis.

https://siuba.org

I have not yet used siuba, but would be interested in others' opinions. The activation energy to learn a new set of tools is so large that I rarely have the time to fully examine this space...

replies(2): >>42197228 #>>42198236 #
46. chaps ◴[] No.42196789{5}[source]
Eh, SQL and Python can still work together very well, with SQL taking the place of pandas. Doing things in waves/batches helps.

The big problem with pandas is that you still have to load the dataframe into memory to work with it. My data's too big for that, and postgres makes that problem go away almost entirely.

47. almostkorean ◴[] No.42196869{5}[source]
Appreciate the explanation, this is something I should know by now but don't
48. paddy_m ◴[] No.42196968{3}[source]
I think pandas was so ambitious, syntax-wise and concept-wise, but it got to be a bit of a jumble. The index idea in particular is so cool, particularly multi-indexes; watching people who really understand it do multi-index operations is very cool.

IMO Polars set a different goal: build the most pandas-like thing that is fast (leaving open the possibility for more optimization) and clean.

Polars feels like you are obviously manipulating an advanced query engine. Pandas feels like manipulating a squishy data structure that should be super useful and friendly, but sometimes does something dumb and slow.

49. kussenverboten ◴[] No.42197192[source]
Agree with this. My favorite syntax is the elegance of the data.table API in R. This should be possible in Python too someday.
50. Bootvis ◴[] No.42197228[source]
The lack of non-standard evaluation still forces you to write `_.`, so this might be a better pandas but not a better tidyverse.

A pity their comparisons don't include the tidyverse or R's data.table. I think R would look simpler, but as it stands it remains unclear.

51. doctorpangloss ◴[] No.42197375[source]
All I want is for the IDE and Python to correctly infer types and column names for all of these array objects. 99% of the pain for me is navigating around SQL return values and CSVs as pieces of text instead of code.
52. nathan_compton ◴[] No.42197489[source]
Yeah. Pandas is the worst. Polars is better in some ways but so verbose!
53. __mharrison__ ◴[] No.42197641{4}[source]
It's superior because it is safer, not because the API (or the requirement to use a lambda) looks better. The lambda allows the operation to work on the current state of the dataframe in the chained operation rather than on the original dataframe. Alternatively, you could use .query("y > 0.5"), which also works on the current state of the dataframe.
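
A minimal sketch of why the lambda matters mid-chain (hypothetical frame):

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3]})

    out = (
        df.assign(y=lambda d: d["x"] * 2)
          .loc[lambda d: d["y"] > 2]  # `d` is the frame *with* y already added
    )
    # df.assign(y=lambda d: d["x"] * 2)[df.y > 2] would raise AttributeError,
    # because the outer `df` has no column `y` yet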

(I'm the first to complain about the many warts in Pandas. Have written multiple books about it. This is annoying, but it is much better than [df.y > 0.5].)

54. tomrod ◴[] No.42198127{4}[source]
A bit from memory as I'm in transit, but something like df.groupby(df[date_col] + pd.offsets.MonthEnd(0))[agg_col].sum()
55. fluorinerocket ◴[] No.42198158[source]
Thank you. I don't know why people think it's so amazing. I sometimes end up just extracting the numpy arrays from the dataframe and doing things the way I know how, because the pandas way is so difficult.
56. moomin ◴[] No.42198220{3}[source]
I mean, yes, there are Arrow data types, but it's got a long way to go before it has full parity with the numpy version.
57. otsaloma ◴[] No.42198236[source]
I think the choice of using functions instead of classes + methods doesn't really fit well into Python. Either you need to do a huge amount of imports or use the awful `from siuba import *`. This feels like shoehorning the dplyr syntax into Python when method chaining would be more natural and would still retain the idea.

Also, having (already a while ago) looked at the implementation of the magic `_` object, it seemed like an awful hack that will serve only part of the use cases. Maybe someone can correct me if I'm wrong, but I get the impression you can do e.g. `summarize(x=_.x.mean())` but not `summarize(x=median(_.x))`. I'm guessing you don't get autocompletion in your editor or useful error messages, and it can then get painful using this kind of magic.

58. cmdlineluser ◴[] No.42198446{4}[source]
> cross product and filter

`.join_where()`[1] was also added recently.

[1]: https://docs.pola.rs/api/python/stable/reference/dataframe/a...
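
(A minimal sketch of the inequality join, with made-up frames:)

    import polars as pl

    events = pl.DataFrame({"t": [3, 7]})
    spans = pl.DataFrame({"start": [0, 5], "end": [5, 10]})

    # each event row is matched to every span row satisfying both predicates
    out = events.join_where(spans,
                            pl.col("t") >= pl.col("start"),
                            pl.col("t") < pl.col("end"))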

59. ies7 ◴[] No.42199391{5}[source]
This one-liner example is one of the reasons why some people use pandas and some people despise it.

It's so easy for my analyst team because of daily use, but my developers probably would never have thought of/known about BME and would have implemented the code again themselves.

60. stainablesteel ◴[] No.42199832[source]
I fell on dark days when they changed the multi-index reference level=N, which worked perfectly, was so logical, and could be passed alongside the axis; it was swapped out in favor of a separate groupby call.
61. randomuser45678 ◴[] No.42200060[source]
Check out https://ibis-project.org/
62. fastasucan ◴[] No.42201883{3}[source]
>My conclusion was that pandas is not for developers. But for one-offs by managers, data-scientists, scientists, and so on. And maybe for "hackers" who cludge together stuff 'till it works and then hopefully never touch it.

It doesn't work for me so it can't work for anyone?

63. bdjsiqoocwk ◴[] No.42202116[source]
Nonsense. If you understand why df[df.sth == val] works, you'll see it's great. If you don't, you can also do df.query("sth == val").