FireDucks: Pandas but Faster

(hwisnu.bearblog.dev)

rich_sasha ◴[20 Nov 24 11:56 UTC] No.42193043[source]▶

>>42135303 (OP) #

It's a bit sad for me. I find the biggest issue for me with pandas is the API, not the speed.

So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great, it's just a shame for me that there seems to be no effort to make a Pandas-like thing with a better API. Maybe time to roll up my sleeves...

replies(22): >>42193093 #>>42193139 #>>42193143 #>>42193309 #>>42193374 #>>42193380 #>>42193693 #>>42193936 #>>42194067 #>>42194113 #>>42194302 #>>42194361 #>>42194490 #>>42194544 #>>42194670 #>>42195628 #>>42196720 #>>42197192 #>>42197489 #>>42198158 #>>42199832 #>>42200060 #

omnicognate ◴[20 Nov 24 12:07 UTC] No.42193143[source]▶

>>42193043 #

What about the polars API doesn't work well for your use case?

replies(1): >>42193572 #

1. short_sells_poo ◴[20 Nov 24 13:02 UTC] No.42193572[source]▶

>>42193143 #

Polars is missing a crucial feature for replacing pandas in Finance: first class timeseries handling. Pandas allows me to easily do algebra on timeseries. I can easily resample data with the resample(...) method, I can reason about the index frequency, I can do algebra between timeseries, etc.

You can do the same with Polars, but you have to start messing about with datetimes and convert the simple problem "I want to calculate a monthly sum anchored on the last business day of the month" to SQL-like operations.

Pandas grew a large and obtuse API because it provides specialized functions for 99% of the tasks one needs to do on timeseries. If I want to calculate an exponential weighted covariance between two time series, I can trivially do this with pandas: series1.ewm(...).cov(series2). I welcome people to try and do this with Polars. It'll be a horrible and barely readable contraption.
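For readers who have not seen the call, here is a minimal sketch of the exponentially weighted covariance mentioned above. The data is synthetic and `span=20` is an arbitrary illustrative choice, not something taken from the comment.

```python
import numpy as np
import pandas as pd

# Two synthetic series on business days (made-up data for illustration).
idx = pd.date_range("2024-01-01", periods=60, freq="B")
rng = np.random.default_rng(0)
series1 = pd.Series(rng.normal(size=60), index=idx)
series2 = 0.5 * series1 + pd.Series(rng.normal(size=60), index=idx)

# Exponentially weighted covariance between the two series.
# span=20 is an assumed smoothing window, chosen only for the example.
ewm_cov = series1.ewm(span=20).cov(series2)

print(ewm_cov.tail())
```

The result is a Series on the same timestamp index as the inputs, so it can be combined with other time series without any extra alignment step.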

YC is mostly populated by technologists, and technologists are often completely ignorant about what makes pandas useful and popular. It was built by quants/scientists, for doing (interactive) research. In this respect it is similar to R, which is not a language well liked by technologists, but it is (surprise) deeply loved by many scientists.

replies(4): >>42193914 #>>42193996 #>>42194053 #>>42210509 #

2. dkga ◴[20 Nov 24 13:49 UTC] No.42193914[source]▶

>>42193572 (TP) #

Exactly the single reason why I use pandas when I need to use python. But coming from R, it still feels like “second best”.

3. n8henrie ◴[20 Nov 24 14:01 UTC] No.42193996[source]▶

>>42193572 (TP) #

I don't know what exponential weighted covariance is, but I've had pretty good luck converting time series-based analyses from pandas to polars (for patient presentations to my emergency department -- patients per hour, per day, per shift, etc.). Resample has a direct (and easier IMO) replacement in polars, and there is group_by_dynamic.

I've had trouble determining whether one timestamp falls between two others across tens of thousands of rows (with the polars team suggesting I use a massive cross product and filter -- which worked but explodes the memory requirement), whereas in pandas I was able to sort the timestamps and thereby only need to compare against the preceding / following few based on the index of the last match.
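The sorted-lookup idea can be sketched in plain Python with a binary search. This is not the commenter's code: the intervals and the helper name `find_interval` are hypothetical, and the sketch assumes the intervals do not overlap.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical, non-overlapping intervals sorted by start time.
intervals = [
    (datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 12)),
    (datetime(2024, 1, 1, 14), datetime(2024, 1, 1, 18)),
    (datetime(2024, 1, 2, 8), datetime(2024, 1, 2, 12)),
]
starts = [start for start, _ in intervals]


def find_interval(ts):
    """Return the index of the interval containing ts, or None."""
    # Locate the last interval whose start is <= ts. Because the intervals
    # are sorted and disjoint, it is the only one that can contain ts.
    i = bisect_right(starts, ts) - 1
    if i >= 0 and ts <= intervals[i][1]:
        return i
    return None


events = [
    datetime(2024, 1, 1, 9),
    datetime(2024, 1, 1, 13),
    datetime(2024, 1, 2, 12),
]
matches = [find_interval(ts) for ts in events]
print(matches)  # [0, None, 2]
```

Each lookup costs a binary search instead of a comparison against every interval, which is why this avoids the memory cost of a full cross product.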

The other issue I've had with resampling is with polars automatically dropping time periods with zero events, giving me a null instead of zero for the count of events in certain time periods (which then gets dropped from aggregations). This has caught me a few times.
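For comparison, a small sketch of the pandas side only: `resample` creates every bin between the first and last timestamp, so an empty period appears with a count of 0. The arrival times are made up, and the polars behaviour described above is not reproduced here.

```python
import pandas as pd

# Hypothetical arrival times; nothing arrives during the 01:00 hour.
arrivals = pd.Series(
    1,
    index=pd.to_datetime(
        ["2024-01-01 00:05", "2024-01-01 00:40", "2024-01-01 02:15"]
    ),
)

# "60min" is used instead of an hourly alias because its spelling is the
# same across pandas versions. The empty 01:00 bin is kept with count 0.
per_hour = arrivals.resample("60min").count()
print(per_hour.tolist())  # [2, 0, 1]
```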

But other than that I've had good luck.

replies(2): >>42194736 #>>42198446 #

4. marcogorelli ◴[20 Nov 24 14:09 UTC] No.42194053[source]▶

>>42193572 (TP) #

Could you show how you write "calculate a monthly sum anchored on the last business day of the month" in pandas please?

replies(3): >>42194368 #>>42195644 #>>42198127 #

5. sebg ◴[20 Nov 24 14:50 UTC] No.42194368[source]▶

>>42194053 #

Not OP.

But I'm guessing it's something like this:

import pandas as pd

def calculate_monthly_business_sum(df, date_column, value_column):
    """
    Calculate monthly sums anchored to the last business day of each month

    Parameters:
    df: DataFrame with dates and values
    date_column: name of date column
    value_column: name of value column to sum

    Returns:
    DataFrame with sums anchored to last business day
    """
    # Ensure date column is datetime; assign() returns a new DataFrame,
    # so the caller's DataFrame is left unmodified
    df = df.assign(**{date_column: pd.to_datetime(df[date_column])})

    # Group by end of business month and sum
    monthly_sum = df.groupby(pd.Grouper(
        key=date_column,
        freq='BME'  # Business Month End frequency (pandas 2.2+; older versions use 'BM')
    ))[value_column].sum().reset_index()

    return monthly_sum

# Example usage:

df = pd.DataFrame({ 'date': ['2024-01-01', '2024-01-31', '2024-02-29'], 'amount': [100, 200, 300] })

result = calculate_monthly_business_sum(df, 'date', 'amount')

print(result)

Which you can run here => https://python-fiddle.com/examples/pandas?checkpoint=1732114...

replies(1): >>42194715 #

6. short_sells_poo ◴[20 Nov 24 15:27 UTC] No.42194715{3}[source]▶

>>42194368 #

It's actually much simpler than that. Assuming the index of the dataframe DF is composed of timestamps (which is normal for timeseries):

df.resample("BME").sum()

Done. One line of code and it is quite obvious what it is doing - with perhaps the small exception of BME, but if you want max readability you could do:

df.resample(pd.offsets.BusinessMonthEnd()).sum()

This is why people use pandas.
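A runnable sketch of the one-liner, using made-up data: one value per calendar day across March and April 2024, chosen because 31 March 2024 is a Sunday. The offset object is spelled out so the example does not depend on the "BME" alias, which needs pandas 2.2 or later.

```python
import pandas as pd

# Hypothetical data: the value 1.0 on every calendar day of Mar-Apr 2024.
idx = pd.date_range("2024-03-01", "2024-04-30", freq="D")
df = pd.DataFrame({"amount": 1.0}, index=idx)

# Monthly sum anchored on the last business day of each month.
monthly = df.resample(pd.offsets.BusinessMonthEnd()).sum()
print(monthly)
# The two bins are labelled 2024-03-29 and 2024-04-30, with sums 29.0 and 32.0.
```

The weekend of 30-31 March falls after the last business day of March, so those two rows are counted in the bin labelled 2024-04-30.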

7. short_sells_poo ◴[20 Nov 24 15:29 UTC] No.42194736[source]▶

>>42193996 #

I'm curious how polars group_by_dynamic is easier than resample in pandas. In pandas if I want to resample to a monthly frequency anchored to the last business day of the month, I'd write:

> my_df.resample("BME").apply(...)

Done. I don't think it gets any easier than this. Every time I tried something similar with polars, I got bogged down in calendar treatment hell and large and obscure SQL-like contraptions.

Edit: original tone was unintentionally combative - apologies.

replies(1): >>42204709 #

8. short_sells_poo ◴[20 Nov 24 16:37 UTC] No.42195644[source]▶

>>42194053 #

Answered the child comment but let me copy paste here too. It's literally one (short) line:

> df.resample("BME").sum()

Assuming `df` is a dataframe (ie table) indexed by a timestamp index, which is usual for timeseries analysis.

"BME" stands for BusinessMonthEnd, which you can type out if you want the code to be easier to read by someone not familiar with pandas.

replies(1): >>42199391 #

9. tomrod ◴[20 Nov 24 21:09 UTC] No.42198127[source]▶

>>42194053 #

A bit from memory as in transit, but something like df.groupby(df[date_col]+pd.offsets.MonthEnd(0))[agg_col].sum()
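One detail worth checking in a snippet like this: `MonthEnd` anchors on the calendar month end, while `BMonthEnd` anchors on the last business day. A small sketch of the difference on an arbitrary example date:

```python
import pandas as pd

ts = pd.Timestamp("2024-03-15")

# With n=0 the offset rolls forward to its anchor unless the date is
# already on it.
calendar_end = ts + pd.offsets.MonthEnd(0)   # Sunday 2024-03-31
business_end = ts + pd.offsets.BMonthEnd(0)  # Friday 2024-03-29

print(calendar_end.date(), business_end.date())
```

So for sums anchored on the last business day, swapping in `pd.offsets.BMonthEnd(0)` would be one option, with the caveat that dates falling after the last business day roll into the following month.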

10. cmdlineluser ◴[20 Nov 24 21:46 UTC] No.42198446[source]▶

>>42193996 #

> cross product and filter

`.join_where()`[1] was also added recently.

[1]: https://docs.pola.rs/api/python/stable/reference/dataframe/a...

11. ies7 ◴[20 Nov 24 23:45 UTC] No.42199391{3}[source]▶

>>42195644 #

This one-liner example is one of the reasons why some people use pandas and some people despise it.

It's so easy for my analyst team because they use it daily, but my developers will probably never think of or know about BME and will decide to implement the code again.

12. n8henrie ◴[21 Nov 24 14:38 UTC] No.42204709{3}[source]▶

>>42194736 #

Totally fair. And thank you for the rewording (sincerely). I haven't used polars for anything business or finance related, so this is likely one of many blind spots for me.

Reviewing my work, I only needed an hourly aggregation, which was similarly easy in polars and pandas (I misspoke about being easier) -- what I found easier was grouping by time data that wasn't amenable to `resample`.

In polars I had no problems using a regular group_by with a pl.col.dt object, whereas in pandas I remember struggling to do so, even though it seemed straightforward.
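For reference, one way the pandas side of such a grouping can look, as a hedged sketch with made-up data: a plain `groupby` keyed on a `.dt` accessor field (here hour of day), which is the kind of grouping `resample` does not cover. The column name `arrived` is hypothetical.

```python
import pandas as pd

# Hypothetical arrival timestamps.
visits = pd.DataFrame({
    "arrived": pd.to_datetime(
        ["2024-01-01 07:10", "2024-01-02 07:45", "2024-01-02 19:30"]
    ),
})

# Count visits per hour of day, pooled across dates.
per_hour_of_day = visits.groupby(visits["arrived"].dt.hour).size()
print(per_hour_of_day)
```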

Sorry, I wish I could remember more details; this was probably 5 years ago that I was writing the pandas code and just converted it to polars about a year ago, so it's possible that I just got better at python in the meantime (though I was writing much more python back then). And of course a rewrite is likely to feel easier the second time.

The other confounding issue is that the eager pandas code crashed with OOM regularly and took several minutes to run, whereas polars handles it very well (which I'm sure to some degree is it optimizing things that I could have done manually), but this made iterating on this codebase feel much less onerous.

13. bobbylarrybobby ◴[22 Nov 24 02:00 UTC] No.42210509[source]▶

>>42193572 (TP) #

Is LazyFrame.group_by_dynamic not basically the same thing?
