Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6i.xlarge is and how it's different from c6g.large.
I've been using Malloy [2], which compiles to SQL (like TypeScript compiles to JavaScript), so instead of editing a 1,000-line SQL script, it's only 18 lines of Malloy.
I'd love to see a blog post comparing a pandas approach to data cleaning with an SQL/Malloy approach.
[1] https://www.youtube.com/watch?v=PFUZlNQIndo
[2] https://www.malloydata.dev/
That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API that does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, and so on.
Disclaimer: I work for Polars on said query execution.
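To make that concrete, here's a minimal sketch of the lazy API; the file name and columns are made up. Nothing runs until .collect(), which lets the optimizer push the filter down into the Parquet scan:

    import polars as pl

    # Hypothetical sales.parquet with "year", "region", "revenue" columns.
    lazy = (
        pl.scan_parquet("sales.parquet")   # lazy scan: no data is read yet
        .filter(pl.col("year") == 2024)    # predicate gets pushed into the scan
        .group_by("region")
        .agg(pl.col("revenue").sum())
    )

    print(lazy.explain())  # prints the optimized plan, filter folded into the scan
    df = lazy.collect()    # only now does the optimized query actually run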
I have used Polars to process 600M of XML files (with a bit of a hack), and the Polars part of the code is readable with minimal comments.
Polars has a better API than pandas; at the least, the intent is easier to understand. (Laziness, yay.)
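A small side-by-side sketch of what I mean by intent, with made-up data; the pandas version mutates a frame across statements, while the Polars version is one pipeline that reads top to bottom:

    import pandas as pd
    import polars as pl

    data = {"group": ["a", "a", "b"], "value": [1.0, None, 3.0]}

    # pandas: intent spread over mutating statements
    pdf = pd.DataFrame(data)
    pdf["value"] = pdf["value"].fillna(0)
    out_pd = pdf.groupby("group", as_index=False)["value"].mean()

    # Polars: fill nulls, then aggregate, in a single expression pipeline
    out_pl = (
        pl.DataFrame(data)
        .with_columns(pl.col("value").fill_null(0))
        .group_by("group")
        .agg(pl.col("value").mean())
    )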
This article explains it pretty well: https://dynomight.net/numpy/
Take two examples of dataframe APIs, dplyr and ibis. Both can run on a range of SQL backends because dataframe APIs are very similar to SQL DML APIs.
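For instance, here's roughly what that looks like in ibis (table name and schema are invented, and I'm assuming ibis.to_sql's dialect argument, which recent versions support): one dataframe expression, compiled to whichever backend's SQL you ask for:

    import ibis

    # An unbound table with a made-up schema; no backend connection needed.
    t = ibis.table({"region": "string", "revenue": "float64"}, name="sales")

    expr = (
        t.filter(t.revenue > 0)
        .group_by("region")
        .aggregate(total=t.revenue.sum())
    )

    # The same dataframe expression compiles to different SQL dialects.
    print(ibis.to_sql(expr, dialect="duckdb"))
    print(ibis.to_sql(expr, dialect="postgres"))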
Moreover, the SQL translations of tools like R's pivot_longer are a good illustration of the dynamic operations dataframe APIs can support, the kind of thing you'd otherwise use something like dbt to implement in your SQL models. DuckDB allows dynamic column selection in UNPIVOT, but in some SQL dialects this is impossible; dataframe-API-to-SQL tools (or dbt) enable it in those dialects.
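Here's a sketch of DuckDB's dynamic unpivot through its Python API (the table and column names are made up): the query says "every column except id" instead of listing columns, so it keeps working when new year columns appear:

    import duckdb

    # A made-up wide table: one id column plus several year columns.
    duckdb.sql("""
        CREATE TABLE wide AS
        SELECT 'x' AS id, 1 AS y2022, 2 AS y2023, 3 AS y2024
    """)

    # UNPIVOT with dynamic column selection: no column list to maintain.
    print(duckdb.sql("""
        UNPIVOT wide
        ON COLUMNS(* EXCLUDE (id))
        INTO NAME year VALUE amount
    """))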