Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.
Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.
I've been using Malloy [2], which compiles to SQL (like Typescript compiles to Javascript), so instead of editing a 1000 line SQL script, it's only 18 lines of Malloy.
I'd love to see a blog post comparing a pandas approach to cleaning to an SQL/Malloy approach.
[1] https://www.youtube.com/watch?v=PFUZlNQIndo [2] https://www.malloydata.dev/
That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API which does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, etc, etc.
Disclaimer: I work for Polars on said query execution.
I have used Polars to process 600M of xml files (with a bit of a hack) and the polars part of the code is readable with minimal comments.
Polars has a better api than pandas, at least the intent is easier to understand. (lazyness, yay)
This article explains it pretty well: https://dynomight.net/numpy/
The original comment I responded to was confusing Pandas with Polars, and now your blog post refers to Numpy, but Polars takes a completely different approach to dataframes/data processing than either of these tools.