Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6i.xlarge is and how it's different from c6g.large.
I've been using Malloy [2], which compiles to SQL (like TypeScript compiles to JavaScript), so instead of editing a 1,000-line SQL script, it's only 18 lines of Malloy.
I'd love to see a blog post comparing a pandas approach to data cleaning with an SQL/Malloy approach.
[1] https://www.youtube.com/watch?v=PFUZlNQIndo
[2] https://www.malloydata.dev/
That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API that does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, and so on.
Disclaimer: I work for Polars on said query execution.
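To make that concrete, here's a minimal sketch of the lazy API; the file name and columns are made up. Nothing runs until .collect(), which lets the optimizer push the filter down into the Parquet scan:

    import polars as pl

    # Hypothetical sales.parquet with "year", "region", "revenue" columns.
    lazy = (
        pl.scan_parquet("sales.parquet")   # lazy scan: no data is read yet
        .filter(pl.col("year") == 2024)    # predicate gets pushed into the scan
        .group_by("region")
        .agg(pl.col("revenue").sum())
    )

    print(lazy.explain())  # prints the optimized plan, filter folded into the scan
    df = lazy.collect()    # only now does the optimized query actually run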
I have used Polars to process 600M of XML files (with a bit of a hack), and the Polars part of the code is readable with minimal comments.
Polars has a better API than pandas; at the least, the intent is easier to understand. (Laziness, yay.)
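A small side-by-side sketch of what I mean by intent, with made-up data; the pandas version mutates a frame across statements, while the Polars version is one pipeline that reads top to bottom:

    import pandas as pd
    import polars as pl

    data = {"group": ["a", "a", "b"], "value": [1.0, None, 3.0]}

    # pandas: intent spread over mutating statements
    pdf = pd.DataFrame(data)
    pdf["value"] = pdf["value"].fillna(0)
    out_pd = pdf.groupby("group", as_index=False)["value"].mean()

    # Polars: fill nulls, then aggregate, in a single expression pipeline
    out_pl = (
        pl.DataFrame(data)
        .with_columns(pl.col("value").fill_null(0))
        .group_by("group")
        .agg(pl.col("value").mean())
    )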
This article explains it pretty well: https://dynomight.net/numpy/
Take two examples of dataframe APIs, dplyr and ibis. Both can run on a range of SQL backends because dataframe APIs are very similar to SQL DML APIs.
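For instance, here's roughly what that looks like in ibis (table name and schema are invented, and I'm assuming ibis.to_sql's dialect argument, which recent versions support): one dataframe expression, compiled to whichever backend's SQL you ask for:

    import ibis

    # An unbound table with a made-up schema; no backend connection needed.
    t = ibis.table({"region": "string", "revenue": "float64"}, name="sales")

    expr = (
        t.filter(t.revenue > 0)
        .group_by("region")
        .aggregate(total=t.revenue.sum())
    )

    # The same dataframe expression compiles to different SQL dialects.
    print(ibis.to_sql(expr, dialect="duckdb"))
    print(ibis.to_sql(expr, dialect="postgres"))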
Moreover, the SQL translations of tools like R's pivot_longer are a good illustration of the dynamic operations dataframe APIs can support, the kind of thing you'd otherwise use something like dbt to implement in your SQL models. DuckDB allows dynamic column selection in UNPIVOT, but in some SQL dialects this is impossible; dataframe-API-to-SQL tools (or dbt) enable it in those dialects.
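Here's a sketch of DuckDB's dynamic unpivot through its Python API (the table and column names are made up): the query says "every column except id" instead of listing columns, so it keeps working when new year columns appear:

    import duckdb

    # A made-up wide table: one id column plus several year columns.
    duckdb.sql("""
        CREATE TABLE wide AS
        SELECT 'x' AS id, 1 AS y2022, 2 AS y2023, 3 AS y2024
    """)

    # UNPIVOT with dynamic column selection: no column list to maintain.
    print(duckdb.sql("""
        UNPIVOT wide
        ON COLUMNS(* EXCLUDE (id))
        INTO NAME year VALUE amount
    """))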