Polars Cloud and Distributed Polars now available

(pola.rs)

183 points jonbaer | 1 comments | 04 Sep 25 03:01 UTC


drej ◴[04 Sep 25 10:51 UTC] No.45125792[source]▶

Having done a bit of data engineering in my day, I'm growing more and more allergic to the DataFrame API (which I used 24/7 for years). From what I've seen over the past ~10 years, 90+% of use cases would be better served by SQL, both from the development perspective and for debugging, onboarding, sharing, migrating, etc.

Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.

replies(7): >>45125845 #>>45126294 #>>45127389 #>>45127993 #>>45128144 #>>45128518 #>>45134858 #

robertkoss ◴[04 Sep 25 11:01 UTC] No.45125845[source]▶

>>45125792 #

That is a false dichotomy. You can use SQL tools but still have to choose the instance type.

Especially when considering testability and composability, using a DataFrame API inside regular languages like Python is far superior IMO.

replies(2): >>45125950 #>>45126643 #

drej ◴[04 Sep 25 11:16 UTC] No.45125950[source]▶

>>45125845 #

As a user? No, I don't have to choose. What I'm saying is that analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies, JVM versions, cross-AZ pricing etc. In most cases, they should just get a connection string and/or a web UI to run their queries, everything abstracted from them.

Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.

replies(6): >>45126006 #>>45126053 #>>45126201 #>>45126376 #>>45126648 #>>45132599 #

robertkoss ◴[04 Sep 25 11:27 UTC] No.45126053[source]▶

>>45125950 #

You were talking about data engineering. If you do not write tests as a data engineer, what are you doing then? Just hoping that you don't fuck up editing a >1000-line SQL script?
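For what it's worth, a test for a SQL step can be small. Here is a minimal sketch, assuming the query can be run against an in-memory fixture. It uses Python's stdlib `sqlite3` as a stand-in engine (not Athena or Spark), and the `orders` table, the query, and the function names are all hypothetical:

```python
import sqlite3

# Hypothetical query standing in for one step of a long SQL script.
REVENUE_BY_CUSTOMER = """
    SELECT customer_id, SUM(amount) AS revenue
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer_id
    ORDER BY customer_id
"""


def run_query(rows):
    """Load fixture rows into an in-memory SQLite database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return conn.execute(REVENUE_BY_CUSTOMER).fetchall()
    finally:
        conn.close()


def test_refunded_orders_are_excluded():
    # Assumed business rule for this sketch: only 'paid' orders count.
    fixture = [(1, 10.0, "paid"), (1, 5.0, "refunded"), (2, 7.5, "paid")]
    assert run_query(fixture) == [(1, 10.0), (2, 7.5)]


test_refunded_orders_are_excluded()
```

Dialect differences mean a fixture test like this will not catch everything a real warehouse would, but it does catch the "edited the WHERE clause and broke it" class of mistakes.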

If you use Athena you still have to worry about shuffling and joining, it is just hidden. It is Trino / Presto under the hood, and if you click explain you can see the execution plan, which is essentially the same as looking into the Spark UI.
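To illustrate that a plan exists even when the tool hides it, here is a rough sketch using SQLite's `EXPLAIN QUERY PLAN` from Python's stdlib `sqlite3`. SQLite is only a stand-in here: it is a single-node engine, so its plan has no shuffle stages like Trino or Spark would show, and the tables and query are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    """
)

# Hypothetical join; the user only writes this, the engine picks the strategy.
query = """
    SELECT c.name, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.name
"""

# Each plan row ends with a human-readable description of one step.
plan_details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for detail in plan_details:
    print(detail)

conn.close()
```

The exact wording of the plan lines varies between SQLite versions, so treat the printed text as something to read rather than something to parse.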

Who cares about JVM versions nowadays? No one is hosting Spark themselves.

Literally every tool now supports DataFrame AND SQL APIs, and to me there is no reason to pick up SQL if you are familiar with a little bit of Python.

replies(2): >>45126253 #>>45127783 #

1. datadrivenangel ◴[04 Sep 25 14:36 UTC] No.45127783[source]▶

>>45126053 #

Way too many data engineers are running in clown mode, just eyeballing the results of 1000-line SQL scripts...

https://ludic.mataroa.blog/blog/get-me-out-of-data-hell/
