Polars Cloud and Distributed Polars now available

(pola.rs)

drej ◴[04 Sep 25 10:51 UTC] No.45125792[source]▶

Having done a bit of data engineering in my day, I'm growing more and more allergic to the DataFrame API (which I used 24/7 for years). From what I've seen over the past ~10 years, 90+% of use cases would be better served by SQL, both from the development perspective as well as debugging, onboarding, sharing, migrating etc.

Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.

replies(7): >>45125845 #>>45126294 #>>45127389 #>>45127993 #>>45128144 #>>45128518 #>>45134858 #

robertkoss ◴[04 Sep 25 11:01 UTC] No.45125845[source]▶

>>45125792 #

That is a false dichotomy. You can use SQL tools but still have to choose the instance type.

Especially when considering testability and composability, using a DataFrame API inside regular languages like Python is far superior IMO.

replies(2): >>45125950 #>>45126643 #

1. drej ◴[04 Sep 25 11:16 UTC] No.45125950[source]▶

>>45125845 #

As a user? No, I don't have to choose. What I'm saying is that analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies, JVM versions, cross-AZ pricing etc. In most cases, they should just get a connection string and/or a web UI to run their queries, everything abstracted from them.

Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.

replies(6): >>45126006 #>>45126053 #>>45126201 #>>45126376 #>>45126648 #>>45132599 #

2. ayhanfuat ◴[04 Sep 25 11:22 UTC] No.45126006[source]▶

>>45125950 (TP) #

I really doubt that Polars Cloud targets analysts doing ad-hoc analyses. It is much more likely aimed at people who build data pipelines for downstream tasks (ML etc).

replies(1): >>45126153 #

3. robertkoss ◴[04 Sep 25 11:27 UTC] No.45126053[source]▶

>>45125950 (TP) #

You were talking about data engineering. If you do not write tests as a data engineer, what are you doing then? Just hoping that you don't fuck up editing a >1000-line SQL script?

If you use Athena you still have to worry about shuffling and joining, it is just hidden. It is Trino / Presto under the hood, and if you click explain you can see the execution plan, which is essentially the same as looking into the Spark UI.
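To make the "execution plan is still there" point concrete, here is a rough sketch of asking a SQL engine for its plan. It is not Athena or Trino: SQLite stands in only because it ships with Python, and the `customers` and `orders` tables are hypothetical. Engines differ a lot in what their plans show, so treat this as an illustration of the idea rather than of any particular product.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)

query = """
    SELECT c.name, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.name
"""

# EXPLAIN QUERY PLAN returns one row per plan step; the last column
# is a human-readable description of that step (scan, search, sort...).
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for row in plan:
    print(row[-1])
```

The join order and access method are chosen by the engine, not by the person writing the query, yet they remain visible to anyone who asks for the plan.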

Who cares about JVM versions nowadays? No one is hosting Spark themselves.

Literally every tool now supports DataFrame AND SQL APIs, and to me there is no reason to pick up SQL if you are familiar with a little bit of Python.

replies(2): >>45126253 #>>45127783 #

4. ritchie46 ◴[04 Sep 25 11:40 UTC] No.45126153[source]▶

>>45126006 #

We also target ad-hoc analysis. If your data doesn't fit on your laptop, you can spin up a larger box or a cluster and run interactive queries.

5. ritchie46 ◴[04 Sep 25 11:46 UTC] No.45126201[source]▶

>>45125950 (TP) #

With Polars Cloud you don't have to choose those either. You can pick cpu/memory and we will offer autoscaling in a few months.

Cluster configuration is optional if you want this control. Anyhow, this doesn't have much to do with the query API, be it SQL or DataFrame.

6. drej ◴[04 Sep 25 11:54 UTC] No.45126253[source]▶

>>45126053 #

I was talking about data engineering, because that was my job and all analysts were downstream of me. And I could see them struggle with handling infrastructure and way too many toggles that our platform provided them (Databricks at the time).

Yes, I did write tests and no, I did not write 1000-line SQL (or any SQL for that matter). But I could see analysts struggle and I could see other people in other orgs just firing off simple SQL queries that did the same as the non-portable Python mess that we had to keep alive. (Not to mention the far superior performance of database queries.)

But I knew how this all came to be - a manager wanted to pad their resume with some big data acronyms and as a result, we spent way too much time and money migrating to an architecture that made everyone worse off.

7. mr_toad ◴[04 Sep 25 12:14 UTC] No.45126376[source]▶

>>45125950 (TP) #

Analysts don’t because it’s not part of the training & culture. If you’re writing tests you’re doing engineering.

That said the last Python code I wrote as a data engineer was to run tests on an SQL database, because the equivalent in SQL would have been tens of thousands of lines of wallpaper code.
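A minimal sketch of that pattern, assuming an in-memory SQLite database through Python's standard `sqlite3` module. The `orders` table, its columns, the `run_checks` helper, and the three rules are all hypothetical, picked only to show the shape of the approach: each check is a query that selects violating rows, and Python loops over the checks.

```python
import sqlite3


def run_checks(conn, checks):
    """Run each named SQL check; a check passes when its query returns no rows."""
    failures = {}
    for name, query in checks.items():
        bad_rows = conn.execute(query).fetchall()
        if bad_rows:
            failures[name] = bad_rows
    return failures


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, -5.0), (3, None, 12.5)],
)

# Each check selects the rows that violate a rule.
checks = {
    "amount_non_negative": "SELECT id FROM orders WHERE amount < 0",
    "customer_id_present": "SELECT id FROM orders WHERE customer_id IS NULL",
    "id_unique": "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1",
}

failures = run_checks(conn, checks)
for name, rows in failures.items():
    print(f"{name}: {len(rows)} violating row(s)")
```

The saving comes from the dictionary: checks can be generated per table or per column in a loop instead of being written out by hand as near-identical SQL statements.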

8. gigatexal ◴[04 Sep 25 12:47 UTC] No.45126648[source]▶

>>45125950 (TP) #

Again, the issue you’re having is the skill level of the audience you keep bringing up, not the tool.

replies(1): >>45126753 #

9. drej ◴[04 Sep 25 12:58 UTC] No.45126753[source]▶

>>45126648 #

I find it much more beneficial to lower the barrier for entry (oftentimes without any sacrifices) instead of spending time and money on upskilling everyone, just because I like engineering.

replies(1): >>45127438 #

10. gigatexal ◴[04 Sep 25 14:07 UTC] No.45127438{3}[source]▶

>>45126753 #

Right, but nobody is saying Polars or data frames are meant to replace SQL or are even for the masses. It’s a tool for skilled folks. I personally think the API makes sense, but SQL is easier to pick up. Use whatever tools work best.

But coming into such a discussion dunking on a tool cuz it’s not for the masses makes no sense.

replies(1): >>45127693 #

11. drej ◴[04 Sep 25 14:30 UTC] No.45127693{4}[source]▶

>>45127438 #

Read my posts again, I'm not complaining it's not for the masses, I know it isn't. I'm complaining that it's being forced upon people when there are simpler alternatives that help people focus on business problems rather than setting up virtual environments.

So I'm very much advocating for people to "[u]se whatever tools work best".

(That is - now I'm doing this. In the past I taught a course on pandas data analytics and spoke at a few PyData conferences and meetups, partly about dataframes and how useful they are. So I'm very much guilty of all of the above.)

replies(1): >>45128084 #

12. datadrivenangel ◴[04 Sep 25 14:36 UTC] No.45127783[source]▶

>>45126053 #

Way too many data engineers are running in clown mode, just eyeballing the results of 1000-line SQL scripts...

https://ludic.mataroa.blog/blog/get-me-out-of-data-hell/

13. gigatexal ◴[04 Sep 25 15:02 UTC] No.45128084{5}[source]▶

>>45127693 #

Who is doing the forcing? In my decade as a data engineer, I’ve not found a place that forced dataframes on would-be and capable SQL analysts.

14. riku_iki ◴[04 Sep 25 21:47 UTC] No.45132599[source]▶

>>45125950 (TP) #

> analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies,

I think this part (query optimization) is in general not solved/solvable, and it is sometimes/often (depending on the domain) necessary to dig into the details to get a data transformation working.
