
183 points jonbaer | 1 comments
drej ◴[] No.45125792[source]
Having done a bit of data engineering in my day, I'm growing more and more allergic to the DataFrame API (which I used 24/7 for years). From what I've seen over the past ~10 years, 90+% of use cases would be better served by SQL, from the development perspective as well as for debugging, onboarding, sharing, migrating, etc.

Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.

replies(7): >>45125845 #>>45126294 #>>45127389 #>>45127993 #>>45128144 #>>45128518 #>>45134858 #
robertkoss ◴[] No.45125845[source]
That is a false dichotomy. You can use SQL tools but still have to choose the instance type.

Especially when considering testability and composability, using a DataFrame API inside regular languages like Python is far superior IMO.

replies(2): >>45125950 #>>45126643 #
drej ◴[] No.45125950[source]
As a user? No, I don't have to choose. What I'm saying is that analysts (whom Polars Cloud targets, just like Coiled or Databricks) shouldn't have to worry about instance types, shuffling performance, join strategies, JVM versions, cross-AZ pricing, etc. In most cases, they should just get a connection string and/or a web UI to run their queries, with everything else abstracted away from them.

Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.

replies(6): >>45126006 #>>45126053 #>>45126201 #>>45126376 #>>45126648 #>>45132599 #
gigatexal ◴[] No.45126648[source]
Again, the issue you’re having is with the skill level of the audience you keep bringing up, not with the tool.
replies(1): >>45126753 #
drej ◴[] No.45126753[source]
I find it much more beneficial to lower the barrier to entry (oftentimes without any sacrifices) than to spend time and money on upskilling everyone just because I like engineering.
replies(1): >>45127438 #
gigatexal ◴[] No.45127438[source]
Right, but nobody is saying Polars or dataframes are meant to replace SQL, or are even for the masses. It’s a tool for skilled folks. I personally think the API makes sense, but SQL is easier to pick up. Use whatever tools work best.

But coming into such a discussion dunking on a tool cuz it’s not for the masses makes no sense.

replies(1): >>45127693 #
drej ◴[] No.45127693{3}[source]
Read my posts again: I'm not complaining that it's not for the masses; I know it isn't. I'm complaining that it's being forced upon people when there are simpler alternatives that help people focus on business problems rather than on setting up virtual environments.

So I'm very much advocating for people to "[u]se whatever tools work best".

(That is, I'm doing this now. In the past I taught a course on pandas data analytics and spoke at a few PyData conferences and meetups, partly about dataframes and how useful they are. So I'm very much guilty of all of the above.)

replies(1): >>45128084 #
1. gigatexal ◴[] No.45128084{4}[source]
Who is doing the forcing? In my decade as a data engineer, I’ve not found a place that forced dataframes on would-be and capable SQL analysts.