Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.
Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.
Especially when considering testability and composability, using a DataFrame API inside regular languages like Python is far superior IMO.
Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.
If you use Athena you still have to worry about shuffling and joining, it is just hidden.. It is Trino / Presto under the hood and if you click explain you can see the execution plan, which is essentially the same as looking into the SparkUI.
Who cares about JVM versions nowadays? No one is hosting Spark themselves.
Literally every tool now supports DataFrame AND SQL APIs and to me there is no reason to pick up SQL if you are familiar with a little bit of Python
Cluster configuration is optional if you want this control. Anyhow, this doesn't have much to do with the query API, be it SQL or DataFrame.
Yes, I did write tests and no, I did not write 1000-line SQL (or any SQL for that matter). But I could see analysts struggle and I could see other people in other orgs just firing off simple SQL queries that did the same as non-portable Python mess that we had to keep alive. (Not to mention the far superior performance of database queries.)
But I knew how this all came to be - a manager wanted to pad their resume with some big data acronyms and as a result, we spent way too much time and money migrating to an architecture, that made everyone worse off.
That said the last Python code I wrote as a data engineer was to run tests on an SQL database, because the equivalent in SQL would have been tens of thousands of lines of wallpaper code.
But coming into such a discussion dunking on a tool cuz it’s not for the masses makes no sense.
So I'm very much advocating for people to "[u]se whatever tools work best".
(That is - now I'm doing this. In the past I taught a course on pandas data analytics and spoke at a few PyData conferences and meetups, partly about dataframes and how useful they are. So I'm very much guilty of what all of the above.)
I think this part(query optimizations) in general not solved/solvable, and it is sometimes/often(depending on domain) necessary to digg into details to make data transformation working.