
Big Book of R

(www.bigbookofr.com)
288 points sebg | 2 comments
cye131 ◴[] No.43649039[source]
R, especially dplyr/tidyverse, is so underrated. Working in ML engineering, I see a lot of my coworkers suffering through pandas (or occasionally polars or even base Python without dataframes) to do basic analytics or debugging. It takes eons and gets complex so quickly that only the most rudimentary checks get done. Anyone working in data-adjacent engineering would benefit from having R/dplyr in their toolkit.
replies(6): >>43649143 #>>43649208 #>>43649881 #>>43650319 #>>43650677 #>>43683325 #
wwweston ◴[] No.43649208[source]
What’s the story for integrating R code into larger software systems (say, a SaaS product)?

I’m sure part of Python’s success is sheer mindshare momentum from being a common computing denominator, but I’d guess the integration story is part of the margins. Your back end may well already be in Python or have interop, reducing stack investment and systems tax.

replies(4): >>43649470 #>>43650307 #>>43650696 #>>43650785 #
dajtxx ◴[] No.43649470[source]
I am working on a system at present where the data scientist has done the calculations in an R script. We agreed upon an input data.frame and an output CSV as our 'interface'.

I added the SQL query to the top of the R script to generate the input data.frame, and my Python code reads the output CSV to do subsequent processing and storage into Django models.

I use a subprocess running Rscript to run the script.

It's not elegant but it is simple. This part of the system only has to run daily so efficiency isn't a big deal.
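
For anyone wanting to try the same pattern, here is a minimal sketch of the Python side, not the actual code from this system. The helper name `run_and_read_csv` and the file names `calculations.R` and `output.csv` are assumptions for illustration; only `subprocess.run` and the standard `csv` module are used.

```python
import csv
import subprocess
from pathlib import Path


def run_and_read_csv(command, output_csv, timeout=3600):
    """Run an external command, then load the CSV file it is expected to write.

    Hypothetical helper: `command` is an argument list such as
    ["Rscript", "calculations.R"] and `output_csv` is the agreed output path.
    """
    # check=True raises CalledProcessError when the script exits non-zero,
    # so a failed run is never mistaken for an empty result.
    subprocess.run(command, check=True, timeout=timeout)

    # Every value comes back as a string; type conversion is up to the caller.
    with Path(output_csv).open(newline="") as handle:
        return list(csv.DictReader(handle))


# Example call (assumes Rscript is on PATH and the script writes output.csv):
# rows = run_and_read_csv(["Rscript", "calculations.R"], "output.csv")
```

Passing the command as a list avoids shell quoting problems, and the timeout keeps a hung daily job from blocking the next run.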

replies(1): >>43650051 #
shoemakersteve ◴[] No.43650051{3}[source]
Any reason you're using CSV instead of parquet?
replies(2): >>43650618 #>>43650957 #
1. epistasis ◴[] No.43650618{4}[source]
CSV seems to be a natural and easy fit. What advantage could parquet bring that would outweigh the disadvantage of adding two new dependencies? (One in Python and one in R)
replies(1): >>43650664 #
2. pjacotg ◴[] No.43650664[source]
Not the OP, but I started using parquet instead of CSV because the types of the columns are preserved. At one point I was caching data to CSV, but every time I loaded the CSV again, the types of certain columns, such as datetimes, had to be set again.

I guess you'll need to decide whether this is a big enough issue to warrant the new dependencies.
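
A small sketch of the type loss described above, using only the Python standard library. The helper `csv_round_trip` and the sample row are made up for illustration. It shows the CSV half of the comparison only; a Parquet file carries a schema alongside the data, which is what lets a reader restore column types, but that needs a third-party library and is not shown here.

```python
import csv
import io
from datetime import datetime


def csv_round_trip(rows, fieldnames):
    """Write rows to CSV text and read them back, as a file cache would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


# Illustrative data: one integer column and one datetime column.
original = [{"id": 1, "created": datetime(2025, 4, 11, 9, 30)}]
restored = csv_round_trip(original, ["id", "created"])

# CSV stores text only, so both values come back as str.
print(type(restored[0]["created"]).__name__)  # str

# The datetime has to be parsed again by hand after every load.
reparsed = datetime.fromisoformat(restored[0]["created"])
```

Whether that re-parsing step is annoying enough to justify the extra dependencies is the trade-off being discussed.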