
Big Book of R

(www.bigbookofr.com)
288 points by sebg | 16 comments
cye131 ◴[] No.43649039[source]
R especially dplyr/tidyverse is so underrated. Working in ML engineering, I see a lot of my coworkers suffering through pandas (or occasionally polars or even base Python without dataframes) to do basic analytics or debugging, it takes eons and gets complex so quickly that only the most rudimentary checks get done. Anyone working in data-adjacent engineering work would benefit from R/dplyr in their toolkit.
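
For readers who haven't used it, the kind of quick check being described is a short pipeline in dplyr. A minimal sketch on invented data (column names are hypothetical):

    library(dplyr)

    # Toy request log standing in for the kind of data you'd sanity-check at work
    events <- data.frame(
      service    = c("api", "api", "web", "web", "api"),
      status     = c("ok", "error", "error", "ok", "error"),
      latency_ms = c(120, 340, 95, 110, 400)
    )

    events |>
      filter(status == "error") |>
      group_by(service) |>
      summarise(n = n(), p95_latency = quantile(latency_ms, 0.95), .groups = "drop") |>
      arrange(desc(n))
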
replies(6): >>43649143 #>>43649208 #>>43649881 #>>43650319 #>>43650677 #>>43683325 #
1. wwweston ◴[] No.43649208[source]
What’s the story for integrating R code into larger software systems (say, a SaaS product)?

I’m sure part of Python’s success is sheer mindshare momentum from being a common computing denominator, but I’d guess the integration story accounts for part of the margin. Your back end may well already be in Python or have interop with it, reducing stack investment and systems tax.

replies(4): >>43649470 #>>43650307 #>>43650696 #>>43650785 #
2. dajtxx ◴[] No.43649470[source]
I am working on a system at present where the data scientist has done the calculations in an R script. We agreed upon an input data.frame and an output CSV as our ‘interface’.

I added the SQL query to the top of the R script to generate the input data.frame and my Python code reads the output CSV to do subsequent processing and storage into Django models.

I use a subprocess running Rscript to run the script.

It's not elegant but it is simple. This part of the system only has to run daily so efficiency isn't a big deal.
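
A minimal sketch of what that kind of script can look like on the R side (the database driver, table, and file names here are placeholders, not the actual system):

    # analysis.R -- invoked from Python with something like:
    #   subprocess.run(["Rscript", "analysis.R"], check=True)
    library(DBI)

    # Build the agreed-upon input data.frame from a SQL query
    con   <- dbConnect(RSQLite::SQLite(), "warehouse.db")
    input <- dbGetQuery(con, "SELECT id, value, measured_at FROM readings")
    dbDisconnect(con)

    # ... the data scientist's calculations on `input` go here ...
    result <- aggregate(value ~ id, data = input, FUN = mean)

    # Write the agreed-upon output CSV for the Python side to read back
    write.csv(result, "output.csv", row.names = FALSE)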

replies(1): >>43650051 #
3. shoemakersteve ◴[] No.43650051[source]
Any reason you're using CSV instead of parquet?
replies(2): >>43650618 #>>43650957 #
4. kerkeslager ◴[] No.43650307[source]
This is, I think, the main reason R has lost a lot of market share to Pandas. As far as I know, there's no way to write even a rudimentary web interface (for example) in R, and if there is, I think the language doesn't suit the task very well. Pandas might be less ergonomic for statistical tasks, but when you want to do anything with the statistical results, you've got the entire Python ecosystem at your fingertips. I'd love to see some way of embedding R in Python (or some other language).
replies(2): >>43650403 #>>43650626 #
5. notagoodidea ◴[] No.43650403[source]
There are many ways, and the most common is Shiny (https://shiny.posit.co/), though with a bias towards data apps. That R doesn't have a Django-like framework or other web stacks the way Python does says more about R's users than about the language per se. Its background was replacing S, a proprietary statistics language, not competing with the Perl used for CGI and the early web. R is very powerful, is a Lisp in disguise, and has the same kind of infrastructure as Python that lets most libraries/packages use C under the hood.
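
For reference, a minimal Shiny app is only a few lines; a sketch with arbitrary inputs:

    library(shiny)

    ui <- fluidPage(
      sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
      plotOutput("hist")
    )

    server <- function(input, output) {
      output$hist <- renderPlot(hist(rnorm(input$n)))
    }

    shinyApp(ui, server)
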
replies(1): >>43650669 #
6. epistasis ◴[] No.43650618{3}[source]
CSV seems to be a natural and easy fit. What advantage could parquet bring that would outweigh the disadvantage of adding two new dependencies? (One in Python and one in R)
replies(1): >>43650664 #
7. djhn ◴[] No.43650626[source]
Plumber is a mature package for building an API in R.

https://www.rplumber.io/
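
A minimal sketch of what a Plumber endpoint looks like (the route and parameter are invented):

    # api.R -- each #* annotated function becomes an HTTP endpoint
    #* Return the mean of `n` random normal draws
    #* @param n number of draws
    #* @get /mean
    function(n = 100) {
      list(mean = mean(rnorm(as.integer(n))))
    }

    # Serve it from R (or an Rscript entry point):
    #   plumber::plumb("api.R")$run(port = 8000)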

For capital-P Production use I would still rewrite it in Rust (polars) or Go (stats). But that's only if it's essential to either achieve high throughput with concurrency or measure performance in nanoseconds vs microseconds.

replies(2): >>43650680 #>>43653776 #
8. pjacotg ◴[] No.43650664{4}[source]
Not the OP, but I started using parquet instead of CSV because the types of the columns are preserved. At one point I was caching data to CSV, but when you load the CSV again the types of certain columns, like datetimes, have to be set again.

I guess you'll need to decide whether this is a big enough issue to warrant the new dependencies.
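
A small illustration of the type-preservation point, sketched with the arrow package (one possible choice of R-side dependency; file names invented):

    library(arrow)  # the extra dependency being weighed here

    df <- data.frame(
      id          = 1:3,
      measured_at = as.POSIXct("2024-01-01 12:00:00", tz = "UTC") + 1:3
    )

    write.csv(df, "df.csv", row.names = FALSE)
    class(read.csv("df.csv")$measured_at)          # "character": datetime type is lost

    write_parquet(df, "df.parquet")
    class(read_parquet("df.parquet")$measured_at)  # "POSIXct" "POSIXt": type survives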

9. kerkeslager ◴[] No.43650669{3}[source]
> There are many ways, and the most common is Shiny (https://shiny.posit.co/), though with a bias towards data apps.

I tried Shiny a few years back and frankly it was not good enough to be considered. Maybe it's matured since then--I'll give it another look.

> That R doesn't have a Django-like framework or other web stacks the way Python does says more about R's users than about the language per se. Its background was replacing S, a proprietary statistics language, not competing with the Perl used for CGI and the early web.

I'm aware, but that doesn't address the problem I pointed out in any way.

> R is very powerful, is a Lisp in disguise, and has the same kind of infrastructure as Python that lets most libraries/packages use C under the hood.

Things I don't want to ever do: use C to write a program that displays my R data to the web.

10. kerkeslager ◴[] No.43650680{3}[source]
Plumber is the first solution to this problem I've seen that I'd actually use--it seems like I'd be calling the API from Python or perhaps JS on the frontend, but that's a pretty reasonable integration layer and I don't think that would be a problem.

Thanks for posting!

11. wodenokoto ◴[] No.43650696[source]
It's getting a lot better, but ten years ago "R in production" was something companies would describe with "so we figured out a way".

The problem is pinning dependencies. So while an R analysis written in base R 20 or 30 years ago still works fine, something using dplyr is probably really difficult to get up and running.

At my old work we took a copy of CRAN when we started a new project and added dependencies from that snapshot.

So instead of asking for dplyr version x.y, as you'd do ... anywhere else, we added dplyr as it and its dependencies were stored on CRAN on that specific date.
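
A sketch of how the same date-pinning idea can be expressed today, assuming a date-stamped snapshot repository such as Posit Package Manager (the date below is arbitrary):

    # Pin installs to how CRAN looked on a given date
    options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2023-06-01"))
    install.packages("dplyr")  # resolves dplyr and its dependencies as of that date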

We also did a lot of systems programming in R, which I thought was weird, but for the exact same reason you're describing for Python.

But R is really easy to install, so I don't see why you can't set up a step in your pipeline that does R - or even both R and Python. They can read dataframes from each other's memory.

replies(1): >>43653788 #
12. vhhn ◴[] No.43650785[source]
There are so many options to embed R in any kind of system. Thanks to the C API, there are connectors for any of the traditional languages. There are also Rserve and plumber for inter-process interaction. Managing dependencies is also super easy.
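
For the Rserve route, the R side of the setup is only a couple of lines; a sketch (the client code in the host language is not shown):

    # Start an Rserve daemon; client libraries for other languages connect over TCP
    library(Rserve)
    Rserve(args = "--no-save")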

My employer is using R to crunch numbers embedded in a large system based on microservices.

The only thing to keep in mind is that most people writing R are not programmers by trade, so it is good to have one person on the project who can refactor their code from time to time.

13. pletnes ◴[] No.43650957{3}[source]
Many of the reasons CSV is bad come down to not controlling both the reader and the writer. Here, if you're two people who collaborate OK, it should be fine.
14. _Wintermute ◴[] No.43653776{3}[source]
We tried plumber at work and ran into enough issues (memory leaks, difficulty wrangling JSON in R, poor performance) that I don't think I could recommend it.
replies(1): >>43655889 #
15. mrbananagrabber ◴[] No.43653788[source]
renv and rocker have really addressed these issues for using R in production:

https://rstudio.github.io/renv/index.html

https://rocker-project.org/images/
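
A sketch of the renv workflow being referred to (commands only; rocker supplies prebuilt R Docker images to run them in):

    # In the project, during development
    renv::init()       # create a project-local library plus renv.lock
    renv::snapshot()   # record the exact package versions in use

    # On the production machine / inside the Docker image
    renv::restore()    # reinstall exactly the versions recorded in renv.lock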

16. hadley ◴[] No.43655889{4}[source]
You might be interested in https://github.com/posit-dev/plumber2