Big Book of R

(www.bigbookofr.com)
288 points sebg | 33 comments
1. cye131 ◴[] No.43649039[source]
R, especially dplyr/tidyverse, is so underrated. Working in ML engineering, I see a lot of my coworkers suffering through pandas (or occasionally polars, or even base Python without dataframes) to do basic analytics or debugging. It takes eons and gets complex so quickly that only the most rudimentary checks get done. Anyone doing data-adjacent engineering work would benefit from having R/dplyr in their toolkit.
replies(6): >>43649143 #>>43649208 #>>43649881 #>>43650319 #>>43650677 #>>43683325 #
2. kasperset ◴[] No.43649143[source]
I love R and dplyr. It is very readable and easy to explain to non-programmers. I use it almost every day. Slightly off topic: I have difficulty debugging it. Maybe I need to brush up on debugging R. I'm not sure if there is an easy way to add a breakpoint when using VS Code.
replies(2): >>43649189 #>>43651350 #
3. JackeJR ◴[] No.43649189[source]
browser() ?
replies(1): >>43651046 #
4. wwweston ◴[] No.43649208[source]
What’s the story for integrating R code into larger software systems (say, a SaaS product)?

I’m sure part of Python’s success is sheer mindshare momentum from being a common computing denominator, but I’d guess the integration story is part of the margins. Your back end may well already be in Python or have interop, reducing stack investment and systems tax.

replies(4): >>43649470 #>>43650307 #>>43650696 #>>43650785 #
5. dajtxx ◴[] No.43649470[source]
I am working on a system at present where the data scientist has done the calculations in an R script. We agreed upon an input data.frame and an output csv as our 'interface'.

I added the SQL query to the top of the R script to generate the input data.frame and my Python code reads the output CSV to do subsequent processing and storage into Django models.

I use a subprocess running Rscript to run the script.

It's not elegant but it is simple. This part of the system only has to run daily so efficiency isn't a big deal.
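
The hand-off described above could look roughly like this on the Python side. This is a minimal sketch, not the poster's actual code: the helper names, the file names `calculate.R` and `results.csv`, and the idea of passing the output path as a command-line argument are all assumptions, and it presumes `Rscript` is on the PATH.

```python
import csv
import subprocess
from pathlib import Path


def build_command(script, output_csv, rscript="Rscript"):
    """Build the argument list for running an R script non-interactively."""
    # Assumption: the R script reads the output path from its arguments,
    # e.g. via commandArgs(trailingOnly = TRUE).
    return [rscript, str(script), str(output_csv)]


def run_r_script(script, output_csv, rscript="Rscript"):
    """Run the R script and raise if it exits with a non-zero status."""
    subprocess.run(build_command(script, output_csv, rscript), check=True)
    return Path(output_csv)


def read_results(csv_path):
    """Read the CSV written by the R script into a list of dicts.

    Every value comes back as a string, so types must be converted
    before the rows are stored anywhere.
    """
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical usage (needs R installed and a real script):
# rows = read_results(run_r_script("calculate.R", "results.csv"))
```

Using `check=True` means a failing R script stops the daily job instead of letting Python read a stale or missing CSV.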

replies(1): >>43650051 #
6. joshdavham ◴[] No.43649881[source]
Totally agreed that R is underrated. I'm sad that I stopped using it after graduation.
7. shoemakersteve ◴[] No.43650051{3}[source]
Any reason you're using CSV instead of parquet?
replies(2): >>43650618 #>>43650957 #
8. kerkeslager ◴[] No.43650307[source]
This is, I think, the main reason R has lost a lot of market share to Pandas. As far as I know, there's no way to write even a rudimentary web interface (for example) in R, and if there is, I think the language doesn't suit the task very well. Pandas might be less ergonomic for statistical tasks, but when you want to do anything with the statistical results, you've got the entire Python ecosystem at your fingertips. I'd love to see some way of embedding R in Python (or some other language).
replies(2): >>43650403 #>>43650626 #
9. vishnugupta ◴[] No.43650319[source]
As someone who is learning probability and statistics for recreation, I wholeheartedly agree. I wish I had come across R and dplyr/tidyverse/ggplot2 back in college while learning probability and stats. The subjects felt boring and like drudgery because I didn't know I could use R to play around with data.

Well, better late than never I guess.

replies(1): >>43652191 #
10. notagoodidea ◴[] No.43650403{3}[source]
There are a lot of ways, and the most common is Shiny (https://shiny.posit.co/), though it is biased towards data apps. The lack of a Django-like web stack, or of the other web stacks Python has, may say more about the users of R than about the language per se. Its original purpose was to replace S, which was a proprietary statistics language, not to compete with Perl, which was used for CGI and the early web. R is very powerful: it is a Lisp in disguise, coupled with the same kind of infrastructure that lets you use C under the hood, as Python does for most libraries/packages.
replies(1): >>43650669 #
11. epistasis ◴[] No.43650618{4}[source]
CSV seems to be a natural and easy fit. What advantage could parquet bring that would outweigh the disadvantage of adding two new dependencies? (One in Python and one in R)
replies(1): >>43650664 #
12. djhn ◴[] No.43650626{3}[source]
Plumber is a mature package for building an api in R.

https://www.rplumber.io/

For capital P Production use I would still rewrite it in Rust (polars) or Go (stats). But that’s only if it’s essential to either achieve high throughput with concurrency or measure performance in nanoseconds vs microseconds.

replies(2): >>43650680 #>>43653776 #
13. pjacotg ◴[] No.43650664{5}[source]
Not the op, but I started using parquet instead of CSV because the types of the columns are preserved. At one point I was caching data to CSV but when you load the CSV again the types of certain columns like datetimes had to be set again.

I guess you'll need to decide whether this is a big enough issue to warrant the new dependencies.
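
The type loss is easy to reproduce. A minimal sketch using only Python's standard `csv` module (the poster was presumably using a dataframe library; the sample row and the helper name `round_trip_csv` are made up for illustration):

```python
import csv
import io
from datetime import datetime


def round_trip_csv(rows, fieldnames):
    """Write rows to CSV text and read them back, as a cache file would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


original = [{"event": "deploy", "at": datetime(2025, 4, 11, 9, 30)}]
restored = round_trip_csv(original, ["event", "at"])

# CSV stores only text, so the datetime comes back as a plain string.
print(type(restored[0]["at"]).__name__)

# The column type has to be set again by hand after every load.
restored[0]["at"] = datetime.fromisoformat(restored[0]["at"])
print(restored[0]["at"] == original[0]["at"])
```

A format that stores a schema alongside the data, such as parquet, avoids the manual conversion step, at the cost of the extra dependencies mentioned above.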

14. kerkeslager ◴[] No.43650669{4}[source]
> There are a lot of ways, and the most common is Shiny (https://shiny.posit.co/), though it is biased towards data apps.

I tried Shiny a few years back and frankly it was not good enough to be considered. Maybe it's matured since then--I'll give it another look.

> The lack of a Django-like web stack, or of the other web stacks Python has, may say more about the users of R than about the language per se. Its original purpose was to replace S, which was a proprietary statistics language, not to compete with Perl, which was used for CGI and the early web.

I'm aware, but that doesn't address the problem I pointed out in any way.

> R is very powerful: it is a Lisp in disguise, coupled with the same kind of infrastructure that lets you use C under the hood, as Python does for most libraries/packages.

Things I don't want to ever do: use C to write a program that displays my R data to the web.

15. aquafox ◴[] No.43650677[source]
Why not mix R and Python in interactive analysis workflows?
1) Download Positron: https://github.com/posit-dev/positron
2) Set up a Quarto (.qmd) notebook
3) Set up R and Python code chunks in your Quarto document
4a) Use reticulate to spawn a Python session inside R and exchange objects between both languages (https://github.com/posit-dev/positron/pull/4603)
4b) Write a few helper functions that pass objects between R and Python by reading/writing a temporary file.
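
For option 4b, the Python half of such helpers might look like the sketch below. Everything here is an assumption made for illustration: JSON as the exchange format, the helper names `write_for_r` and `read_from_r`, and an R side that reads and writes the same file with a JSON package such as jsonlite.

```python
import json
import os
import tempfile


def write_for_r(records):
    """Write a list of row dicts to a temporary JSON file and return its path."""
    handle, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(handle, "w", encoding="utf-8") as out:
        json.dump(records, out)
    return path


def read_from_r(path):
    """Load a JSON file written by the R side and delete it afterwards."""
    with open(path, encoding="utf-8") as source:
        data = json.load(source)
    os.remove(path)
    return data
```

An array of row objects is used because it maps naturally onto a table on both sides, but any format both languages can read would do.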
replies(5): >>43650688 #>>43653111 #>>43656358 #>>43657369 #>>43690598 #
16. kerkeslager ◴[] No.43650680{4}[source]
Plumber is the first solution to this problem I've seen that I'd actually use--it seems like I'd be calling the API from Python or perhaps JS on the frontend, but that's a pretty reasonable integration layer and I don't think that would be a problem.

Thanks for posting!
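
Calling such an R service from Python could be as small as the sketch below. It assumes a Plumber API is already running locally on port 8000 and exposes a GET endpoint that returns JSON; the `/predict` path, the `dist` parameter, and the helper names are hypothetical and not part of Plumber itself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, path, params):
    """Compose the endpoint URL with URL-encoded query parameters."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}?{urlencode(params)}"


def call_r_api(base_url, path, params, timeout=10):
    """Send a GET request to the R service and decode its JSON response."""
    with urlopen(build_url(base_url, path, params), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage (requires the R service to be running):
# result = call_r_api("http://localhost:8000", "/predict", {"dist": 42})
```

Keeping the boundary at plain HTTP and JSON means the Python side needs no R-specific dependency at all.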

17. dkga ◴[] No.43650688[source]
This is exactly what I do for the vast majority of my academic papers. It combines the power and flexibility of R for statistics, which I agree with the upstream poster is incredibly underrated (especially with tidyverse), with Python.
18. wodenokoto ◴[] No.43650696[source]
It's getting a lot better, but 10 years ago R in production was something companies would describe with "so we figured out a way".

The problem is pinning dependencies. So while an R analysis written using base R 20 or 30 years ago works fine, something using dplyr is probably really difficult to get up and running.

At my old work we took a copy of CRAN when we started a new project and added dependencies from then.

So instead of asking for dplyr version x.y, as you'd do ... anywhere, we added dplyr as it and its dependencies were stored on CRAN on this specific date.

We also did a lot of systems programming in R, which I thought of as weird, but for the exact same reason as you are saying for Python.

But R is really easy to install, so I don't see why you can't set up a step in your pipeline that does R - or even both R and Python. They can read dataframes from each other's memory.

replies(1): >>43653788 #
19. vhhn ◴[] No.43650785[source]
There are so many options to embed R in any kind of system. Thanks to the C API, there are connectors for any of the traditional languages. There are also Rserve and plumber for inter-process interaction. Managing dependencies is also super easy.

My employer is using R to crunch numbers, embedded in a large system based on microservices.

The only thing to keep in mind is that most people writing R are not programmers by trade so it is good to have one person on the project who can refactor their code from time to time.

20. pletnes ◴[] No.43650957{4}[source]
Many of the reasons CSV is bad come down to not controlling both the reader and the writer. Here, if you’re two people who collaborate well, it should be fine.
21. disgruntledphd2 ◴[] No.43651046{3}[source]
trace subsumes browser, it's much more flexible and can be applied to library code without editing it.
replies(2): >>43653795 #>>43657274 #
22. itsmevictor ◴[] No.43651350[source]
Have you checked this extension? https://marketplace.visualstudio.com/items?itemName=RDebugge...
23. gnuly ◴[] No.43652191[source]
R was the first thing we had in our syllabus for (shallow) machine learning.

The ease of doing `model <- lm(speed~dist, cars)` and then `predict(model, data.frame(dist = c(42)))` is unparalleled.

24. _Wintermute ◴[] No.43653776{4}[source]
We tried plumber at work and ran into enough issues (memory leaks, difficulty wrangling JSON in R, poor performance) that I don't think I could recommend it.
replies(1): >>43655889 #
25. mrbananagrabber ◴[] No.43653788{3}[source]
renv and rocker have really addressed these issues for using R in production

https://rstudio.github.io/renv/index.html

https://rocker-project.org/images/

26. tylermw ◴[] No.43653795{4}[source]
trace is great for shimming in your own code to an existing function, but it’s not an interactive debugging tool.
replies(1): >>43655610 #
27. disgruntledphd2 ◴[] No.43655610{5}[source]
It sure is. If you set the second argument to browser you can step through any function.
28. hadley ◴[] No.43655889{5}[source]
You might be interested in https://github.com/posit-dev/plumber2
29. goosedragons ◴[] No.43656358[source]
Org mode in Emacs is even better at this IMO. The only downside is that there is no guarantee other people use Emacs too.
30. wdkrnls ◴[] No.43657274{4}[source]
Is there a way to trace an attribute of a function? I couldn't find one, but I'm curious if it exists. I was blocked by the fact that trace seemed to expect a name as a character string. Some functions in base R have functions in their attributes which modify their behavior (e.g. selfStart). I ended up just copying the whole code locally and then naming it, but for a better interactive experience I really wish there was a way to pass a function object, as I can with debug.
31. b-rodrigues ◴[] No.43657369[source]
I'm writing a package called rixpress that leverages Nix to build reproducible pipelines with targets in either R or Python

Here's the GitHub repository for the package: https://github.com/b-rodrigues/rixpress/tree/master

and here's an example pipeline https://github.com/b-rodrigues/rixpress_demos/tree/master/py...

32. fithisux ◴[] No.43683325[source]
Life saver. I do not use the raw dataframe API; it is inconsistent and error-prone.
33. p00dles ◴[] No.43690598[source]
Is this what tools like Nextflow or Snakemake aim to do? I don't know, and I'm genuinely curious, because I'm starting to work in bioinformatics, and doing different parts of an analysis pipeline in R and Python seems common, and really necessary if you want to use certain packages.

I'm wondering if I should devote time to learning Nextflow/Snakemake, or whether the solution that you outlined is "sufficient" (I say "sufficient" in quotes because of course, depends on the use case).