I know many people think otherwise, but I hate R for many reasons. Here are some of them:
- You can use '=' and '<-' to assign values to variables and both do the same, except in a few edge-cases where you now spend one week finding the error
- It confuses and mixes functional programming and oop not only per entity but also between the usage of them. Want to get a value of entity X? use x.getValue(). Want to get a value of entity Y? Use Y.getValue(y).
- The ide crashes once an hour and does not detect file-changes which forces you to restart it manually.
- People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.
Disclaimer: My big-data profs forced me to use R even for tasks where R should not be used.
I've been a heavy R user for about 7 years, and I only slightly disagree with one of your points.
(In my opinion) R is best for traditional statistics, as opposed to AI, machine learning, predictive analytics, data science, data analysis or any other variant thereof.
If you're more concerned with Chi-squared tests than unit tests, or if you need to teach a mathematician or a biologist how to fit regression models and analyse residuals, goodness-of-fit statistics, p-values etc, then R is the best language for the job.
If you need to build a program (as opposed to just do a thing), or if you're more interested in accuracy than inference (as per most machine learning tasks), then Python with sklearn and pandas blows R out of the water.
> Python with sklearn and pandas blows R out of the water.
For some things yes, but for others the reverse is true. I'm also a heavy R and python user and find the two ecosystems extremely complementary. For building pipelines and web apps, python has an edge. For statistics, graphics, and data management, R is IMO superior. You can do everything in either language, but have to jump through hoops in some cases. Sometimes the best solution is to use both!
For example, I run an internal web app for A/B testing using django and rpy2. Doing it all in python would have been sub-optimal because dataset management is so much simpler in R. Plots that were easy to do in ggplot2 were impossible to get right in matplotlib. The big drawback to this method is R's single-threaded architecture. Embedding R in a web server process is not easy (ask me!), and won't scale as well as a multi-threaded environment can.
All my data exploration and prototyping happens in R. Even basic report scripting can be done better in R than python because of the ease of data management. Consider a typical case of 1) run database query, 2) munge data around to produce a table, and 3) email or save to html. If you can't get exactly what you want from the database in one query and you have to do a lot of munging in step 2, then R is going to be more flexible than python. If I need to merge, aggregate, or recode variables, I would much rather use R. Doing all this with a list of lists "dataset" in python is convoluted at best, and means recreating a lot of the functionality that base R gives you.
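To make the "list of lists" complaint concrete, here is a minimal sketch of a merge plus aggregate written in plain Python with no data-frame library. The tables, column layout and the `merge_and_total` helper are invented for illustration; in base R the same job is roughly one `merge()` call followed by one `aggregate()` call.

```python
from collections import defaultdict

# Hypothetical "datasets" stored as lists of rows (values made up for illustration).
users = [[1, "control"], [2, "treatment"], [3, "treatment"]]  # user_id, group
orders = [[1, 10.0], [2, 15.0], [2, 5.0], [3, 30.0]]          # user_id, amount


def merge_and_total(users, orders):
    """Join orders to users on user_id, then sum the amount per group."""
    # Build a lookup table by hand; this plays the role of the join key index.
    group_of = {user_id: group for user_id, group in users}
    totals = defaultdict(float)
    for user_id, amount in orders:
        group = group_of.get(user_id)
        if group is None:
            # Order without a matching user: drop it, i.e. an inner join.
            continue
        totals[group] += amount
    return dict(totals)


totals = merge_and_total(users, orders)
print(totals)
```

Every extra requirement (a second key, a left join, a mean instead of a sum) means more hand-written bookkeeping, which is the point being made above.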
pandas is absolutely terrible compared to the dplyr, data.table, or even base R for data manipulation. And while you would have been right about Python being better for machine learning a couple of years ago, these days basically every popular machine learning library in Python (Tensorflow, keras, etc.) now has an API in R.
I also don't know why you are separating "traditional statistics", "predictive analytics", and "data analysis". They often are the exact same thing. In fact, it makes me wonder how much experience you have with statistics if you are under the impression that it is somehow different from data analysis "or any other variant thereof".
You are right on exactly one count: Python is superior for putting data analytics into production. And that isn't an insignificant advantage. A lot of data science today involves packaging an analysis into some larger program or product, and Python is absolutely better suited to that task.
But in virtually every other case (including lots of machine learning problems), R is either as good as or greatly superior to Python.
I did start my post with the words "in my opinion". I am not right or wrong about anything, and neither are you. We're mostly talking about syntax preferences here.
I'm separating out traditional statistics as an alias for statistical inference - make distributional assumptions, test them, estimate the effect of X on y and put a 95% confidence interval around it. That sort of stuff.
It's the stuff that absolutely does not matter if you're assessing the overall effectiveness of a classifier, and certainly isn't needed in a lot of data analysis tasks where all you need are variations of counts and percentages.
For the record, my academic background is maths and statistics. I've picked up any software development experience on the job.
> pandas is absolutely terrible compared to the dplyr, data.table, or even base R for data manipulation.
I would really like to hear a bit more about this, because this would greatly increase my motivation to learn more R. Specifically I've fiddled around with dplyr and it definitely feels more DSL-y but I didn't see a crazy benefit there. What are some of your favourite things about dplyr / data.table?
Took me a while to get back to you, but essentially dplyr is fantastic for readability and reproducibility. Reading through someone else's analysis, or even my own long after the fact, is orders of magnitude easier than base R, data.table, or pandas typically are.
data.table's advantage lies in its speed. It is by far the fastest of the three options. In just about every benchmark it either is significantly faster than pandas or at the very least is approximately equal.
Pandas is lauded by people who strictly use Python, and it really is fantastic considering how ridiculous data manipulation would be in Python without it. But it's also the only option a Python user really has, so they've become married to the idea that it is best.
Basically, if you are using Python, use pandas. If you have an option, go for data.table for speed, dplyr for clarity, or a mix of the two if desired.
What I really like about dplyr is how simple it is. It essentially provides an SQL like selection of verbs (select, mutate, summarise, arrange) and handles lots of things for you.
As an example, these two statements are equivalent:
You can then use the pipe operator %>% to funnel the results of one operator into the next.
The real advantages are that you can easily build up a sequence of functions which can be read from left to right (rather than right to left, as in summary(coef(mylm))) and that you need fewer temporary variables.
Pandas, on the other hand looks like base R (which is fine, but not as nice as dplyr).
However, the niceness of pipes does all fall apart when you have an error in the middle and you need to start deleting things in order to debug.
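The left-to-right versus inside-out reading order is not specific to R. As a rough sketch in Python, with a hypothetical `pipe` helper (it is not part of the standard library) and made-up rows, the same computation can be written both ways:

```python
from functools import reduce


def pipe(value, *steps):
    """Feed value through each step in order, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Made-up rows for illustration.
rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 9},
    {"name": "c", "score": 6},
]

# Nested style: has to be read from the inside out.
nested = sorted([r["score"] for r in rows if r["score"] > 4], reverse=True)

# Piped style: each step reads in the order it runs.
piped = pipe(
    rows,
    lambda rs: [r for r in rs if r["score"] > 4],  # filter
    lambda rs: [r["score"] for r in rs],           # select
    lambda xs: sorted(xs, reverse=True),           # arrange
)

print(nested, piped)
```

The debugging complaint carries over too: to inspect an intermediate result you have to cut the chain short or insert a step that prints its input.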
> People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.
Really? Maybe you worked with R before data.table, dplyr, and the tidyverse packages? I'm not that familiar with pandas in Python, but there is an incredible amount of productivity to be gained from knowing your way around a set of just around 5 packages in R that I never had when working with C++, Java, Perl, or Ruby.
Also, it could just be me, but I overcome the = and <- confusion by simply never using =.
I feel your pain! It took me a long time to get used to R. The only reason I tolerate it is I used SAS before that, so my point of comparison is an even more obtuse programming framework! Some general advice should you want to work with R some more:
- For assignment, always use '<-'. Read it as "set to". For example, "x <- runif(10)" means "set x to a vector of 10 uniform random numbers". When passing arguments in function calls, use '='.
- If the IDE gives you problems, try using the command line. RStudio or the R GUI app are not necessary. Simply type 'R' in a shell and you have an interactive read-line environment. Use the shell for exploratory work, then write code in your favorite editor and copy/paste after developing a series of commands you want to run.
- Use base R as much as possible, don't install a new package just for one function that you could do with base R functions, even if it's not elegant. Package bloat is one reason for inconsistencies in APIs. Some package developers will make you do x.getValue() and others getValue(x). But remember these are 3rd party packages. You can do a lot using just base R and a few select packages that are well respected (ggplot2, dplyr, Hmisc, reshape).
That's funny, I absolutely never use '<-'. Mostly because it's 2 characters, and because it's inconsistent with most languages. Haven't ever run into any issues because of '='.
I like your base R point, although like you say, some packages are simply essential.
I complain about this every time a post on R programming comes up here, but my favorite thing to hate (out of many) about R is that there's no way to find out what the directory of the current script is. Imagine someone would want to use relative paths to their data files so that they could version control their scripts and run them unmodified on different machines! We wouldn't want to enable such abominations now would we!
I think you need to reference the data files from the working directory, not the directory where the script currently is. The two aren't necessarily the same.
The current working directory can be found with getwd() and set with setwd().
If you set the working directory at the beginning of the script, paths to data files should be relative to that location.
Yes, but for example when running from within RStudio, or calling from other scripts, the two aren't the same. Calling from other scripts you can do chdir() first of course, but my point is that your script can't sensibly rely on the working directory and the script path being the same.
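For contrast, this is roughly the behaviour being asked for, sketched in Python, where a script can learn its own location from `__file__`. The `data_path` helper and the example paths are made up for illustration; a real script would pass `__file__` instead of a literal path.

```python
from pathlib import Path


def data_path(script_file, relative_name):
    """Resolve relative_name against the directory that contains script_file,
    not against the current working directory."""
    return Path(script_file).resolve().parent / relative_name


# A made-up script location stands in for __file__ so the sketch runs anywhere.
example = data_path("/tmp/project/analysis.py", "data/file.csv")

print("working directory:", Path.cwd())  # depends on where the process was started
print("data file        :", example)     # depends only on where the script lives
```

The two printed locations are independent of each other, which is exactly why relying on the working directory breaks when a script is launched from an IDE or from another script.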
I've actually noticed this and was totally blown out of the water by it. I understand you can use getwd() and setwd() but I thought you could simply do relative paths (similar to other languages) but it doesn't always work and I haven't figured it out.
For example, if you are loading a data.frame from a csv, my.df <- as.data.frame(read.csv("file.csv")) seems to work if the R script is in the same directory as the .csv. This is what I tend to do in .Rmd code chunks (which is my primary R workflow). It also tends to work across platforms which is handy as who knows what box I'm going to be hacking away on. However, R's preference for absolute paths in general I find very strange as I'm always on different machines with, of course, different directory structures. Isn't everyone?
Regardless, R is funky but I think I like it in a sort of awkward 'first date not sure yet' kind of vibe. I'm a noob and novice programmer otherwise though so who knows.
No, that gets you the working directory, which isn't always the same (like, when running from RStudio, getwd() returns the RStudio installation path IIRC).
If you run scripts non-interactively, you could try commandArgs? That should contain the file path. For RStudio maybe the rstudioapi package has a function like that...
Well yes, there are several workarounds; to the point that there are packages that wrap up all methods and try to decide which one is the correct one in the given invocation. This is the problem with R - there are many things for which you need only a single line to do something very complicated, but there are also many things that are just a tiny bit different from the standard cases, and are absurdly complex. Everything is just slapped together, without thought for the overall picture or overarching design.
Google stack overflow for 'R get current script path' some time, and weep not only at how often this is asked and upvoted (i.e., how many people suffer from this), but also at the suggestions offered - how divergent they are, and how complicated. But this is just one example. R is death by a thousand cuts.
not gonna argue that : D Been working with R for 3+ years now and totally second the "there are also many things that are just a tiny bit different from the standard cases, and are absurdly complex".
I can gradually move on to python at work now, which so far has been much more pleasant.
It always surprises me what you can end up doing in R though, but really shouldn't if you want to go to production : )
Yes, and now I want to also make it work when not invoked from RStudio, and for various R versions. So now I find myself wrapping all these options into a function, which I have to copy for every 10 line script. So then I make a package for it; or use the functions in someone else's package and add a dependency which I'm not sure will still work a year from now.
Or I could just use a sane language and go home in time for dinner.
(I mean I know about all the solutions and non-solutions; I've looked into this at least a dozen times over the last 5+ years. My point is that this shouldn't have been an issue in the first place.)
You are absolutely right, but then either your first post was misworded or I misunderstood the issue (most likely the latter), as there is a way to know the directory of the script.
Ah yes now I see - I said 'there's no way to find the current script' which isn't true. So that's probably what the others in this thread are also objecting to :) I guess what I meant was 'there's no sane way' or 'look at how hard it is to do this tiny thingy which anyone with a programming background would find so basic, they wouldn't even consider it might not exist'. So yeah, I did screw up on making my point there.
Nope! Base R works great. Old-school vi to edit scripts, and R base installation to run them (or REPL around). Of course, the IDEs do offer a lot of support, and RStudio is great for making your R functions into packages that are easy to share.
That was in reference to the rstudioapi package for finding the path of the current file, which, as I've just checked, needs a running RStudio session to work.
> - The ide crashes once an hour and does not detect file-changes which forces you to restart it manually.
There's more than one IDE for R [0], and strictly that's not a problem with R itself, but the people who built the IDE.
> - People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.
I think there's very little marketing behind R. It's predominantly a statistics package, but clearly a lot of statisticians are using it for data analysis. So I think it's the people using it for data analysis who talk about it, not bloggers paid to write about it.
I’ve known many quants who use both R and python/numpy/pandas for complementary tasks. The R standard library was generally spoken about in positive terms, but for data massaging and manipulation beyond pure maths/stats analysis a python environment probably offers much more flexibility.
Note that I don’t claim expertise in the above, but a bunch of very talented people I’ve worked directly with, and who were very directly incentivized to be productive, used R.
Perhaps your profs were trying to help you learn R, including its limitations, when they were setting you tasks?
This is a really big deal. In the first edition of Python for data analysis, they suggest using mean imputation. In case you don't know, this will totally break your variance calculations and thus any statistical tests.
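A minimal sketch of why mean imputation breaks the variance, using made-up numbers and only the Python standard library: the filled-in values sit exactly at the mean, so they add nothing to the sum of squared deviations while the sample size grows, and the estimated variance shrinks.

```python
import statistics

# Made-up example: five values were actually measured, five are missing.
observed = [2.0, 4.0, 4.0, 6.0, 9.0]
n_missing = 5

mean_observed = statistics.mean(observed)

# Mean imputation: replace every missing entry with the observed mean.
imputed = observed + [mean_observed] * n_missing

var_observed = statistics.variance(observed)  # sample variance of the real data
var_imputed = statistics.variance(imputed)    # sample variance after imputation

print("variance of observed values:", var_observed)
print("variance after imputation  :", var_imputed)
```

Because the variance is understated, standard errors come out too small and any test built on them looks more significant than the data justify.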
In the second edition, they suggest doing some interpolation. Meanwhile, in R land there are multiple ways (as always) to do useful multiple imputation which gets you a much more accurate analysis which makes better use of all of the data (mice, Amelia and mi are all good, and somewhat complementary).
That being said, I just thought of using PyTorch and a GAN to do multiple imputation, so maybe it's not impossible to do in Python. There is way, way less support for it though (but of course you could probably build in Numpy).
I guess the big difference is that R comes with a numpy equivalent (matrix), a pandas equivalent (data.frame and base), and a well-tested, numerically stable reference implementation of pretty much all widely used statistical models.
Like, I really don't understand why you wouldn't want to look at residuals, even if all you care about is prediction. Your predictions will be much more stable and accurate, and it can often inform you as to how to model things more appropriately.
Finally, R's formula interface is a thing of beauty. Honestly, why the hell do I need to generate a model matrix for regression/classification when I can get R to do it for me.
I will also say that R is a frustrating, domain-specific, really irritating, wonderful language. But then I'm a crazy person, I wrote a stockfighter client in R.
I agree that there are also some good parts with R.
But the argument "It's good because many people use it" is the one I heard most often when it comes to discussion about programming languages especially old ones like R and java.
I do not really disagree with you, except for the '<-' bit, just map it to a keyboard shortcut, and move on :).
But I would give R a try with the tidyverse, it made me go from hating R to just not caring about it.
While libraries are extremely inconsistent, if you want to use cutting edge statistical methods as a researcher, you pretty much have no other option. Finally, data wrangling is quite well developed in the R environment.
So long story short, after many years of hating R, now I just find it a handy tool to do my work despite it being old, inconsistent and sometimes annoying.
I don't get how a keyboard shortcut deals with the '<-' issue, which is the occasional and subtle difference in semantics from '='. Even if you can pick just one for your own work, it doesn't help with other people's code.
Also Tidyverse and data.table are the main reason for the sudden explosion of R's popularity. For me, I love the piping since I am an old time bash user, and | becoming %>% in R is the best thing for the way I think.
Typing '<-' is a fairly trivial matter, I think, compared to the semantic issues raised at the start of this thread, and covered in the question and answers below - for example (from the chosen answer): "R's syntax contains many ambiguous cases that have to be resolved one way or another. The parser chooses to resolve the bits of the expression in different orders depending on whether = or <- was used."
Just use the '<-' all the time, except in function calls.
If you can use '<-' with an easy key press you will not be tempted to use '='. And the problem is greatly reduced.
The namespacing drives me nuts but it doesn't get mentioned very often in these kinds of threads. How are people just ok with loading everything in the same namespace? You can use ::, but then that can have a ton of overhead.
Yeah, the namespacing thing is crazy. If it's any consolation, it was much, much worse before R 3.0. Originally, packages used to clobber each other's namespaces, which led to much hilariousness and non-deterministic bugs.
Now those hilarious bugs only happen at the REPL, which is a little better. If these kinds of bugs cause problems for you, I strongly recommend creating packages for your analysis/projects. It doesn't add that much complexity (with devtools, at least) and it does avoid a lot of these problems. Also, R packages require documentation, which is better than many other languages.
I've attempted to address the documentation issue at https://rdrr.io
The test coverage issue is what originally prompted me to get involved -- I was evaluating different EM solvers and found a lot of crazy obvious bugs (parameters backwards or totally ignored). There's a lot of room for improvement on the quality front. MRAN and Tidyverse are thankfully making some headway.
Not completely sure this is what you mean, but you can certainly run install.packages("package") right in your code, doesn't have to be done interactively. Usually want to first check if it is already installed, like with require("package").
nice list of R's WTFs. I've recently hit an issue that you access properties of "S4 classes" (what are those? "The S4 object system. R has three object oriented (OO) systems: [[S3]], [[S4]] and [[R5]].") using @ instead of $.
I think that the users/library creators are also partly to blame for why working with R is such a pain. Giving them the option to overload operators was a major mistake. C++ programmers are more often engineers who have more concern for the code reader and even they probably overuse it.
"You can use '=' and '<-' to assign values to variables and both do the same, except in a few edge-cases where you now spend one week finding the error"
Can you tell us a bit about those edge-cases which can lead to hard-to-find bugs?
- There is also '->' which can be even more confusing (or helpful if you use pipes)
- Aren't those methods defined by each package/object? Or you mean '@'?
- RStudio (the most used IDE) is one of the reasons that I use R so heavily. Never encountered your problems. I can even git checkout to another branch without problems and the new versions are loaded without problem.
- For a quick descriptive analysis or some tests I don't know anything easier, compared to SQL or Python. But that's probably only personal preference and/or knowledge of the language.
And I can see why some people like R. They are end users for whom the language was explicitly designed, so they like the ergonomics (to use the term Rustaceans are popularizing.)
The thing is, like Perl and Latex and other products you could think of, R was initially written by people with a good idea of the end uses and how to enable those end uses, but not a good idea on how to reconcile those ergonomics with the need for a clean parseable syntax.
So if you rely too extensively on R, you wind up having to hire someone like me.
I really enjoy R and the more I learn programming the more I enjoy it. Best thing I ever did was learn the language Racket and How to Design Programs and Hadley Wickham's tidyverse.
I moved from Python to R about six years ago. Before that I did most of my work in the command-line. R's rise in popularity has been caused by the libraries in the tidyverse and data.table, the millions of dollars invested into R by many companies, and an amazing ecosystem.
> You can use '=' and '<-' to assign values to variables and both do the same, except in a few edge-cases where you now spend one week finding the error
It's just R's semantics. It is almost universally recommended to just use <- for style consistency and to avoid the edge cases. Use the RStudio shortcut `Alt` and `-`. The reason you spent a week is the reason why it is recommended for all users to just use <-.
> It confuses and mixes functional programming and oop not only per entity but also between the usage of them. Want to get a value of entity X? use x.getValue(). Want to get a value of entity Y? Use Y.getValue(y).
R comes from S; the creators of R were also inspired by Scheme. Personally I learned the language Racket to be a better R programmer and I pretty much live in the functional side of R. I actually like the fact that they added a more functional core to R compared to S+. http://r.cs.purdue.edu/pub/ecoop12.pdf
> The ide crashes once an hour and does not detect file-changes which forces you to restart it manually.
Then use a different system. R does not equal RStudio. I have never experienced this and I have worked in Linux, Mac and Windows 7 - 10. To me RStudio is the best example of an electron app and the only IDE that I actually use the built in git feature. R Projects and RWorkbooks are the best features of RStudio.
> People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.
So R is just as easy to use for data analysis and has best in class statistics? Also I think tidyverse is much easier than any other data analysis system I have ever seen, but I guess that is just my opinion.
> It's a marketing-lie spread by the creators.
What do Ross Ihaka and Robert Gentleman have to gain from marketing, and what lie have they ever told? This falls into conspiracy theory.
There is a large community of R users and we like R. There are a ton of
Lightning (http://lightning-viz.org/) is a pretty cool interactive visualization server, with clients that work across multiple languages/environments. I've used it with R and Python and it's pretty slick.
I'm always a bit disappointed that yhat indiscriminately used 'ggplot' for the python package name. Using a variation on the name would have been more considerate.
I tried a few years ago and found that it didn't implement much and didn't work correctly. Since then I usually use something like rpy2 to run standard ggplot2 from Python.
Depends on the problem to solve.
For calculation-heavy tasks i prefer python.
But these days many use R to create shiny-apps which should be done in javascript instead.
That's one way to look at it.
The other perspective is that with shiny 'many' are able to create interactive apps to display and explore data which would have required 5x as much time to make with js +/- d3