I know many people think otherwise, but I hate R for many reasons. Here are some...

ploika · on Dec 7, 2017

I've been a heavy R user for about 7 years, and I only slightly disagree with one of your points.

(In my opinion) R is best for traditional statistics, as opposed to AI, machine learning, predictive analytics, data science, data analysis or any other variant thereof.

If you're more concerned with Chi-squared tests than unit tests, or if you need to teach a mathematician or a biologist how to fit regression models and analyse residuals, goodness-of-fit statistics, p-values etc, then R is the best language for the job.

If you need to build a program (as opposed to just do a thing), or if you're more interested in accuracy than inference (as per most machine learning tasks), then Python with sklearn and pandas blows R out of the water.

czep · on Dec 7, 2017

> Python with sklearn and pandas blows R out of the water.

For some things yes, but for others the reverse is true. I'm also a heavy R and python user and find the two ecosystems extremely complementary. For building pipelines and web apps, python has an edge. For statistics, graphics, and data management, R is IMO superior. You can do everything in either language, but have to jump through hoops in some cases. Sometimes the best solution is to use both!

For example, I run an internal web app for A/B testing using django and rpy2. Doing it all in python would have been sub-optimal because dataset management is so much simpler in R. Plots that were easy to do in ggplot2 were impossible to get right in matplotlib. The big drawback to this method is R's single-threaded architecture. Embedding R in a web server process is not easy (ask me!), and won't scale as well as a multi-threaded environment can.

All my data exploration and prototyping happens in R. Even basic report scripting can be done better in R than python because of the ease of data management. Consider a typical case of 1) run database query, 2) munge data around to produce a table, and 3) email or save to html. If you can't get exactly what you want from the database in one query and you have to do a lot of munging in step 2, then R is going to be more flexible than python. If I need to merge, aggregate, or recode variables, I would much rather use R. Doing all this with a list of lists "dataset" in python is convoluted at best, and means recreating a lot of the functionality that base R gives you.

v3gas · on Dec 7, 2017

Do you not use pandas?

thousandautumns · on Dec 7, 2017

pandas is absolutely terrible compared to dplyr, data.table, or even base R for data manipulation. And while you would have been right about Python being better for machine learning a couple of years ago, these days basically every popular machine learning library in Python (TensorFlow, Keras, etc.) now has an API in R.

I also don't know why you are separating "traditional statistics", "predictive analytics", and "data analysis". They often are the exact same thing. In fact, it makes me wonder how much experience you have with statistics if you are under the impression that it is somehow different from data analysis "or any other variant thereof".

You are right on exactly one count: Python is superior for putting data analytics into production. And that isn't an insignificant advantage. A lot of data science today involves packaging an analysis into some larger program or product, and Python is absolutely better suited to that task.

But in virtually every other case (including lots of machine learning problems), R is either as good as or greatly superior to Python.

ploika · on Dec 7, 2017

I did start my post with the words "in my opinion". I am not right or wrong about anything, and neither are you. We're mostly talking about syntax preferences here.

I'm separating out traditional statistics as an alias for statistical inference - make distributional assumptions, test them, estimate the effect of X on y and put a 95% confidence interval around it. That sort of stuff.

It's the stuff that absolutely does not matter if you're assessing the overall effectiveness of a classifier, and certainly isn't needed in a lot of data analysis tasks where all you need are variations of counts and percentages.

For the record, my academic background is maths and statistics. I've picked up any software development experience on the job.

makmanalp · on Dec 7, 2017

> pandas is absolutely terrible compared to dplyr, data.table, or even base R for data manipulation.

I would really like to hear a bit more about this, because this would greatly increase my motivation to learn more R. Specifically I've fiddled around with dplyr and it definitely feels more DSL-y but I didn't see a crazy benefit there. What are some of your favourite things about dplyr / data.table?

thousandautumns · on Dec 9, 2017

Took me a while to get back to you, but essentially dplyr is fantastic for readability and reproducibility. Reading through someone else's analysis, or even my own long after the fact, is orders of magnitude easier than base R, data.table, or pandas typically are.

data.table's advantage lies in its speed. It is by far the fastest of the three options. In just about every benchmark it either is significantly faster than pandas or at the very least is approximately equal.

Pandas is lauded by people who strictly use Python, and it really is fantastic considering how ridiculous data manipulation would be in Python without it. But it's also the only option a Python user really has, so they've become married to the idea that it is best.

Basically, if you are using Python, use pandas. If you have an option, go for data.table for speed, dplyr for clarity, or a mix of the two if desired.

disgruntledphd2 · on Dec 8, 2017

What I really like about dplyr is how simple it is. It essentially provides an SQL-like selection of verbs (select, mutate, summarise, arrange) and handles lots of things for you. As an example, these two statements are equivalent:

mydf$newvar <- with(mydf, oldvar1/oldvar2)

mydf <- dplyr::mutate(mydf, newvar=oldvar1/oldvar2)

You can then use the pipe operator %>% to funnel the results of one operation into the next.

The real advantages are that you can easily build up a sequence of functions which can be read from left to right (rather than right to left, as in summary(coef(mylm))), and that you need fewer temporary variables.

Pandas, on the other hand, looks like base R (which is fine, but not as nice as dplyr).

However, the niceness of pipes does all fall apart when you have an error in the middle and you need to start deleting things in order to debug.

makmanalp · on Dec 8, 2017

So in pandas it's kinda similar:

> df["newvar"] = df["oldvar1"] / df["oldvar2"]

And instead of the pipe, we have chaining, which is super straightforward and readable:

> df["newvar"] = (df["oldvar1"] / df["oldvar2"]).abs().rank().astype(str).str[:4]

and for more complex or non-chainable functions we have .pipe:

https://pandas.pydata.org/pandas-docs/stable/generated/panda...

which looks super similar to dplyr to me!
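A rough sketch of what that chaining can look like with .assign and .pipe. The data and the add_ratio helper are made up for illustration, and the dplyr comparisons in the comments are only approximate:

```python
import pandas as pd

# Hypothetical data; the column names follow the snippets above.
df = pd.DataFrame({"oldvar1": [9.0, -4.0, 1.0], "oldvar2": [3.0, 2.0, 2.0]})


def add_ratio(frame, num, den, out):
    """Return a copy of frame with a new column out = num / den."""
    return frame.assign(**{out: frame[num] / frame[den]})


result = (
    df
    .pipe(add_ratio, "oldvar1", "oldvar2", "newvar")  # roughly dplyr::mutate
    .assign(absvar=lambda d: d["newvar"].abs())        # a second mutate step
    .sort_values("absvar")                             # roughly dplyr::arrange
    .reset_index(drop=True)
)

print(result)
```

Each step returns a new DataFrame, so the original df is left untouched and the chain reads top to bottom much like a %>% pipeline.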

vhhn · on Dec 9, 2017

the data.table way:

mydt[, newvar := oldvar1/oldvar2]

I could not resist.

vijucat · on Dec 7, 2017

> People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.

Really? Maybe you worked with R before data.table, dplyr, and the tidyverse packages? I'm not that familiar with pandas in Python, but there is an incredible amount of productivity to be gained from knowing your way around a set of just around 5 packages in R that I never had when working with C++, Java, Perl, or Ruby.

Also, it could just be me, but I overcome the = and <- confusion by simply never using =.

czep · on Dec 7, 2017

I feel your pain! It took me a long time to get used to R. The only reason I tolerate it is I used SAS before that, so my point of comparison is an even more obtuse programming framework! Some general advice should you want to work with R some more:

- For assignment, always use '<-'. Read it as "set to". For example, "x <- runif(10)" means "set x to a vector of 10 uniform random numbers". When passing arguments in function calls, use '='.

- If the IDE gives you problems, try using the command line. R Studio or the R GUI app are not necessary. Simply type 'R' in a shell and you have an interactive read-line environment. Use the shell for exploratory work, then write code in your favorite editor and copy/paste after developing a series of commands you want to run.

- Use base R as much as possible; don't install a new package just for one function that you could do with base R functions, even if it's not elegant. Package bloat is one reason for inconsistencies in APIs. Some package developers will make you do x.getValue() and others getValue(x). But remember these are 3rd party packages. You can do a lot using just base R and a few select packages that are well respected (ggplot2, dplyr, Hmisc, reshape).

bllguo · on Dec 7, 2017

That's funny, I absolutely never use '<-'. Mostly because it's 2 characters, and because it's inconsistent with most languages. Haven't ever run into any issues because of '='.

I like your base R point, although like you say, some packages are simply essential.

roel_v · on Dec 7, 2017

I complain about this every time a post on R programming comes up here, but my favorite thing to hate (out of many) about R is that there's no way to find out what the directory of the current script is. Imagine someone would want to use relative paths to their data files so that they could version control their scripts and run them unmodified on different machines! We wouldn't want to enable such abominations now would we!

phillc73 · on Dec 7, 2017

I think you need to reference the data files from the working directory, not the directory where the script currently is. The two aren't necessarily the same.

The current working directory can be found with getwd() and set with setwd().

If you set the working directory at the beginning of the script, paths to data files should be relative to that location.

roel_v · on Dec 7, 2017

Yes but for example when running from within RStudio, or calling from other scripts, the two aren't the same. Calling from other scripts you can do chdir() first of course, but my point is that you can't sensibly rely in your script on cd and script path to be the same.

twostoned · on Dec 7, 2017

I've actually noticed this and was totally blown out of the water by it. I understand you can use getwd() and setwd() but I thought you could simply do relative paths (similar to other languages) but it doesn't always work and I haven't figured it out.

For example, if you are loading a data.frame from a csv, my.df <- as.data.frame(read.csv("file.csv")) seems to work if the R script is in the same directory as the .csv. This is what I tend to do in .Rmd code chunks (which is my primary R workflow). It also tends to work across platforms which is handy as who knows what box I'm going to be hacking away on. However, R's preference for absolute paths in general I find very strange as I'm always on different machines with, of course, different directory structures. Isn't everyone?

Regardless, R is funky but I think I like it in a sort of awkward 'first date not sure yet' kind of vibe. I'm a noob and novice programmer otherwise though so who knows.

geomark · on Dec 7, 2017

Maybe I am misunderstanding your question, but isn't that just getwd()?

roel_v · on Dec 7, 2017

No, that gets you the working directory, which isn't always the same (like, when running from RStudio, getwd() returns the RStudio installation path IIRC).

perseiden · on Dec 7, 2017

if you run scripts non interactively, you could try commandArgs? That should contain the file path. For Rstudio maybe the rstudioapi package has a function like that...

roel_v · on Dec 7, 2017

Well yes, there are several workarounds; to the point that there are packages that wrap up all methods and try to decide which one is the correct one in the given invocation. This is the problem with R - there are many things for which you need only a single line to do something very complicated, but there are also many things that are just a tiny bit different from the standard cases, and are absurdly complex. Everything is just slapped together, without thought for the overall picture or overarching design.

Google stack overflow for 'R get current script path' some time, and weep not only at how often this is asked and upvoted (i.e., how many people suffer from this), but also at the suggestions offered - how divergent they are, and how complicated. But this is just one example. R is death by a thousand cuts.

perseiden · on Dec 7, 2017

not gonna argue that : D Been working with R for 3+ years now and totally second the "there are also many things that are just a tiny bit different from the standard cases, and are absurdly complex".

I can gradually move on to python at work now, which so far has been much more pleasant. It always surprises me what you can end up doing in R though, but really shouldn't if you want to go to production : )

thousandautumns · on Dec 7, 2017

getwd() returns the working directory, which can be set with setwd(), even from within RStudio. I'm still not sure what the problem is.

vertere · on Dec 7, 2017

Not sure if it solves your problem but `source(file, chdir = TRUE)` can be useful.

Yeikoff · on Dec 7, 2017

rstudioapi::getActiveDocumentContext()$path

I believe that is what you are looking for.

roel_v · on Dec 7, 2017

Yes, and now I want to also make it work when not invoked from RStudio; and for various R versions. So now I find myself wrapping all these options into a function, which I have to copy for every 10-line script. So then I make a package for it; or use the functions in someone else's package and add a dependency which I'm not sure will still work a year from now.

Or I could just use a sane language and go home in time for dinner.

(I mean I know about all the solutions and non-solutions; I've looked into this at least a dozen times over the last 5+ years. My point is that this shouldn't have been an issue in the first place.)
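For comparison, the behaviour being asked for here (resolving data files relative to the script, whatever the working directory happens to be) is roughly what Python gives you through `__file__`. A minimal sketch; data_path and "file.csv" are hypothetical names used only for illustration:

```python
from pathlib import Path


def data_path(script_file, name):
    """Build a path to `name` in the same directory as the script.

    The result does not depend on the current working directory.
    """
    return Path(script_file).resolve().parent / name


# Inside a script this would typically be called as:
#     csv_path = data_path(__file__, "file.csv")
```

The same script can then be checked into version control and run unmodified from any directory or machine, which is the property the working-directory approach does not guarantee.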

Yeikoff · on Dec 7, 2017

You are absolutely right, but then either your first post was misworded or I misunderstood the issue (most likely the latter), as there is a way to know the directory of the script.

100% agree that R is not a sane language.

roel_v · on Dec 7, 2017

Ah yes now I see - I said 'there's no way to find the current script' which isn't true. So that's probably what the others in this thread are also objecting to :) I guess what I meant was 'there's no sane way' or 'look at how hard it is to do this tiny thingy which anyone with a programming background would find so basic, they wouldn't even consider it might not exist'. So yeah, I did screw up on making my point there.

_Wintermute · on Dec 7, 2017

So you need to have a specific IDE installed for this to work?

j_b_s · on Dec 7, 2017

Nope! Base R works great. Old-school vi to edit scripts, and R base installation to run them (or REPL around). Of course, the IDEs do offer a lot of support, and RStudio is great for making your R functions into packages that are easy to share.

_Wintermute · on Dec 7, 2017

That was in reference to the rstudioapi package for finding the path of the current file, which, I've just checked, needs a running RStudio session to work.

icc97 · on Dec 7, 2017

> - The ide crashes once an hour and does not detect file-changes which forces you to restart it manually.

There's more than one IDE for R [0], and strictly that's not a problem with R itself, but the people who built the IDE.

> - People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.

I think there's very little marketing behind R. It's predominantly a statistics package, but clearly a lot of statisticians are using it for data analysis. So I think it's the people using it for data analysis who talk about it, not bloggers paid to write about it.

[0]: https://stackoverflow.com/questions/1097367/what-ides-are-av...

tomalpha · on Dec 7, 2017

I’ve known many quants use both R and python/numpy/pandas for complementary tasks. The R standard library was generally spoken about in positive terms, but for data massaging and manipulation beyond pure maths/stats analysis a python environment probably offers much more flexibility.

Note that I don’t claim expertise in the above, but a bunch of very talented people I’ve worked directly with, and who were very directly incentivized to be productive, used R.

Perhaps your profs were trying to help you learn R, including its limitations, when they were setting you tasks?

darkhorn · on Dec 7, 2017

Can you handle missing data with Python? I don't think so.

https://www.statmethods.net/input/missingdata.html

https://stats.idre.ucla.edu/r/faq/how-does-r-handle-missing-...

https://www.amazon.com/Statistical-Analysis-Missing-Roderick...

disgruntledphd2 · on Dec 8, 2017

This is a really big deal. In the first edition of Python for data analysis, they suggest using mean imputation. In case you don't know, this will totally break your variance calculations and thus any statistical tests.
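To see the effect on a toy sample (the numbers are made up, and this is only a sketch using Python's standard library): filling the gaps with the observed mean leaves the mean alone but shrinks the sample variance, because the filled-in points add nothing to the sum of squares while still increasing n.

```python
import statistics

# Hypothetical sample; None marks a missing observation.
values = [2.0, 4.0, None, 8.0, None, 10.0]

observed = [v for v in values if v is not None]
mean_obs = statistics.mean(observed)

# Mean imputation: replace every missing value with the observed mean.
imputed = [mean_obs if v is None else v for v in values]

# Sample variances (n - 1 denominator) before and after imputation.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(round(var_observed, 2))  # 13.33
print(var_imputed)             # 8.0
```

Any test statistic built on that understated variance will look more significant than the data justify.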

In the second edition, they suggest doing some interpolation. Meanwhile, in R land there are multiple ways (as always) to do useful multiple imputation, which gets you a much more accurate analysis that makes better use of all of the data (mice, Amelia and mi are all good, and somewhat complementary).

That being said, I just thought of using PyTorch and a GAN to do multiple imputation, so maybe it's not impossible to do in Python. There is way, way less support for it though (but of course you could probably build it in NumPy).

I guess the big difference is that R comes with a numpy equivalent (matrix), a pandas equivalent (data.frame and base), and well-tested, numerically stable reference implementations of pretty much all widely used statistical models.

Like, I really don't understand why you wouldn't want to look at residuals, even if all you care about is prediction. Your predictions will be much more stable and accurate, and it can often inform you as to how to model things more appropriately.

Finally, R's formula interface is a thing of beauty. Honestly, why the hell do I need to generate a model matrix for regression/classification when I can get R to do it for me?
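To make the model-matrix point concrete, this is roughly the bookkeeping that a formula such as y ~ x + group spares you: an intercept column, the numeric predictor, and one dummy column per non-baseline level of the categorical variable (treatment coding, which is R's default for unordered factors). A plain-Python sketch; model_matrix, the data and the column names are all hypothetical:

```python
def model_matrix(rows, numeric, categorical):
    """Build an intercept + numeric + treatment-coded dummy design matrix."""
    levels = sorted({row[categorical] for row in rows})
    dummies = levels[1:]  # the first level acts as the baseline
    header = ["(Intercept)", numeric] + [f"{categorical}{lvl}" for lvl in dummies]
    matrix = [
        [1.0, float(row[numeric])]
        + [1.0 if row[categorical] == lvl else 0.0 for lvl in dummies]
        for row in rows
    ]
    return header, matrix


# Hypothetical records with one numeric and one categorical predictor.
rows = [
    {"x": 1.5, "group": "a"},
    {"x": 2.0, "group": "b"},
    {"x": 0.5, "group": "c"},
]

header, X = model_matrix(rows, "x", "group")
print(header)  # ['(Intercept)', 'x', 'groupb', 'groupc']
for line in X:
    print(line)
```

This handles only the simplest case; interactions, transformations and other contrast schemes are where doing it by hand gets tedious.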

I will also say that R is a frustrating, domain-specific, really irritating, wonderful language. But then I'm a crazy person, I wrote a stockfighter client in R.

realPubkey · on Dec 7, 2017

I agree that there are also some good parts with R.

But the argument "It's good because many people use it" is the one I hear most often when it comes to discussions about programming languages, especially old ones like R and Java.

thousandautumns · on Dec 7, 2017

Actually for data massaging and manipulation, R is absolutely superior to Python.

Yeikoff · on Dec 7, 2017

I do not really disagree with you, except for the '<-' bit, just map it to a keyboard shortcut, and move on :).

But I would give R a try with the tidyverse, it made me go from hating R to just not caring about it.

While libraries are extremely inconsistent, if you want to use cutting edge statistical methods as a researcher, you pretty much have no other option. Finally, data wrangling is quite well developed in the R environment.

So long story short, after many years of hating R, now I just find it a handy tool to do my work despite it being old, inconsistent and sometimes annoying.

mannykannot · on Dec 7, 2017

I don't get how a keyboard shortcut deals with the '<-' issue, which is the occasional and subtle difference in semantics from '='. Even if you can pick just one for your own work, it doesn't help with other people's code.

Tidyverse is new to me, I must check it out.

baldfat · on Dec 7, 2017

RStudio shortcut for <- is alt and -.

Also Tidyverse and data.table are the main reason for the sudden explosion of R's popularity. For me, I love the piping since I am an old time bash user, and | becoming %>% in R is the best thing for the way I think.

mannykannot · on Dec 7, 2017

Typing '<-' is a fairly trivial matter, I think, compared to the semantic issues raised at the start of this thread, and covered in the question and answers below - for example (from the chosen answer): "R's syntax contains many ambiguous cases that have to be resolved one way or another. The parser chooses to resolve the bits of the expression in different orders depending on whether = or <- was used."

https://stackoverflow.com/questions/1741820/assignment-opera...

Yeikoff · on Dec 7, 2017

Just use the '<-' all the time, except in function calls. If you can use '<-' with an easy key press you will not be tempted to use '='. And the problem is greatly reduced.

But yes, the problem is still there.

cshenton · on Dec 7, 2017

Add to that:

- documentation is all in PDF format

- can only install packages from the interpreter

- testing libraries not feature complete

- weird namespacing

- poor test coverage in popular packages

- no mature webserver

vertere · on Dec 7, 2017

The namespacing drives me nuts but it doesn't get mentioned very often in these kinds of threads. How are people just ok with loading everything in the same namespace? You can use ::, but then that can have a ton of overhead.

disgruntledphd2 · on Dec 8, 2017

Yeah, the namespacing thing is crazy. If it's any consolation, it was much, much worse before R 3.0. Originally, packages used to clobber each other's namespaces, which led to much hilariousness and non-deterministic bugs.

Now those hilarious bugs only happen at the REPL, which is a little better. If these kinds of bugs cause problems for you, I strongly recommend creating packages for your analysis/projects. It doesn't add that much complexity (with devtools, at least) and it does avoid a lot of these problems. Also, R packages require documentation, which is better than many other languages.

ianhowson · on Dec 7, 2017

I've attempted to address the documentation issue at https://rdrr.io

The test coverage issue is what originally prompted me to get involved -- I was evaluating different EM solvers and found a lot of crazy obvious bugs (parameters backwards or totally ignored). There's a lot of room for improvement on the quality front. MRAN and Tidyverse are thankfully making some headway.

geomark · on Dec 7, 2017

> can only install packages from the interpreter

Not completely sure this is what you mean, but you can certainly run install.packages("package") right in your code, doesn't have to be done interactively. Usually want to first check if it is already installed, like with require("package").

jrimbault · on Dec 7, 2017

I think cshenton wants to install packages without running the R interpreter. Think of pip install, cpan, cargo...

kgwgk · on Dec 7, 2017

> can only install packages from the interpreter

If the package has been downloaded, it can be installed with R CMD INSTALL

yread · on Dec 7, 2017

nice list of R's WTFs. I've recently hit an issue that you access properties of "S4 classes" (what are those? "The S4 object system. R has three object oriented (OO) systems: [[S3]], [[S4]] and [[R5]].") using @ instead of $.

I think that the users/library creators are also guilty of why working with R is such a pain. Giving them the option to overload operators was a major mistake. C++ programmers are more often engineers who have more concern for the code reader and even they probably overuse it.

thousandautumns · on Dec 7, 2017

I don't know why you think it's a marketing lie. I think R is hands down the best language for data analysis, and nothing else really comes close.

GordonS · on Dec 8, 2017

Maybe you could elaborate with why you feel this way?

hutzlibu · on Dec 7, 2017

"You can use '=' and '<-' to assign values to variables and both do the same, except in a few edge-cases where you now spend one week finding the error"

Can you tell us a bit about those edge-cases which can lead to hard-to-find bugs?

tugash · on Dec 7, 2017

- There is also '->' which can be even more confusing (or helpful if you use pipes)

- Aren't those methods defined by each package/object? Or you mean '@'?

- RStudio (the most used IDE) is one of the reasons that I use R so heavily. Never encountered your problems. I can even git checkout to another branch without problems and the new versions are loaded without problem.

- For a quick descriptive analysis or some tests I don't know anything easier, compared to SQL or Python. But that's probably only personal preferences and/or knowledge of the language.

ocschwar · on Dec 7, 2017

I do R code QA for a living now.

And I can see why some people like R. They are end users for whom the language was explicitly designed, so they like the ergonomics (to use the term Rustaceans are popularizing.)

The thing is, like Perl and LaTeX and other products you could think of, R was initially written by people with a good idea of the end uses and how to enable those end uses, but not a good idea of how to reconcile those ergonomics with the need for a clean parseable syntax.

So if you rely too extensively on R, you wind up having to hire someone like me.

baldfat · on Dec 7, 2017

Every language has its warts but R is actually the least disliked language https://stackoverflow.blog/2017/10/31/disliked-programming-l... and http://blog.revolutionanalytics.com/2017/11/r-is-the-least-d...

I really enjoy R and the more I learn programming the more I enjoy it. Best thing I ever did was learn the language Racket and How to Design Programs and Hadley Wickham's tidyverse.

I moved from Python to R about six years ago. Before that I did most of my work in the command-line. R's rise in popularity has been caused by the libraries in the tidyverse and data.table, the millions of dollars invested into R by many companies, and an amazing ecosystem.

> You can use '=' and '<-' to assign values to variables and both do the same, except in a few edge-cases where you now spend one week finding the error

It's just R's semantics. The near-universal advice is to just use <- for style consistency and to avoid the edge cases. Use the RStudio shortcut `Alt` and `-`. The reason you spend a week is the reason why it is recommended for all users to just use <-.

> It confuses and mixes functional programming and oop not only per entity but also between the usage of them. Want to get a value of entity X? use x.getValue(). Want to get a value of entity Y? Use Y.getValue(y).

R comes from S, and the creators of R were also inspired by Scheme. Personally I learned the language Racket to be a better R programmer and I pretty much live in the functional side of R. I actually like the fact that they added more of a functional core to R from S+. http://r.cs.purdue.edu/pub/ecoop12.pdf

> The ide crashes once an hour and does not detect file-changes which forces you to restart it manually.

Then use a different system. R does not equal RStudio. I have never experienced this and I have worked in Linux, Mac and Windows 7 - 10. To me RStudio is the best example of an electron app and the only IDE whose built-in git feature I actually use. R Projects and RWorkbooks are the best features of RStudio.

> People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.

So R is just as easy to use for data analysis and has best in class statistics? Also I think tidyverse is much easier than any other data analysis system I have ever seen, but I guess that is just my opinion.

> It's a marketing-lie spread by the creators.

What do Ross Ihaka and Robert Gentleman have to gain from marketing, and what lie have they ever told? This falls into conspiracy theory.

There is a large community of R users and we like R. There are a ton of

dcl · on Dec 7, 2017

Is there anything out there comparable to ggplot2 for high-quality plots?

jihadjihad · on Dec 7, 2017

Lightning (http://lightning-viz.org/) is a pretty cool interactive visualization server, with clients that work across multiple languages/environments. I've used it with R and Python and it's pretty slick.

icc97 · on Dec 7, 2017

http://ggplot.yhathq.com/

dandermotj · on Dec 7, 2017

I'm always a bit disappointed that yhat indiscriminately used 'ggplot' for the python package name. Using a variation on the name would have been more considerate.

ianhowson · on Dec 7, 2017

I tried a few years ago and found that it didn't implement much and didn't work correctly. Since then I usually use something like rpy2 to run standard ggplot2 from Python.

Hopefully it has improved.

_Wintermute · on Dec 7, 2017

matplotlib makes fantastic plots, it's just not a very nice API.

Seaborn is quickly becoming my favourite and a bit more similar to ggplot in terms of scope.

j88439h84 · on Dec 7, 2017

plotnine.readthedocs.io/en/latest/api.html

Gatsky · on Dec 7, 2017

tldr: you don't like R. I am no wiser than before I read this comment. So what do you use instead?

Regarding your 3rd point, I have never seen or heard anyone say R is the best, and I use it almost daily.

Zariff · on Dec 7, 2017

So you prefer Python, I assume?

realPubkey · on Dec 7, 2017

Depends on the problem to solve. For calculation-heavy tasks I prefer Python. But these days many use R to create shiny-apps which should be done in JavaScript instead.

Gatsky · on Dec 8, 2017

That's one way to look at it. The other perspective is that with shiny 'many' are able to create interactive apps to display and explore data which would have required 5x as much time to make with js +/- d3

sgt101 · on Dec 7, 2017

Shiny does use javascript; are you against the wrap?