This is interesting, but not really an R vs. Python comparison. It's an R vs. Pandas/Numpy comparison. For basic (or even advanced) stats, R wins hands down. And it's really hard to beat ggplot. And CRAN is much better for finding other statistical or data analysis packages.
But when you start having to massage the data in the language (database lookups, integrating datasets, more complicated logic), Python is the better "general-purpose" language. It is a pretty steep learning curve to grok the R internal data representations and how things work.
The better part of this comparison, in my opinion, is how to perform similar tasks in each language. It would be more beneficial to have a comparison of where Python/Pandas is good, where R is better, and how to switch between them. Another way of saying this is figuring out when something is too hard in R and it's time to flip to Python for a while...
I didn't get it working on my Linux machine, but you will definitely see some pull requests once I have time to fiddle with it. The electron version is a nice idea but I would prefer better instructions for installing the normal version. "This script will do it all" is not always helpful.
Thanks for the report. Yeah, it's not easy to install on Linux unless you use the Docker version. There are many dependencies and PPAs required in the script because it does everything.
We are working on better Linux packaging and distribution (see our issue tracker), but it is not easy to do right, and it will take a while.
FYI - I tried one of the Mac all in one downloads, and it looks promising. However, all I get are status messages saying that it is waiting for Python or R to initialize...
Thanks. We don't have an all-in-one download; you have to install Python or R separately. But if you already have them, it should just work as long as they are in your PATH (is that PATH set up in your .bash_profile?). Did you install the required R packages? Do you have IPython (not just Python)? We can probably debug this better by email or as a GitHub issue than in this forum.
They're quite different, though, and I can see why many prefer ggplot. It's a declarative, domain-specific language that implements a Tufte-inspired "grammar of graphics" (hence the gg- in the name; see section 1.3 of [1], and [2,3]) for very fast and convenient interactive plotting, whereas matplotlib is just a clone of MATLAB's procedural plotting API.
I've waxed lyrical about Python all over this thread, but here you have to give the medal to R. Matplotlib is one of my least favourite libraries to use; I've been using it for almost 2 years, and I still spend half my time buried in the documentation trying to figure out how I'm supposed to move the legend slightly to the right or whatever.
ggplot probably has slightly less flexibility overall (mpl is monolithic), but for just doing easy things that you need 99% of the time, ggplot is king.
There is a ggplot clone in Python. Also, Bokeh is starting to develop a grammar-of-graphics interface. Then there are seaborn and mbplot. Lots of stuff besides matplotlib.
I must admit that after a few years of using it, I still have to look up documentation for elementary things.
I am not familiar with ggplot, so I wasn't comparing them on ease of use. But looking at some ggplot examples, they seemed like something you could do with matplotlib as well, so I pointed that option out.
I was referring to the article title, which says it is an R vs. Python comparison. Python is so much more than R as a general-purpose language. Similarly, R is much more than Python in terms of built-in stats. I just thought it would be more accurate to call the article an R vs. Pandas/NumPy comparison.
Even though both of them need an extra plotting library to make publication quality plots. Matplotlib isn't bad by any means - and it's gotten better over the years. But R/ggplot2 produces nicer plots (IMO). I'm not sure that I'd export data from Python into R just for ggplot, but I might.
On paper perhaps, less so in application. Sure, you can probably make matplotlib do everything ggplot does with enough work, but working with ggplot is just so much quicker, easier, and more fun.
And I say that as someone who does all his data analysis in Python.
I don't have a lot of experience with either, but I was close to really digging in and learning R just for the ease of use of ggplot.
I tried the ggplot for python (ggplot.yhathq.com/) but eventually settled for seaborn (http://stanford.edu/~mwaskom/software/seaborn/). It is really quite easy to get most of the common plots that I wanted and hasn't let me down yet. The standard plots look SO much better than the standard plots of MPL without a lot of customization.
Well, for scientists wanting to publish, ggplot is quite impractical. Most of the time we have to publish in B&W journals, and ggplot simply lacks the capabilities to do so properly (for instance, B&W fill patterns).
Matplotlib with some good definitions ends up providing much better results and nicer-looking plots for B&W, unlike what people normally think.
... and I remembered why I don't use ggplot at all, thanks. After lots and lots of plots done with R, I was starting to feel a bit weird reading the comments.
R has the pandas/numpy/scipy functionality integrated into the language (for the most-used features at least), but that doesn't make much of a difference, because anyone who wants those tools will do a quick "pip install" to grab them (which is pretty fast with the new wheels system).
Out of curiosity, why do you consider CRAN to be much better than PyPI?
I'm only thinking about CRAN > PyPI in terms of statistical packages. CRAN is where new statistical analysis techniques / packages are initially published. If you're lucky they might get ported to Python after the fact. I didn't even mention Bioconductor, which is another beast entirely. There isn't an equivalent of Bioconductor for Python at all.
And the last time I checked, "pip install numpy" could be quite a pain, especially if you needed to compile dependencies. RStudio makes it ridiculously easy to install R and add packages.
However, for all other types of packages, PyPI is obviously superior. The breadth of packages on PyPI is much better than CRAN's.
R is certainly a unique language, but when it comes to statistics I haven't seen anything else that compares. Often I see this R vs. Python comparison being made (not that this particular article has that slant) as a "come drink the Python Kool-Aid; it tastes better" pitch.
Yes; Python is a better general purpose language. It is inferior though when it comes specifically to statistical analysis. Personally I don't even try to use R as a general purpose language. I use it for data processing, statistics, and static visualizations. If I want dynamic visualizations I process in R then typically do a hand off to JavaScript and use D3.
Another clear advantage of R is that it is embedded into so many other tools. Ruby, C++, Java, Postgres, SQL Server (2016); I'm sure there are others.
I'd say R is a _terrible_ language. Its types are just really different from every major programming language, and it's horrible for an experienced programmer to use.
I totally agree that R has fantastic libraries, but I'd like to see people focus on improving libraries for Python rather than sticking with R, which as a language is less well-designed than Python.
[I use R for most of my stats, I also use Matlab and Python]
I think you're wrong. R is an excellent language, targeted specifically around the problems you commonly see when doing data analysis. On the whole the standard libraries aren't particularly good, but I think the language is good.
That said, the language is often taught poorly. Here's my attempt to do better: http://adv-r.had.co.nz
(where you already commented, so it's not like this is something new...)
I would say that any language that does not have a facility to get the path of the current file is not 'excellent' under the criteria an experienced programmer would use for assessing it.
Now, I very well know that those criteria are different from what scientists use, but still...
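For contrast, here is a minimal Python sketch of the facility being described (base R has no direct built-in equivalent, which is the complaint):

```python
import os
import sys

# A script locating its own path: __file__ is set when Python runs a
# file; sys.argv[0] is a fallback for contexts where it is not defined
# (e.g., some embedded or interactive environments).
script_path = os.path.abspath(globals().get("__file__", sys.argv[0]))
print(script_path)
```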
I think R is a great language for certain applications - namely statistics and some data analysis. Your work has certainly made it better.
However, from a computer-language-design point of view, it leaves a lot to be desired. Its type system seems very complicated, and while the language tries to do what it thinks you want, it's not always clear what is going on (are you working on a matrix, or a data frame that has been cast to a matrix?).
For me, R is one of those languages that is good in a certain domain, but once you get out of that domain, it makes things more complicated than they need to be. It just isn't a general purpose language. By far, the biggest problems I've seen have been people who only know R (mainly stats people or biologists) try to do something in R that would be a quick 10 line Python/Perl/Ruby/whatever script.
Normally for a language design, you aim to make easy things easy, and difficult things possible. For R, it seems like it makes difficult things easy and easy things difficult. Maybe that's the tradeoff that was needed. :)
That said - please keep doing what you're doing. You've made my R work vastly easier.
I'm not qualified to comment on how good or bad a language R is. But it is maddening how package developers don't follow any convention for naming functions. I load a package I haven't used recently, and I know the function I want but can't remember if it is called my_function, myFunction, my.function, or MyFunction. Google published an R style guide, https://google-styleguide.googlecode.com/svn/trunk/Rguide.xm.... Does anybody follow it?
Hmm what do you mean about the types being different?
My experience was exactly the opposite: the first time I saw R syntax (actually, it was S-Plus back then...), I thought it was the most intuitive and powerful system I'd ever seen, and that was after fairly extensive experience in C and C++, as well as a few other languages.
Now, I don't quite think so any more, because there are many rather tricky things buried under the surface (e.g. how many people really understand how exactly environments work?) -- but the majority of R programmers will never have to deal with them in their code...
Also, I have definitely done general-purpose coding in R -- for a lot of things it is completely adequate. Python has more general-purpose functions and libraries of course, similarly to how R has more statistical ones.
I've used python for years, decided to teach myself R for a masters class I'm taking.
I have to disagree. Its main model is generic-function method dispatch. It can feel odd at first to someone coming from the C++ style of OO, where objects own methods rather than methods owning objects. But it's a legitimate OO style with its own advantages. [1]
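Python's `functools.singledispatch` gives a rough feel for the style, if R isn't handy; the `describe` function below is invented for illustration:

```python
from functools import singledispatch

# Generic-function dispatch: the generic owns its methods, and the
# implementation is chosen by the class of the argument at the call
# site, much like R's S3 summary() picking summary.lm, etc.
@singledispatch
def describe(x):
    return "some object"

@describe.register
def _(x: int):
    return "an integer"

@describe.register
def _(x: list):
    return f"a list of {len(x)} items"
```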
I've found that the more I use R, the more intuitive a lot of its operations are. It's relatively easy to "guess" what you ought to do to accomplish what you want, more so than with other languages I've learned.
When people argue that R is a terrific language, I remind them that it has 4 (four) object systems which differ in subtle ways from each other. It's a programmer's nightmare.
It's not the worst language in the world, but it isn't a terrific language either.
I'd also say the CRAN repository is awful: it discourages collaboration, and its packages are typically written by small groups of academics who write the worst documentation I have ever seen.
I blame the R documentation standards. They force a package author to produce a useless alphabetically-listed pdf, and many people just stop at that point.
Without any standards at all, people would have at least produced a readme.txt, which would have been a huge improvement -- e.g. I much prefer working with unfamiliar user-written Matlab packages :)
I don't know why so many people complain about R documentation, I think it's pretty good. The PDFs are useless for sure, but you don't have to use that. Emacs displays documentation pages in a split window. Or you can use a web browser.
I am (sort of); but most packages don't have vignettes. zoo and ggplot2 (and a few other major packages) have great documentation, but they are the exception.
Just to toss another name into the ring, I'd say that Fortran is pretty suitable for numeric calculations of all sorts.
I like R as a higher level language (or I guess tools like SPSS or preferably PSPP for even higher level stuff). These days I do most of my academia stuff with R (mostly hypothesis and equivalence testing and the things related to it like power analysis etc.)
I've never really looked into Python which is strange because I use it as a "glue language" quite often. I think I'll investigate Python a bit more next time I have to actually collect and clean up the data before using it. Right now I'm more of a consumer (mostly using data from our experiments that are turned into CSV)
Absolutely; modern Fortran is great and is syntactically rather close to Matlab (and to an extent R as well).
The main difficulty with Fortran is IMO the lack of an extensive standard library -- sure, you can find code out there to do almost anything, but then you need to figure out linking/calling conventions/possibly incompatible data models for each new library you bring in...
But, as another poster mentioned, it is quite straightforward to call Fortran from R :)
> Just to toss another name into the ring, I'd say that Fortran is pretty suitable for numeric calculations of all sorts.
> I like R as a higher level language (or I guess tools like SPSS or preferably PSPP for even higher level stuff). These days I do most of my academia stuff with R (mostly hypothesis and equivalence testing and the things related to it like power analysis etc.)
You can see R as a sort of glue language around libraries written in lower-level languages like C++, C, or Fortran (I believe a large part, if not all, of the matrix-operation functionality R uses for linear regressions and statistical analysis (PCA) is written in Fortran).
Fortran code runs much faster, but you don't want to use it for exploratory analysis ("I have this data about people; what if I filter out the people earning more than X before checking whether there is a correlation between the average age at which men get married and their incomes?").
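That kind of throwaway question is a one-liner in a high-level language; here is a pandas sketch with invented toy data and column names standing in for the "people" example:

```python
import pandas as pd

# Hypothetical data for the exploratory question above.
people = pd.DataFrame({
    "income":          [30_000, 45_000, 250_000, 60_000, 38_000, 52_000],
    "age_at_marriage": [27, 31, 40, 29, 26, 33],
})

# Filter out people earning more than X, then check the correlation.
X = 100_000
subset = people[people["income"] <= X]
r = subset["income"].corr(subset["age_at_marriage"])
```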
Could you provide an example in statistical analysis where Python is clearly inferior? In the article, R seems to have the advantage of many useful stat functions baked in, versus having to import specific modules in Python. I'm wondering if your proficiency in R is being weighed in your evaluation of R: maybe Python's statistical analysis tools have much to offer, but you are more aware of R's toolsets.
I'm primarily a Python user and can say that there's no contest that R has many packages that Python does not have an equivalent of yet. This includes stats stuff and especially finance/trading. Definitely not a showstopper for me but if I were to recommend one or the other to people at work with no programming skills, I would have to choose R for the breadth of existing packages.
>>but if I were to recommend one or the other to people at work with no programming skills, I would have to choose R for the breadth of existing packages.
My 2 cents: if someone has no programming background, then building a foundation in Python will allow them to do much, much more than building a foundation in R (unless, of course, they only care about statistical analysis and have no inclination to code more generally). I learned both at the same time even though I had no use for Python then (I was and still am a professor), but I use it almost every day now and very much enjoy it!
Agree completely. I should have qualified that: most at my spot/industry (finance) would be using it as an Excel replacement and just want to get things done; hence the value of existing packages.
Also, ML academics tend towards R for reference implementations of novel algorithms. They are often available in R first. This cuts both ways; sometimes the Python implementation that comes later misses some subtleties of the R implementation that the original authors nailed, and other times the R implementation is a proof of concept, while a later implementation is more real-world ready. But the latest and greatest tends to be available in R long before it has made its way into e.g. SciPy.
Great comparison. However, I find R's syntax obtuse and baroque, like a shovel with a compartment that carries tweezers. Advocates tend to argue that for moving dirt, this 'R' shovel is far more precise than an ordinary 'Python' shovel. But Python is in fact more like the toolshed in which both tools are housed, plus a whole lot more.
Well yeah, and I use them, but they're a band-aid over the fundamental problem that, just like in Perl, in R TIMTOWTDI (there is more than one way to do it). It's the classic "we have 12 standards, time to make a unifying one - now we have 13" problem. I've sort of gotten used to it now, but it was majorly difficult at first for me (after having programmed for nearly 20 years) to get used to the idea that any task can be done in 20 different ways, each one just as 'valid' or 'easy' or 'maintainable' as the others. At least in C++ there are 20 bad ways to do something and one good one - the way Sutter covered in his columns. I know it's not quite fair to compare 'just' the C++ programming language to R and all its packages, but still.
Just curious, what in particular did you find obtuse?
It's not like R does not have obtuse and baroque parts, it certainly does, and their obtus-ity is rather high, but IMO they are not parts of the language a casual user would likely encounter...
On the other hand, Python has quite a few pitfalls itself -- but I suspect a casual user would, for example, run into Python default arguments a bit sooner than she would run into R environments :)
R is a wonderful language if you chose to get used to it. I love it. I've even used R in production quality assurance to check for regressions in data (not the statistical regressions). I see countless R posts where people try to compare it to Python to find the one true language for working with data. Article after article, there clearly isn't a winner. People like R and Python for different reasons. I think it's actually quite intuitive to think about everything in terms of vectors with R. I like the functional aspects of R. I wish R was a bit faster but I am pretty sure the people who maintain R are working on that. You can't beat the enormous library that R has.
I also LOVE R. Plus, the fact that Microsoft and other corporations are supporting R will help more and more. With Hadley Wickham's universe, it is a great place to do all your work.
I spent a few weeks a few months ago learning R. It's not a bad language, and yes, the plotting is currently second-to-none, at least based on my limited experience with matplotlib and seaborn.
There are scant few articles on going from Python to R... and I think that has given me a lot of reason to hesitate. One of the big assets of R is Hadley Wickham... the amount and variety of work he has contributed is prodigious (not just ggplot2, but everything from data cleaning, web scraping, dev tools, and time-handling a la moment.js, to books). But that's not just evidence of how generous and talented Wickham is; it's also a sign of how relatively little dev support there is in R. If something breaks in ggplot2, or any of the many libraries he's involved in, he's often the one to respond to the ticket. He's only one person. There are many talented developers in R, but it's not quite a deep open-source ecosystem and community yet.
Also, word of warning: ggplot2 (as of 2014 [1]) is in maintenance mode, and Wickham is focused on ggvis, which will be a web visualization library. I don't know if there has been much talk about non-Hadley-Wickham people taking over ggplot2 and expanding it... it seems more that people are content to follow him into ggvis, even though a static viz library is still very valuable.
Thanks... I didn't know that (though I had been paying attention to bug fixes)... but that's my point exactly: he's prodigious, so maybe "maintenance mode" to him means "major features every 3 months instead of 2" :).
Also worth pointing out, he's actively working on a new book for ggplot2, which, AFAICT, he's providing for free (you just have to run the build tools)
I used to work a lot with R many years ago. I was shocked to find how bad the documentation was, and worse how rude and unfriendly the "community" of grumpy professors was. I shudder to think of the horrible meanness towards beginners asking questions on the mailing list.
I got so fed up I even wrote a book about R data visualisation. But this was all just around the time ggplot2 came out. Unfortunately I stopped using R soon after, but since then Hadley has single-handedly done more good for the language than anyone else.
I don't know what the R community is like now, and whether people like Hadley have made it friendlier, but it's clearly one reason Python is superior.
I'm a late arrival to the language and have interacted with it almost exclusively through StackOverflow and GitHub. I've been astonished at not just how friendly people are, but how quickly I can get a helpful response to even what I feel are pretty esoteric (and dumb) questions... again, one of the problems of coming into R is that, because of the relatively small community, there aren't as many references or easily Googlable answers compared to Python... but getting answers to the questions you do ask is very easy, and I think that's a credit to the community.
On the other hand, there seem to be a lot of useful libraries that haven't been ported over to GitHub or made easily accessible beyond CRAN... Many of them probably don't get as much exposure as they would if they were more easily discoverable... and I honestly don't even know where, in those cases, to start the bug-reporting/patching process. That's obviously the fault of my being spoiled by GitHub... but that's kind of the point: there's a bit more friction in contributing to R than you might find in Python/Ruby/etc.
The caveat on the ggplot2 book is that building it seems to be really hard because of the nightmare of cross-platform LaTeX. But there will be a physical book out early next year.
Does every language have so many of its main third-party packages heavily influenced by the work of one person? Wickham is to R as John Resig is to JavaScript, if Resig had also created and primarily maintained D3, moment.js, and Grunt... Wickham not only steers the libraries that define how a growing majority of R users do data manipulation (dplyr) and visualization; he's also building the tools he needs to maintain and publish them (devtools).
This isn't to say that there aren't other programmers doing brilliant work in R (also, R is just a smaller community overall), but he's devoting significant time to building out support tools and frameworks...this suggests that he is a total mensch, but also that there was a significant need that hadn't yet been addressed.
It does help that I'm one of the few people who are paid to work full-time on nothing but open-source R packages that are designed to broadly aid data analysis.
I would argue that most of the scientific & statistical packages for most languages are driven by at most a handful of people, yes.
Another interpretation is that R is an incredibly productive language for this sort of programming, otherwise one person couldn't write so much useful code. ;)
This is just a series of incredibly generic operations on an already-cleaned dataset in CSV format. In reality, you probably need to retrieve and clean the dataset yourself from, say, a database, and you may well need to do something non-standard with the data, which needs an external library with good documentation. Python is better equipped in both regards. Not to mention, if you're building this into any sort of product rather than just exploring, R is a bad choice. Disclaimer: I learned R before Python, and won't go back.
Exploring the data is maybe 99% of what data analysis is about. It's very much a trial and error process that can't be planned in advance, and R is in my opinion much better suited for that, with a better interactive interface, plotting system and statistical libraries.
On the other hand, if you know the exact calculations that you need to do and the results you're gonna get, then Python might be a better tool.
Personally I learned R after Python, and I use both languages, but I prefer R for anything involving statistics.
I think there are lots of good R libraries for getting data from various places: DBI (databases), haven (SPSS, Stata, SAS), readxl (xls & xlsx), httr (web apis), readr/data.table (flat files). (Disclaimer: I wrote/contributed to a lot of those).
I'm currently using both R and Python, having previously only used Python. At first I didn't like R for general-purpose data munging and web scraping. That was before I discovered a few R packages that make it a breeze. And now it's a toss-up for me. If it's an interactive data product that I'm building, I'll probably go with R. If I need data from an API and the supplier gives me only a Python sample script for accessing it, I'll go with Python.
Recently started using rvest for web scraping. Sweet bejeezus that's a pleasure. I would've never considered R for scraping before. It was always Python with BeautifulSoup.
I would also check out dplyr for data munging. Since most of the code in dplyr is written in C++ it is much faster than the munging capabilities you probably used when you were using R years ago.
Hmm I am curious, how would you do data cleaning without doing data exploration first -- and in what way do you find Python superior to R for that purpose?
Also I assume that by "something non-standard" you mean something other than a way to analyze it? Because there is really no comparison wrt available analysis packages between the two...
Not trying to say that R is perfect and great for everything, definitely not, I just have a hard time imagining a data-processing task for which I would choose Python over R (I might pick SAS over either one of them though...)
How does R compare to SAS? I work in engineering, and we use SAS pretty heavily for a lot of stuff (simple modelling, time-series forecasting, multiple regressions, that type of thing). One thing I really like is how well integrated SQL is. Does R have something similar to PROC SQL? That is really the killer feature of SAS for me.
I use SAS professionally at my job, and R in all my academic/hobby work. R has a couple of packages that give similar functionality to PROC SQL (about 95% of my SAS workflow, since it's far nicer than data steps for a lot of things). There's an ODBC package (RODBC), as well as sqldf, which allows you to use SQL queries to manipulate data frames in R.
While there is (almost?) always a way to do a SQL query using idiomatic R, I have to admit that sometimes my brain thinks up a solution in SQL faster (a product of upbringing).
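For what it's worth, the Python side has a rough analogue of the same idea using only the standard library's sqlite3 plus pandas: push a data frame into an in-memory database and query it with SQL (table and column names here are invented):

```python
import sqlite3
import pandas as pd

# A small invented data frame.
df = pd.DataFrame({"dept": ["a", "a", "b"], "sales": [10, 20, 5]})

# Load it into an in-memory SQLite database and query with SQL,
# roughly what R's sqldf does under the hood.
con = sqlite3.connect(":memory:")
df.to_sql("sales", con, index=False)
totals = pd.read_sql(
    "SELECT dept, SUM(sales) AS total FROM sales GROUP BY dept ORDER BY dept",
    con,
)
```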
I have to agree that Python is more powerful, and I am indeed doing more and more in Python. Python was my first language, before R.
However, when the dataset is medium-sized (i.e., fits into half your computer's memory), R crushes Python (and Pandas) for the 80% of the time you'll spend wrangling. The reason is that R is vector-based from the ground up. Pandas does everything that R does, but in a less consistent, grafted-on way, whereas the experienced R person who "thinks vectors" is way ahead of the Python guy before the analysis has even started (i.e., most of the work). I know both really well. I use Python when I want to get (semi-)serious production-wise (I qualify with "semi" because if you're really serious about production, you're probably going to go to Scala).
But when it comes to taking a big chunk of untidy data and bashing it around till it's clean and cube-shaped, will parse, and has no obvious errors, R is miles ahead of Python. R is where you do your discovering. Python can do it too, but I would estimate the cognitive overhead as double.
By the way, that's why people who "think time series" all day long (i.e., vectors, not objects), and who want to implement their algos rather than think CS, will typically build them in R first, which is why CRAN beats Python every time for off-the-shelf data analysis packages. Data people go to R, computer people go to Python (schematizing).
R is slow. That's its main problem. And that's saying something when comparing it to Python! But the gem of vector-everything makes it a much more satisfying language than imperative, OO Python when it comes to a world of data first, code second.
Finally, I'd add that Python 3.x is arguably distancing itself from the pragmatism which data science requires, and which 2.x provided, towards a world of CS purity. It's not moving in a direction which is data-science friendly. It's moving towards a world of competition with Golang and JavaScript, and Java itself.
If you haven't already, you might want to take a look at Julia. It's extremely fast, and has more native support for vectors than Python. It's still immature, but I think it has great potential as the truly great language for scientific/data computing.
Vector operations are not slow - they are basically the same as in Python/R (compiled down to C).
However, devectorization (i.e. replacing vector ops with a for-loop) is sometimes a performance improvement because Julia can usually provide C-like speeds in for-loops and avoid creating intermediate arrays.
Julia's for loops are comparable to C in performance, and its vectorized operations are comparable to Numpy/R, although some cases can be optimized using https://github.com/lindahua/Devectorize.jl (see the benchmarks table)
That had worried me a couple of years ago. JMW showed that vectorized code was much slower in Julia as well (though both were still faster than R - but that's not difficult).
Glad to see Julia is very fast in both cases, though it's still somewhat perplexing the extent to which vectorized code is necessarily slower. I'm thinking that the future of GPU enabled languages will mean vectorized code will be faster, so I prefer languages with a bias towards vectorisation.
> it's still somewhat perplexing the extent to which vectorized code is necessarily slower
The vectorized code typically allocates all kinds of intermediate results (more GC, more memory accesses). Apparently, turning it into loops is less trivial than it seems.
> I'm thinking that the future of GPU enabled languages will mean vectorized code will be faster, so I prefer languages with a bias towards vectorisation.
I share that concern. Julia has some libraries to support GPU programming, but I don't know of any plans to have the core compiler take advantage of it.
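The intermediate-results point is visible even in NumPy: the expression form allocates a throwaway temporary per operation, while a preallocated in-place form reuses one buffer, which is roughly the saving a devectorized loop buys you. A sketch:

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Vectorized expression: allocates a temporary for x * 2, then a
# second array for the final result.
y = x * 2 + 1

# In-place form: one preallocated buffer, no throwaway temporaries.
out = np.empty_like(x)
np.multiply(x, 2, out=out)
np.add(out, 1, out=out)
```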
Do you mean they use Python because it's faster? Yes, sure. But then, just use Scala: 10x faster again, with a REPL.
Perhaps I should clarify: I'm talking mainly about time series and/or data which is vectorizable. Python is better if you're scraping the web, or if there's a lot of if/else going on, i.e., imperative programming.
R's native functional aspects (all the apply family) and multilevel vector/matrix hierarchical indexing is better built from the ground up for large wrangling of multivariate datasets, in my opinion.
I agree with your critiques of Python... Could you please post some example of code/operations which are very natural in R but unnatural in Python/Pandas? I'm curious to see what I'm missing out on.
Well, I use both, and I can do everything in Python that I can do in R. However here are some things which will give you a flavour of R's more consistent, data-first nature:
> rollapply(some1000x10matrix, 200, function(x) eigen(cov(x))$values[1], by.column = FALSE) # get the first eigenvalue over a rolling 200x10 window.
>>> # impossible in Python unless using ultra-complex Numpy stride tricks.
> dim(someMatrix)
>>> someMatrix.shape
> head(someMatrix)
>>> someMatrix.head() # notice the consistent function application in R, whereas Python mixes attributes and functions. So we're in OO land and I must know whether it's an attribute or a function....
> rollapply(some1000x2matrix, 200, function(x) {linmod <- lm(x[, 1] ~ x[, 2]); last(linmod$residuals) / sd(linmod$residuals)}, by.column = FALSE) # get the z score in one multi-step function.
>>> # impossible in Python without a for loop, as lambdas cannot be multi-statement.
> native indexing using [] brackets by index number, or index value, or boolean. All vectors.
>>> pandas loc/iloc/ix mess.
> native ordered named lists (the analogue of a Python dict) by default, so boolean or index subsetting is easy even when data is hierarchical, not tabular.
>>> easy bugs due to the unordered nature of dicts; you must import a separate module (e.g. collections.OrderedDict) and then still can't vector-index it.
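For what it's worth, the rolling-eigenvalue example above is no longer quite impossible on the Python side; here is a sketch using NumPy's `sliding_window_view` (assuming a 1000x10 matrix and a 200-row window, as in the R call):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 10))

# shape (801, 10, 200): 801 rolling windows, each a 10 x 200 slice of X
windows = sliding_window_view(X, window_shape=200, axis=0)

# largest eigenvalue of each window's 10x10 covariance matrix
# (np.cov treats rows as variables; eigvalsh returns ascending order)
first_eigs = np.array([np.linalg.eigvalsh(np.cov(w))[-1] for w in windows])
```

Still not as terse as `rollapply`, and it is a view-based stride trick under the hood, which rather proves the original point.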
And then there's CRAN. Just last night someone told me about "nowcasting", which uses "MIDAS regression", a relatively new technique. Google it for R (a full package is available); Google it for Python (Matlab comes up ;-).
And I'm not even going to start on graphics. Seaborn and bokeh are valiant efforts, but they're still at 80% of what ggplot and base graphics can do, especially at multidimensional scale. That last 20% is often all the difference between meh and wow. That said, I do appreciate Matplotlib's auto-rescaling of axes when adding data. Python charts aren't as pretty nor as capable of complexity (for similar effort), but they're arguably more dynamic.
Now don't get me wrong. The converse list for Python would be much longer, because it's more general-purpose, and it kills R outside of data science. I wrote 10k LOC in R for a semi-production system and it was horrible, because R does not have the CS tools for managing code complexity, and it really is slow at certain things. R is more focused on iterative, exploratory data science, where it excels.
R _is_ object oriented. But it uses the generic-function style of OO rather than the message-passing style you're probably more familiar with. (Interestingly, Julia also uses generic-function-style OO.)
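Python's standard library actually has a small taste of the generic-function style too; a sketch with `functools.singledispatch` (single dispatch only, so much weaker than R's S4 or Julia's multiple dispatch):

```python
from functools import singledispatch

@singledispatch
def summarize(x):
    # fallback for types with no registered method
    return "some object"

@summarize.register
def _(x: int):
    return "an integer"

@summarize.register
def _(x: list):
    return f"a list of {len(x)} items"

print(summarize(3))       # an integer
print(summarize([1, 2]))  # a list of 2 items
```

The dispatch lives on the function rather than inside a class, which is exactly the generic-function flavour.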
The reason I like R - it just makes data exploration and analysis too damn easy.
You've got R Studio, which is one of the best environments ever for exploring data, visualisation, and it manages all your R packages, projects, and version control effortlessly.
Then you've got the plethora of packages: if you're in any of the following fields (statistics, finance, economics, bioinformatics, and probably a few others), there are packages that instantly make your life easier.
The environment is perfect for data exploration - it saves all the data in your 'environment', allows you to define multiple environments, and your project can be saved at any point, with all the global data intact.
If I want some extra speed, I can create C++ modules from within R Studio, compile and link them, as easily as simply creating a new R script. Fortran is a tiny bit more work, still easy enough however.
Want multicore or to spread tasks over a cluster? R has built-in functions that do that for you, as easy as calling mclapply, parApply, or clusterApply. Heck, you can even write your function in another language, then R handles applying it over however many cores you want.
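The rough Python analogue is `concurrent.futures` (a sketch; `ProcessPoolExecutor` is the drop-in choice for CPU-bound multicore work, threads are used here only to keep the example self-contained):

```python
from concurrent.futures import ThreadPoolExecutor  # ProcessPoolExecutor for true multicore

def work(x):
    return x * x

# roughly what R's mclapply / parLapply give you: a parallel map
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(work, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```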
Want to install and manage packages, update them, create them, etc...? All can be done from R Studio's interface.
Knitr can create markdown/HTML/pdf/MS Word files from R markdown, or you can simply compile everything to a 'notebook' style HTML page.
And all this is done incredibly easily, all from a single package (R Studio) which itself is easy to get and install.
Oh yeah, visualisation, nothing really beats R.
And while there are quirks to the language, for non-programmers this isn't really an obstacle, since they aren't already used to any particular paradigm.
As for Python, I'm sure it's great (I've used it a little), but I really don't see how it can compare. R's entire environment is geared towards data analysis and exploration, towards interfacing with the compiled languages most used for HPC, and running tasks over the hardware you will most likely be using.
I like Python better as a language, but Python's libraries take more work to understand and the APIs aren't very unified. R is much more regular and the documentation is better. Even complicated and obscure machine learning tasks have good support in R. BUT the performance of R can be very, very annoying. Assignment is slow as all hell, and it can often take work to figure out how to rephrase complicated functions in a way that R can execute efficiently. I think being much more functional than Python works well for data. I mean, the L in LISP stands for list! Visualizations are also easier and more intuitive in R, IMO. Especially since half the time you can just wrap some data in "plot" and R will figure out which method it should use.
I think the conclusion of the article is correct. R is more pleasant for mathier type stuff, while Python is the better general-purpose language. If your job involves showing people PowerPoint presentations of the mathematical analysis you've done, you'd probably want to use R. If, on the other hand, you're prototyping data-driven applications, Python would probably be better.
That said, I really like Julia, but can't justify really diving into it at this point. :\
> prototyping data-driven applications, Python would probably be better
I would disagree. Python's libraries are really reimplementing R in Python (Mainly Pandas). I find R to be very flexible and especially in the last 5 years with Hadley Wickham's libraries things are concise and very powerful.
Code in R can look beautiful (even if you don't code in R, I would expect anyone can see what is happening). This is why I disagree that prototyping in Python would be better:
There are now methods in pandas to do pretty much anything, so you can chain them together into one easy-to-read manipulation without lots of intermediate variables.
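For instance, a hypothetical filter/derive/aggregate pipeline can be written as one chain (a sketch with made-up column names):

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B", "B"], "pts": [10, 12, 8, 15]})

summary = (
    df[df["pts"] > 9]                                 # filter rows
      .assign(double_pts=lambda d: d["pts"] * 2)      # derive a column
      .groupby("team", as_index=False)["double_pts"]  # aggregate per team
      .mean()
)
print(summary)
```

Whether this or a dplyr pipe reads better is mostly taste, but the intermediate-variable-free style is available in both.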
If you only have time to learn one language, learn Python, because it's better for non-statistical purposes (I don't think that's very controversial).
If you need cutting-edge or esoteric statistics, use R. If it exists, there is an R implementation, but the major Python packages really only cover the most popular techniques.
If neither of those apply, it's mostly a matter of taste which one you use, and they interact pretty well with each other anyway.
I'd say, if most of your job is analyzing the data yourself and trying to make sense of it, R wins hands down. Particularly if statistical graphics or advanced statistical methods may be needed, but it's still the case even if they won't.
If most of your job is going to be implementing data analysis techniques that you or someone else has done earlier and putting things into production, then Python will quite possibly be more suitable.
"If you only have time to learn one language, learn Python, because it's better for non-statistical purposes (I don't think that's very controversial)."
Actually, it is. When someone has only 3 or 4 years to finish their thesis, learning how to program is secondary at best, and they have to do it in a math-heavy department or field, there is no time for (or use in) learning Python.
R does not mean only esoteric statistics. You have many more utilities in R packages to diagnose and select models. Fitting a model is like 1% of the work; diagnostics are the more important part, and R has much more to offer there than Python ever will.
I have always considered R the best tool for both simple and complex analytics. But, it should not go unmentioned that the features responsible for R's usability often manifest as poor performance. As a result, I have some experience rewriting the underlying C code in other languages. What one finds under the hood is not often pretty. It would be interesting to see a performance comparison between Python and R.
I would say they're both as limited as Python, Julia far more so. R's stats packages get ported to Julia faster, though. Mathematica still can't do mixed generalized linear modeling, and no other language (other than SAS and Stata) has a package for analyzing simple effects within them.
I have found Renjin quite useful in the past, and I love the motivation behind the project. I know that the guys at Bedatadriven hope to improve upon its performance, however it does not always (or often, depending on how you use R) outperform GNU R. Some great changes have been made lately (http://www.renjin.org/blog/2015-06-28-renjin-at-rsummit-2015...), so I hope to see Renjin's performance progress beyond GNU R across the board. I actually contributed Renjin's current PRNG – a Java translation of GNU R's – which was my first experience getting under R's hood.
The Purdue project you linked looks quite interesting. Unfortunately, development appears to have stagnated: https://github.com/allr/purdue-fastr
[edit]
Another important aspect that Renjin contributes is the packages ecosystem: http://packages.renjin.org/
R also has tools to spread tasks over multiple cores or over a cluster quite effortlessly. In practice, I can create a Fortran or C++ module, then use R to apply it over multiple cores, and get fantastic performance for certain tasks.
The one thing that sometimes gets overlooked when people decide whether to use R or Python is how robust the language and libraries are. I've programmed professionally in both, and R is really bad for production environments. The packages (and even language internals sometimes) break fairly often for certain use cases, and doing regression testing on R is not as easy as Python. If you're doing one-off analyses, R is great -- for anything else I'd recommend Python/Pandas/Scikit.
One nice thing about Python is that you can make a piecewise transition from Python -> C, as it is fairly trivial to wrap C code for use in Python. On the other hand, Java's C interface system JNI is pretty much universally reviled.
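A minimal sketch of that C-wrapping story, using only `ctypes` from the standard library to call into the system math library (assumes a Unix-like system where `libm` can be located):

```python
import ctypes
import ctypes.util

# locate and load the C math library
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# declare the C signature: double sqrt(double)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(9.0))  # 3.0
```

For your own C code the workflow is the same: compile to a shared library, load it, declare the signatures, call. Cython and cffi smooth this further.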
Good point, but personally I am thinking about the future of clustered data analysis, and this seems to be a JVM world and Scala seems to be the language of choice. Flink / Storm / Spark etc.
Yes Dask looks good! It's definitely featuring in my "must consider" list, but I must also, for reasons of responsible planning, give a lot of weight to the JVM technologies, with all their corporate backing etc.
I'd love to hear what precise production problems that you're seeing. I know people are successfully deploying R in production, but I'd like to hear more about the challenges.
First let me say thank you for your work on R packages, you've helped a lot of people accomplish some great things!
Unfortunately I can't go into specific details without potentially divulging proprietary information, but broadly most of the issues I've seen in production with R are corner cases involving multithreading with large amounts of allocated RAM (over 100GB), and corner cases involving the data.table package. I've also seen packages that update and break backwards compatibility, although that's less of an issue. The biggest concern we have with R, however, is that the documentation and coding practices for most R packages make small bug fixes difficult without having extensive knowledge of the package code. This is not always true, but it's true enough of the time that we can't afford to maintain much production R code.
For R: (1) instead of `sapply(nba, mean, na.rm = TRUE)`, use `colMeans(nba, na.rm = TRUE)`; (2) instead of `nba[, c("ast", "fg", "trb")]`, use `nba[c("ast", "fg", "trb")]`; (3) instead of `sum(is.na(col)) == 0`, use `!anyNA(col)`; (4) instead of `sample(1:nrow(nba), trainRowCount)`, use `sample(nrow(nba), trainRowCount)`; and (5) instead of tons of code, use `library(XML); readHTMLTable(url, stringsAsFactors = FALSE)`.
Python's main problem is that it's moving in a CS direction and not a data science direction.
The "weekend hack" that was Python, a philosophy carried into 2.x, made it a supremely pragmatic language, which the data scientists love. They want to think algorithms and maths. The language must not get in the way.
3.x is wanting to be serious. It wants to take on Golang, JavaScript, Java. It wants to be taken seriously. Enterprise and Web. There is nothing in 3.x for data scientists other than the fig leaf of the @ operator. It's more complicated to do simple stuff in 3.x. It's more robust from a theoretical point of view, maybe, but it also imposes a cognitive overhead for those people whose minds are already FULL of their algo problems and just want to get from a -> b as easily as possible, without CS purity or implementation elegance putting up barriers to pragmatism (I give you Unicode v ASCII, print() v print, xrange v range, 01 v 1 (the first is an error in 3.x. Why exactly?), focus on concurrency not raw parallelism; the list goes on).
R wants to get things done, and is vectors first. Vectors are what big data typically is all about (if not matrices and tensors). It's an order of magnitude higher dimensionality in the default, canonical data structure. Applies and indexing in R, vector-wise, feels natural. Numpy makes a good effort, but must still operate in a scalar/OO world of its host language, and inconsistencies inevitably creep in, even in Pandas.
As a final point, I'll suggest that R is much closer to the vectorised future, and that even if it is tragically slow, it will train your mind in the first steps towards "thinking parallel".
"data analysis" means differently in R and Python. In R, it's all kinds of statistical analyses. In Python, it's basic statistical analysis plus data mining stuff. There are too many statistical analyses only exist in R.
I work with biologists. They seem to take to R, which seems strange to me. I think some of it is Rstudio the IDE, which shows variables in memory on the side bar; you can click to see them. It makes everything really accessible for those that aren't programmers. It seems to replace Excel use for generating plots.
I've grown to appreciate R, especially its plotting ability (ggplot).
Rstudio is R for a lot of people. I'm a computational biologist in a group. Our PI is trying to get the postdocs to learn R themselves, but it's an uphill battle. I eventually warmed up to it - primarily for the plotting.
But a few weeks back he asked me how to do some kind of data sorting / manipulation in R. My answer was that it was a 10 line Python script and I gave him the code. Alas, he couldn't figure out how to save the script and run it from a command-line.
You shouldn't underestimate how important RStudio is to the popularity of R for non-programmers.
I think some of it is Rstudio the ide, which shows variables in memory on the side bar, you can click to see them
This.
Most programming IDEs show the code but hide the data.
Excel shows the data but hides the code.
RStudio is awesome because it shows both the code and the data.
Language comparisons are equivalent to religion comparisons... you aren't going to find a universal answer or truth; it's an individual/faith sort of thing.
That being said - all the serious math/data people I know love both R and Python...R for the heavy math, Python for the simplicity, glue, and organization.
That is a major difference between these two languages.
Python: There should be one, and preferably only one, obvious way to do things. (Though it may not be obvious at first.)
R: Every author has a different style of doing things, reflecting in the code.
As for the comparison in general: You can call R from within Python. So Python is at least as powerful as R. The rest (BeautifulSoup, Compression, Game development etc.) is icing on the cake.
How so? As someone familiar with Python but not R, I've always been hesitant to jump in. This code was very readable and made me think that it might be a far more accessible language than I'd previously assumed.
One example in the section titled "Split into training and testing sets" would be to use the createDataPartition() function from the caret package for creating training and testing sets.
He says "In R, there are packages to make sampling simpler, but aren’t much more concise than using the built-in sample function" but using caret is more concise.
Added: Later in the section on random forests he says "With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them." Which is why you want to use the caret package as it makes accessing many machine learning packages consistent and easy.
Very picky, but beware of constantly using "set.seed" throughout your R scripts. Always using the same random seed is not necessarily helpful for stats, and it makes the R code look a lot trickier than it needs to be.
In manufacturing Minitab and JMP are used for data analysis (histograms, control charts, DOE analysis, etc.) They are much easier to use and provide helpful tutorials on the actual analysis.
What features or workflow do R or Pandas/Numpy offer to manufacturing that Minitab & JMP can't?
R, Numpy, and Pandas are all FOSS. Probably not much of a practical concern, but it might be preferable in some cases.
I don't know anything about Minitab/JMP scripting myself, but my understanding is that R is generally the most intuitive of all the aforementioned (although that would basically boil down to individual preference).
Really, syntax "nba.head(1)" is not any more "object-oriented" than "head(nba, 1)" -- it's just syntax, and the R statement is in fact an application of R's object system (there are several of them).
IMO, R's system is actually more powerful and intuitive -- e.g. it is fairly straightforward to write a generic function dosomething(x,y) that would dispatch specific code depending on classes of both x and y.
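That double-dispatch idea can be mimicked in Python with an explicit registry keyed on both argument types (a toy sketch with hypothetical names, nowhere near as ergonomic as R's S4 or Julia's methods):

```python
_methods = {}

def method(cx, cy):
    # register an implementation for the (type(x), type(y)) pair
    def wrap(fn):
        _methods[(cx, cy)] = fn
        return fn
    return wrap

def dosomething(x, y):
    # dispatch on the classes of BOTH arguments
    return _methods[(type(x), type(y))](x, y)

@method(int, str)
def _(x, y):
    return y * x

@method(str, int)
def _(x, y):
    return x.upper() * y

print(dosomething(2, "ab"))  # abab
print(dosomething("ab", 2))  # ABAB
```

In R (or Julia) the language maintains this table for you, inheritance included, which is the point being made above.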
That's good to know, thanks :)
Although, for single dispatch, the S3 system of R is kinda hard to beat -- you just name your function print.myclass and you are done :)
In general, if I have to choose between two languages, one of which was designed specifically for statistics and one that is more general, I will choose the more general one.
R's value is in the implementation of its libraries but there is no technical reason a really OCD person couldn't implement such high quality of libraries in Python.
It would be nice to also have some notes about the performance of both languages for each of the tasks compared. I believe pandas would be faster due to its implementation in C. The last time I checked, R was an interpreted language with much of its base library written in R itself.
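A crude micro-benchmark of that interpreted-vs-C gap on the Python side (a sketch; absolute numbers will vary by machine, but the ratio is the point):

```python
import timeit

setup = "import numpy as np; x = np.arange(100_000, dtype=float)"

# C loop inside NumPy vs. an interpreted Python loop over the same data
vec = timeit.timeit("x * 2.0", setup=setup, number=50)
loop = timeit.timeit("[v * 2.0 for v in x]", setup=setup, number=50)

print(f"vectorized: {vec:.4f}s, pure-Python loop: {loop:.4f}s")
```

The same shape of experiment (vectorized primitive vs. interpreted loop) could be run in R with `system.time` for a fair cross-language comparison.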
And like pandas, many of the performance bottlenecks in R have been re-written in C. See dplyr and data.table for packages that solve a similar problem to pandas with similar speed (and for some scenarios they're actually faster!)
Caret is a great package for a lot of utility functions and tuning in R. For example, the sampling example can be done using Caret's createDataPartition which maintains the relative distributions of the target classes and is more 'terse'.
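The stratified idea behind `createDataPartition` can be approximated in pandas with a per-class sample (a sketch; `stratified_split` is a made-up helper, and `groupby(...).sample` needs pandas >= 1.1):

```python
import pandas as pd

def stratified_split(df, target, train_frac=0.8, seed=0):
    # sample within each class so the training set keeps the class balance
    train = df.groupby(target, group_keys=False).sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test

df = pd.DataFrame({"y": ["a"] * 50 + ["b"] * 50, "x": range(100)})
train, test = stratified_split(df, "y")
print(sorted(train["y"].value_counts().to_dict().items()))  # [('a', 40), ('b', 40)]
```

scikit-learn's `train_test_split(..., stratify=y)` does the same thing, but the pandas version shows there is no magic involved.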
I tried to help my wife, who used R in school, only to get quickly lost.
I also attended a ~1 hour R course at university.
To me, R was a waste of time, and I really don't understand why it's so popular in academia. If you already have some programming knowledge, go with Python + Scipy instead.
EDIT: R is even more useless without RStudio, http://www.rstudio.com/. And NO, don't go build a website in R!
Maybe you didn't mean it this way, but to me your comment reads as, basically, "I tried R for an hour and didn't immediately grok it, therefore it is a waste of time."
That may not be what you meant, so I haven't downvoted yet, but it doesn't seem to be an attitude that is helpful for the conversation.
Thanks for your explanation. It seems my ability to communicate is getting worse every year :-/.
What I meant to say was that I helped my wife during her master thesis (~6 months) with R, in addition to spending an hour in one of the classes.
Her teachers were also novices in both R and Excel, and we had several issues with everything from how R processes CSVs to just figuring out the proper syntax to make R do what we wanted.
Sorry if my comment wasn't helpful; I was merely attempting to add some reflections from personal experience to the discussion.
I disagree with R being more useless without RStudio. I'm not a fan of R overall, but I run everything in tmux+vim, and R is the same way. I prefer it to RStudio. It's popular because it makes a few choices that differ from many programming languages in order to be geared towards writing scripts for statistics (e.g. 1-based indexing, assignment).
I'll second the utility of alternative environments to RStudio. For me, I love RStudio, but I spend too much time in Python (and occasionally dabbling in others) to use it all the time. So, for me it's Emacs Speaks Statistics, which is fantastic.
As a side benefit, the first time I tried dabbling in Julia, I was pleasantly surprised to have a familiar mature environment work with it out of the box.