I've been using R nonstop for pretty much 5+ years. I'm happy that there's established competition coming from Python and new competition coming from Julia. Having these languages compete over similar types of programmers pushes each one to be better, which is awesome. I'm not a die-hard R person, I'd be more than happy to switch under the right circumstances.
But...I think one thing gets overlooked way too often. For "data scientists" or "statisticians" or [insert new term here], the majority of our non-modeling time is spent on just plain old data wrangling. To me, R is unbeatable here. I tried Python ~2 years ago and pre-1.0 Julia.
Using tidyverse you can do pretty much anything to any dataset, often *without a monstrous amount of keystrokes*. (The pipe syntax is awesome). If you really need speed you can always switch over to data.table for uglier but faster code. I really tried but I could never replicate the "brain cycles to keystrokes" speed of R in Python/Julia. That is, being able to intuitively and quickly just convert my thoughts into readable data wrangling code.
Sure the base R language is not that "fast" and Julia/Python benchmarks are way faster. But in practice this doesn't matter to me. Most of the performance sensitive packages are written in C/C++/Fortran anyway (rstan, brms, glmnet, caret). I don't care that I could write 3x faster loops. The extra 5 seconds for that one piece of code doesn't make up for the absence of a good data wrangling ecosystem.
My message to the Julia team: You can get a very large portion of the R userbase to switch over if you focus on a Julia version of the tidyverse (especially dplyr). I know that DataFrames.jl exists but it just doesn't even come close. There's a difference between "you can do this in Julia too" and "here's a clean/intuitive way to do this better without extra baggage".
I'm sorry if the above seems harsh. I genuinely appreciate the Julia team's efforts. I can only imagine how hard it is to create a new language. I just wanted to be honest.
I deeply loathe R for its terrible type idiosyncrasies, syntax, and slowness.
However, even I must admit that it is incredibly good at what it was meant to do - analyse and display data. (And yes, the tidyverse is a huge improvement to the syntax, although it's telling that they basically reinvented the language to do so.)
As an ecological modeller, I create my actual simulation models in Julia, because it is a much, much better language for any real programming. But I still analyse the output in R.
I don't understand how people can loathe R. If you take a functional approach, especially using pipes, dplyr and a split-apply-combine style, it is quite beautiful. Much nicer than trying to, say, divide a time period by an integer in Go.
> If you take a functional approach, especially using pipes, dplyr and a split, apply, combine style, it is quite beautiful
Sure, but what if you don't? Sometimes, this is the right way to do things, other times there are other approaches that are more natural/beautiful. In many cases, a loop with conditionals is much easier to understand.
I use a lot of R, and like many aspects of it. But the fact that `f(stop("Hi!"))` may or may not throw an error depending on the internals of `f` is a little maddening. (And there are tons of similar issues.)
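The `f(stop("Hi!"))` gotcha comes from R's lazy argument evaluation: the error only fires if `f` actually forces its argument. Python is eager, but a rough analogue using explicit thunks (zero-argument callables; all names below are made up for illustration) shows the same "may or may not throw" behaviour:

```python
# Rough Python analogue of R's lazy argument evaluation.
# In R, f(stop("Hi!")) errors only if f forces its argument;
# here we simulate that by passing a thunk instead of a value.

def boom():
    raise RuntimeError("Hi!")

def uses_arg(thunk):
    return thunk()          # forces the "argument" -> the error fires

def ignores_arg(thunk):
    return "never looked"   # never forces it -> no error at all

print(ignores_arg(boom))    # fine: prints "never looked"
try:
    uses_arg(boom)
except RuntimeError as e:
    print("error:", e)      # error: Hi!
```

So whether a call site blows up depends on the callee's internals, which is exactly what makes it hard to reason about from the outside.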
When it comes to data wrangling, one huge advantage of Julia over tidyverse/R dataframes/Pandas is that you can write a damn for loop and it won't be brutally slow.
It's so much simpler and faster to use a loop that says "pick this row only if this and that and this other thing are sometimes true" vs having to construct an algebra of column filters to do the same.
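To make the contrast concrete, here's a small sketch with plain Python rows (the data and fields are hypothetical); in Julia the loop form runs at native speed, whereas in R/Pandas you are pushed toward the mask/filter form for performance:

```python
# Hypothetical rows; in a dataframe library these would be columns.
rows = [
    {"age": 34, "active": True,  "score": 0.9},
    {"age": 17, "active": True,  "score": 0.4},
    {"age": 52, "active": False, "score": 0.8},
    {"age": 41, "active": True,  "score": 0.7},
]

# Loop style: the condition reads exactly like the sentence
# "pick this row only if this and that and this other thing".
picked = []
for r in rows:
    if r["age"] >= 18 and r["active"] and r["score"] > 0.5:
        picked.append(r)

# Filter style: the same predicate as a single expression,
# closer to what dplyr::filter or a boolean mask expresses.
picked2 = [r for r in rows
           if r["age"] >= 18 and r["active"] and r["score"] > 0.5]

assert picked == picked2  # both select the first and last rows
```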
I think that is absolutely a fair criticism. Personally, I rarely run into an issue where I absolutely am bottlenecked by a slow loop. But this sort of thing drew me to Julia in the first place.
There was also an R update in ~2017 that introduced some JIT speed-ups for loops, which made a noticeable difference.
If this is a problem you run into often, I suggest converting your object to a data.table. You can pass a function row-wise over the object very quickly.
I think loops are not ideal for data analysis. They are prone to human error, especially ones that modify the data, and in a way that can be hard to sort out (e.g. iterating over the dimensions of the wrong object). A stepwise creation of new logical fields using mutate, followed by a vectorised ifelse command, is more robust, and you can clearly see the steps of the logic.
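The mutate-then-ifelse pattern can be sketched in plain Python using lists as stand-in columns (field names and data are hypothetical):

```python
# Sketch of the "stepwise logical fields, then a vectorised ifelse"
# pattern (dplyr's mutate + ifelse), using plain lists as columns.
age    = [34, 17, 52, 41]
active = [True, True, False, True]

# Step 1: build each piece of logic as its own named flag,
# so every intermediate step can be inspected on its own.
is_adult = [a >= 18 for a in age]
eligible = [ad and ac for ad, ac in zip(is_adult, active)]

# Step 2: a single ifelse over the combined flag.
label = ["keep" if e else "drop" for e in eligible]

print(label)  # ['keep', 'drop', 'drop', 'keep']
```

Because each flag is a named column rather than a condition buried inside a loop body, a mistake in the logic is visible in the intermediate results.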
I mean, I get your point.
Julia suffers a bit from the Lisp Curse:
http://winestockwebdesign.com/Essays/Lisp_Curse.html
Writing a performant and easy to use data wrangling library for R is a bunch of work and means dealing with C/C++ etc.
So few people are willing to do so, and just contribute to a small number of libraries like dplyr.
(I feel like there are at least 2 other major competitors to that in R?)
Whereas in Julia it's really easy to write a new data wrangling library.
It's just not that much work. So people:
A) do it just for fun / student projects (none of the major ones are that, though), or
B) do it because they have a nontrivially resolvable difference of opinion (e.g. Queryverse has a marginally more performant but marginally harder-to-use system for missing data).
The nice thing about Julia, especially for tabular data (thanks to Tables.jl), is that everything works together.
It's actually completely possible to mix and match all of those libraries in a single data processing pipeline.
While that's generally a weird thing to do, it does mean that if an external package uses any of them, it plugs into a pipeline built on another.
(One common case: Queryverse has CSVFiles.jl, but CSV.jl is generally faster, and you can just swap one for the other inside a Query.jl pipeline.)
I absolutely agree this makes learning harder.
---
Also that particular example:
> "I need pipes to help me wrangle data more efficiently do I use Base Julia, Chain.jl, Pipe.jl, or Lazy.jl?"
It's piping.
Something would have to be massively screwed up if any of those options were more or less efficient than the others.
The only question is what semantics do you want.
Each is pretty opinionated about how piping should look.
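That point generalizes: a pipe is just sugar for nested function application, so the libraries can only differ in semantics and syntax, not in any meaningful runtime cost. A minimal sketch of what a pipe desugars to (the `pipe` helper below is hypothetical, not any of the libraries named above):

```python
from functools import reduce

def pipe(value, *fns):
    """x |> f |> g desugars to g(f(x)); the pipe itself adds no real work."""
    return reduce(lambda acc, fn: fn(acc), fns, value)

double = lambda x: x * 2
inc    = lambda x: x + 1

# 5 |> double |> inc  is exactly  inc(double(5))
assert pipe(5, double, inc) == inc(double(5)) == 11
```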
The Lisp Curse was written by a then-inexperienced web developer with (then, and likely still) zero Lisp experience, based on extrapolating something he read about Lisp in an essay by Mark Tarver. He prefers it not be submitted to HN due to the embarrassment, yet for some reason keeps the article up (probably because it generates traffic).
Yeah, NSE (non-standard evaluation) is really annoying to work with in dplyr/tidyverse codebases, and this definitely inhibits people from building on top of them.
They are an 80% solution for a lot of data analytic needs, but base-R is 100% the right choice if you want your code to run for a long time without needing updates.
I've never really gotten into data.table for some reason, normally dplyr is fast enough, or I'm using something more efficient than R.
What a constructive, positive, down-to-earth, well-written comment, and what a nice reprieve from everything that's broken about the tone of web discussions these days. You point out that there's still another player in this space (R), but not in a way that's whiny, dismissive, or doctrinaire, and you celebrate the healthy competition. You suggest a streamlined path toward Julia ecosystem maturity, rooted in real-world needs. Nicely done!
I have no real dog in this fight, but I hope Julia team members (and/or aspiring Julia ecosystem contributors) will read and consider your point.
This whole thread seems to be quite civilized. I can see no name-calling or off-topic rants, only a frank exchange of opinions, mixed in with some facts.
Your post seems to indicate that there is some sort of 'fight' going on, or that the tone is broken. I disagree. If most web discussions were like this one, we would have fewer problems in this world.
Oh, that's exactly what I mean -- when I say "everything that's broken about the tone of web discussions these days", I'm talking about threads and topics other than this one. I don't see any 'fight' here, and that's what's so refreshing.
All right! I got the impression you were contrasting that particular post with the rest of this discussion, but apparently not. Still slightly confused here. Oh well, carry on.
Another big area where R has an edge over Python (and I guess Julia, though I'm not sure) is making quick yet presentable plots of data that contain different factors you want to show together. The matplotlib equivalent requires tracking different indices and manually adding layers for each one.
I've worked with R and Python over the last 3 years, but I've been learning and dabbling with Julia since 0.6. With the availability of [PyCall.jl] and [RCall.jl], the transition to Julia can already be easier for Python/R users.
I agree that most of the time data wrangling is super comfortable in R, due to the syntax flexibility exploited by the big packages (tidyverse/data.table/etc). At the same time, Julia and R share a bigger heritage of Lisp influence than Python does, because R is also a Lisp-ish language (see [Advanced R, Metaprogramming]). My main gripe with the R ecosystem is not that most of the performance-sensitive packages are written in C/C++/Fortran, but that they are so deeply interconnected with the R environment that porting them to Julia (which also provides an easy, good interface to C/C++/Fortran, and more; see the [Julia Interop] repo) seems impossible for some of them.
I also think Julia reaches a broader scientific programming public than R does: it overlaps with Python sometimes, but it also offers the Matlab/Octave public a better alternative. I don't expect to see all the habits from those communities merge into the Julia ecosystem. On the other hand, I think Julia's bigger reach will help it avoid falling into the "base" vs "tidyverse" vs "something else in-between" split that R is in now.
Out of curiosity, when was the last time you looked at DataFrames.jl? A huge amount has happened in the last year. Plus, if you want more tidy-like syntax, you can go with Query.jl (or DataFramesMeta.jl, though that isn't quite finished updating to the new DataFrames syntax), or if you just want pipes on DataFrame operations, there's Pipe.jl and Chain.jl.
I don't think your comments are harsh, you need what you need and you like what you like. I do mostly data wrangling too, but feel much less constrained with Julia than with tidyr. Sometimes having constraints and one right way to do things is good, but it's not for me.
Also worth noting it's not necessarily on the language developers to do this. Even in R, tidyverse is in packages, not in the base language.
My experience with R was somewhat different. R was my first computational language in 2006 (version 2.3, IIRC), and parsing real life data (biological, in my case) into a format acceptable to R was a non-trivial exercise. I had somebody write me a perl script to parse the raw data into a clean CSV, but that has its own problems. The tools that were the kernel of the tidyverse (created 2014) were just beginning to show up, and even magrittr pipes were many years away. The only tidyverse tool even close to mature at the time was ggplot. For me data munging was the limiting factor, and at some point I discovered many people prefer Python for these initial steps. In 2013 I learnt Python with the explicit aim of data munging, while continuing analyses in R. With Pandas I could cover 80% of my use case for R, and eventually dropped it completely. Again, this predates the creation of the tidyverse, which I noted with some irony.
For what it's worth, Hadley Wickham was asked in a Reddit AMA several years ago which platform he'd choose if he were just starting out. He pointed to Julia as his pick.
> My message to the Julia team: You can get a very large portion of the R userbase to switch over if you focus on a Julia version of the tidyverse (especially dplyr).
If we removed dplyr, then R scripts would absolutely scream, so I find the speed argument for "why switch to X" unconvincing. If users cared so deeply about speed, almost no one would be using the tidyverse; instead we'd all be using base R or data.table.
Multiple dispatch? Hmm, is this really a problem I'm going to come across in the real world, when 90% of our time is spent ingesting a poorly-formatted CSV, doing some quick plots, and perhaps building a model to test something out? If the goal of Julia is to replace R/Python, then their priorities feel way off the mark.
> If the goal of Julia is to replace R/Python then their priorities feel way off the mark
There's a lot more to scientific computing than wrangling tabular data. Julia is competing in that overall space with R/Python/Fortran/Java/C++. If R or Pandas is better at data wrangling, then Julia won't win out there. But so be it. No PL is best at everything.
> There's a lot more to scientific computing than wrangling tabular data.
Also a point that gets ignored way too often. My original post differentiated between time spent writing models and time spent data wrangling.
I would never even attempt to write a symplectic integrator in base R (OK maybe Rcpp would be fine but that's not really "R"). Julia, by design, is better at that. But the R ecosystem is so good that I can use the best practical implementation of a symplectic integrator to solve common modeling problems via RStan.
Yes, Stan is a standalone framework that can be accessed from Julia as well. But the following workflow can be done in R much easier:
1) Read in badly formatted CSV data
2) Wrangle the data into a usable form
3) Do some basic exploratory analysis (including plots)
4) Write several models in brms/raw Stan (via rstan)
5) Simulate from the priors and reset them to more sensible values
6) Run the model over the data to generate the posterior
7) Plot/run posterior predictive checks, counterfactual analysis, outlier analysis (PSIS or WAIC), etc.
Again, the above represents my common use case. I fully appreciate that people use Julia to do awesome stuff like "the exploration of chaos and nonlinear dynamics." [0]. I understand that the modern R ecosystem isn't really built for this.
Totally agree there. It is not a replacement and it is trying to solve a different problem. I don't believe Julia contributors are lying awake at night upset that other languages exist and feel they need to put a stop to that. My point (put across clumsily, I see) is that IF that were their goal, then they are going about it the wrong way, as most R/Python users have different priorities. But it is a moot point, as that would be an absurd motivation to create a whole new language.
> is this really a problem that I'm going to come across in the real-world when 90% of our time is spent ingesting a poorly-formatted csv, doing some quick plots and perhaps building a model to test something out
Yes, multiple dispatch is not some highfalutin ivory tower concept that only comes up in specialized code. For example, the model in question could define custom plotting recipes[1] so that you can just call plot() and have it produce something useful.
Also, why shouldn't dplyr perform comparably to data.table? There would be no need for a fragmented library ecosystem here if the abstractions the tidyverse is built upon were lower-cost. Moreover, what if my data isn't CSV, or isn't in a table-like shape at all? "Real world" does not mean the same thing across different domains.
> Yes, multiple dispatch is not some highfalutin ivory tower concept that only comes up in specialized code. For example, the model in question could define custom plotting recipes[1] so that you can just call plot() and have it produce something useful.
This is literally the whole conception behind generic functions in R (print, plot, summary etc).
I agree it's great, but Julia is building on a lot of prior art here.
For sure, and one would be remiss not to mention Dylan, CL/CLOS and Clojure here as well. My quibble was with the claim that multiple dispatch rarely shows up in practice, which you've pretty clearly shown is not the case in R!
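For readers coming from Python, the same idea exists in the standard library as `functools.singledispatch`. It dispatches on the type of the first argument only (single dispatch, not Julia-style multiple dispatch), but it illustrates the mechanism behind R's generic `plot()`/`print()`/`summary()` functions; the `describe` generic below is a made-up example:

```python
# Illustrative sketch: a generic function whose behaviour is chosen
# by the type of its argument, like R's S3 generics (single dispatch).
from functools import singledispatch

@singledispatch
def describe(obj):
    return "some object"            # fallback method

@describe.register
def _(obj: int):
    return f"an integer: {obj}"     # method for integers

@describe.register
def _(obj: list):
    return f"a list of {len(obj)} items"  # method for lists

print(describe(3))       # an integer: 3
print(describe([1, 2]))  # a list of 2 items
print(describe(2.5))     # some object
```

Julia generalizes this so that all argument types participate in dispatch, which is what makes things like plotting recipes compose so well.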
'highfalutin ivory tower' is a great name for a band :D
Naturally you are correct, and I am wrong to dismiss it as unimportant. What I'm saying is that the majority of R/Python users today are not looking for ultimate speed or sophisticated programming paradigms. Most users are doing the unsexy bread and butter of "take some tabular data -> analyse it -> report on it", and I want to dismiss the argument of "users will migrate to Julia because of these nifty features" because it ignores the very reasons existing users use these tools in the first place. It would be as absurd as proclaiming Excel users will switch to Python because the accounts department suddenly cares about NLP.