Tidyverse Troubles

1 Introduction

So, this article is about the tidyverse, a set of packages for the R statistical programming language. These packages provide a really good API for interactive data analysis.

I remember the first time I read the dplyr vignettes (late 2014, early 2015); I was completely blown away. The very notion of avoiding both incredibly deep function call nesting and the need to name far too many local variables was an absolute revelation.

The use of non-standard evaluation (NSE) and clever scoping rules made it an absolute no-brainer to switch to tidyverse (well, dplyr at least) tools for much of my analysis.

original_version <- head(filter(prop_df, county == "Dublin", price < 1e6,
                                str_detect(postal_code, "Dub")))

tidyverse_version <- filter(prop_df, county == "Dublin", price < 1e6,
                            str_detect(postal_code, "Dub")) %>% head()

So, above we can see the original version and the tidyverse version. In a simple example like this, there's not a huge difference between them. However, when the pipe chains get longer, things can get really annoying.
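To give a flavour of what a longer chain looks like, here's a hedged sketch on the same hypothetical prop_df (the floor_area column and the summary logic are invented for illustration):

prop_summary <- prop_df %>%
  filter(county == "Dublin", price < 1e6) %>%
  mutate(price_per_sqm = price / floor_area) %>%  # floor_area is invented
  group_by(postal_code) %>%
  summarise(median_ppsqm = median(price_per_sqm)) %>%
  arrange(desc(median_ppsqm)) %>%
  head()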

2 The Problem

So, while judicious use of the tidyverse and the pipe operator is completely OK, the real problems started to arise after this style of programming became popular.

In particular, there are now many R users who are completely unfamiliar with base R and the way we used to handle scoping and accessing columns in a data frame.

For example, compare the two equivalent lines below.

# base R: with() evaluates the expression using the columns of myres
myres$newvar <- with(myres, old_var1 * sd(old_var2))

# tidyverse: the dplyr equivalent using mutate()
myres <- mutate(myres, newvar = old_var1 * sd(old_var2))

I showed something similar at a talk some years back, and I needed to explain the first line: many of the R users in the audience were completely unfamiliar with it.

This is a real problem for a bunch of reasons:

  1. R is a good language out of the box, but many people seem to ignore base R
  2. The tidyverse packages are evolving pretty quickly, which means that you'll need to update your code more often
  3. Many R scripts run in production for a long time

If you write those long-lived scripts with tidyverse tools, the likelihood is that you'll need to update them much more often than if you had used base tools; the sketch below gives a flavour of this churn.
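As a hedged illustration (the version boundaries are approximate, but each spelling below has been the idiomatic one at some point): summarising every column of some data frame df has been spelled several different ways over dplyr's lifetime, while the base R equivalent has not changed.

# base R: stable for decades
col_means <- sapply(df, mean)

# dplyr: the idiomatic spelling has changed over the years
df %>% summarise_each(funs(mean))             # early dplyr; now deprecated
df %>% summarise_all(mean)                    # later dplyr
df %>% summarise(across(everything(), mean))  # recent dplyr (across())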

Additionally, the style of building pipes and using NSE pervasively, while useful for a data-analytic workflow, is a really bad idea if you want to turn the code into functions and/or run the code in production.

2.1 Down the drain-pipe

The biggest problem that I have with the pipe-focused style (as popularised by the tidyverse tools) is that it tends to get everywhere.

In my last two gigs, I've dealt with lots and lots of legacy R code. While the general principles around legacy code are language independent, the pipe adds an extra layer of complexity for not a lot of reward.

In many cases, I have seen entire (complicated) functions expressed as a single pipe transform. This is fine, as long as nothing ever breaks. Unfortunately, due to the lack of API stability within the tidyverse, stuff always breaks.

Now, all software breaks, which is why we have good tools to deal with it. In R, I am a big fan of the trace function, which is called as follows:

trace(my_buggy_function, browser)

This has the exact same effect as calling browser at the start of the function, in that you can step through each line of code using a handful of commands 1. The important part here is that we can step through each line and examine the intermediate variables. However, when a function is defined entirely as a single pipe, this is essentially impossible. Even if you step into the pipe, it's very difficult to pull out the local variables passed to each stage.

So, inevitably, what I end up doing is hacking apart pipes into (hopefully) well-named intermediate variables, as in the sketch below.
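A hypothetical before-and-after; the function and column names (sales_df, amount, fx_rate, region) are invented for illustration:

# before: the whole function is a single pipe, so browser() can't show
# you what the data looked like between the steps
regional_totals <- function(sales_df) {
  sales_df %>%
    filter(!is.na(amount)) %>%
    mutate(amount_eur = amount * fx_rate) %>%
    group_by(region) %>%
    summarise(total_eur = sum(amount_eur))
}

# after: each stage has a name, so trace(regional_totals, browser) lets
# you step through and inspect every intermediate data frame
regional_totals <- function(sales_df) {
  valid_sales  <- filter(sales_df, !is.na(amount))
  sales_in_eur <- mutate(valid_sales, amount_eur = amount * fx_rate)
  by_region    <- group_by(sales_in_eur, region)
  summarise(by_region, total_eur = sum(amount_eur))
}

Which leads me nicely on to my next point.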

2.2 Naming is important

There are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

Names are incredibly important when programming. If you don't believe me, then ask Martin Fowler. In general, a good name (for a function or variable) reduces the cognitive load required to understand a program. Conversely, bad names have negative value, and can considerably lengthen your debugging time.

The tidyverse/pipe style encourages multiple transforms to be processed in one line, or at least in one set of piped function calls. This can be good if the parts are independent and appropriately named, but it can be incredibly bad when they're not.

In one particularly egregious case, I saw something like the following:

result <- bind_rows(some %>% complicated %>% pipe %>% logic,
                    more %>% complicated %>% logic,
                    ...)  # plus ~100 more lines of business logic

This is just a terrible idea for a number of reasons.

Firstly, if each intermediate part were well named, then it would be easier for the reader to grasp the conceptual structure (which is super-important if you need to change stuff).

Secondly, the control flow reverses inside the pipes: the bind_rows call is read inside-out (effectively right to left), while the pipes are read left to right. This is incredibly confusing, and the worst part is that the code becomes much harder to modify than it needs to be, with essentially no benefit except avoiding the need to name some local variables (which is not really a benefit at all). A version with named intermediates, sketched below, is far easier to follow.
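For contrast, a sketch of the same structure with named intermediates; the names are invented, and the pipe stages are still the pseudocode from above:

current_batch <- some %>% complicated %>% pipe %>% logic
other_batch   <- more %>% complicated %>% logic
# ...each remaining chunk of business logic gets its own well-named variable...

result <- bind_rows(current_batch, other_batch, ...)

Now every piece can be inspected and tested on its own, and each name documents the business concept it represents.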

2.3 Programming with tidyverse and pipes

The final problem that I have with the tidyverse/pipe style is that it makes programming needlessly difficult, which means that it encourages code-repetition and other bad practices.

The core issue here is the trade-off between a programming environment and a data analysis environment. When analysing data, the ability to quickly edit my previous pipes and pipe one result into three different downstream transforms is really useful.

And in this situation, all the NSE stuff is fine because you're normally working on bespoke analysis, and the cleaning, aggregating and grouping is pretty easy to do and doesn't really benefit from functionalisation.

However, as we all know, there's nothing so permanent as a temporary solution. One day, your random explorations may be made part of some production work (normally analytics, but this is even more critical to get right). And that's where the problems begin.

So, there is a way to program effectively with NSE/tidyverse. Right now, the best practice is rlang, while a few years back it was lazyeval 2. It is incredibly problematic that there is no easy on-ramp from interactive tidyverse use to programming with it, but this is currently the case. In general, even the simple things (like calling group_by on a passed-in variable) are pretty annoying to do and require you to look like a crazy person in your code (as if the last thing R needed was more unpronounceable operators like !! (or is that !!!? I always forget)).
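To make that concrete, here's a minimal sketch of "group_by a passed-in variable" in the current rlang style; the helper and its argument names are hypothetical:

library(dplyr)  # re-exports the tidy eval tools (enquo, !!) from rlang

# mean_by: group df by one passed-in column and average another
mean_by <- function(df, group_var, value_var) {
  group_var <- enquo(group_var)
  value_var <- enquo(value_var)
  df %>%
    group_by(!!group_var) %>%
    summarise(mean_value = mean(!!value_var, na.rm = TRUE))
}

# usage: mean_by(mtcars, cyl, mpg)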

This retards the natural growth of a programmer and data scientist. In many ways, the tidyverse is a blocker on the development of better tools for R users as it's created a pretty large divide between users of the language and creators of the language.

3 Conclusion

I'm not against the tidyverse in general (in fact, I use it almost every day). However, it's designed for interactive data analysis, and it should be avoided when building functions and more production-like systems.

Footnotes:

1

I did a presentation on this a few years back, which can be found here.

2

I'm still bitter at how much work I needed to do to understand lazyeval after it had been deprecated, in order to update some legacy code.

Author: richie

Created: 2020-05-13 Wed 13:03
