Modes & the Origins of the stackoverflow R Package

 

As I was reading the recent post on Mode Imputation, I was happy to see an old favorite piece of R poetry:

val[which.max(tabulate(match(...)))]

I’ll discuss why I recognized this idiom, and how it works.

Neal Fultz R Package Developer & Programmer

Note: This article was created in collaboration with Neal Fultz. Neal is a statistical scientist at the University of California, Los Angeles, and the owner of njnm consulting. You may find more information about Neal on his profile page.

 

Background

I first came across the idiom in my first job out of grad school – I’d been hired on a small team at a large company that had just successfully finished a busy season. By the time I actually started, it was a bit slow. I spent a lot of time on stackoverflow.com curating answers – in R, there are often many valid ways of solving a problem, depending on whether you are using base R, dplyr, data.table, or other packages.

I spent so much time on Stack Overflow that they started showing me job ads in the sidebar, and eventually I clicked on one, which led to an interview and eventually a new job at a startup using R extensively.

At my new job, I recognized the “Mode Idiom” in the codebase, and as I read more and more, I found several other functions borrowed from Stack Overflow. Code on Stack Overflow is licensed under a Creative Commons license, which allows you to reuse it as long as you cite it correctly – easy enough to do, just add a comment. It’s sloppy not to, and if you don’t follow the terms of the license, it’s technically copyright infringement.

 

I dreamed that one day one of my SO answers would be copy-pasted into a large project by a big company, say an Oracle or Microsoft, and that I would then sue for copyright infringement and get a large settlement. It would be like hitting the lottery, in both the slim chances and the payout.

Anyway, I shared this all with my friend from the LA R User Group, Eduardo, who was at that time doing technical due diligence around the purchase of a different startup. He was like, oh, that’s a good thing to check for, and then a couple of weeks later, he told me he’d found substantial plagiarism in their code base and had knocked off 10% of their valuation.

 

The stackoverflow Package

At that point, I got paranoid and decided to clean up everything, and move all the SO functions into their own package, and so the stackoverflow package was born. I had taken an IP Law class when I was at Berkeley, and while much of it was amusing cases about video games, it also instilled a healthy respect for copyright.

With the code completely separated and packaged, developers no longer need to copy-paste, and the lawyers no longer need to worry about viral software licenses.

You can install it from CRAN:

install.packages("stackoverflow")

Or install the development version:

remotes::install_github("nfultz/stackoverflow")

In addition to the Mode function, it also has helper functions for programming (e.g. zip and enumerate from Python) and odds and ends like Tarone’s Z test and sampling from a conditional Weibull distribution.

If you find something interesting on SO, please email me at neal@njnm.co and I can add it to the package.

 

Calculating the Mode

In March 2010, user Nick asked “How to find the statistical mode?” The question got 36 answers and has been viewed over 330,000 times.

If you review the different answers, you will see a variety of different strategies for solving this problem.

Ken Williams posted the answer below, which I consider the most beautiful:

Mode <- function(x) { 
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))] 
}

This idiom for finding the most common element in a vector has three main advantages that make it my favorite:

  • Entirely in base R, so no extra packages are required. Many of the other answers required other packages like data.table or dplyr.
  • Extremely fast, can be orders of magnitude faster than naive solutions. Answers that sort the data can scale poorly.
  • Works for both character and numeric data. Some other answers don’t work for character data.
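As a quick check of the last point, here is a minimal sketch that applies Ken Williams’ function to one numeric and one character vector. The sample vectors are made up for illustration, and the table()-based cross-check is just one of the slower alternatives, not a method taken from any particular answer.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

nums  <- c(3, 1, 2, 3, 2, 3)          # illustrative numeric data
chars <- c("b", "a", "b", "c", "b")   # illustrative character data

num_mode  <- Mode(nums)    # 3
char_mode <- Mode(chars)   # "b"

# Cross-check with a table()-based approach, which sorts the unique values
table_mode <- names(which.max(table(chars)))  # "b"
```

Note that the table()-based version returns the name of the winning cell, so it always yields a character string, whereas Mode returns a value of the same type as its input.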

It works by composing several basic functions:

  • unique(x) – returns the unique elements of x by removing duplicates
    > x <- c('A', 'B', 'C', 'A')
    > unique(x)
    [1] "A" "B" "C"
  • match(x, ux) – returns the position of each element of x in the table ux
    > match(x, unique(x))
    [1] 1 2 3 1
  • tabulate() – the engine underneath the table function, but it only counts positive integers
    > tabulate(c(1,2,3,1))
    [1] 2 1 1
  • which.max() – finds the position of the maximum without sorting.
    > which.max(c(2,1,1))
    [1] 1

Taken all together, it first converts the input to an integer code, similar to factor coding, then uses tabulate to count how often each code occurs, and finally uses which.max to find the largest count without sorting. Each of these underlying functions is very simple, but that also makes them very fast.
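The steps can be traced one at a time. The sketch below stores each intermediate result under a name chosen here for illustration (codes, counts, winner), then shows what happens on a tie: because which.max returns the position of the first maximum, the value that appears first in the input wins.

```r
x <- c("A", "B", "C", "A")

ux     <- unique(x)         # "A" "B" "C"
codes  <- match(x, ux)      # 1 2 3 1  (integer code for each element)
counts <- tabulate(codes)   # 2 1 1    (how often each code occurs)
winner <- which.max(counts) # 1        (position of the largest count)
result <- ux[winner]        # "A"

# Tie: "B" and "A" both occur twice, and "B" appears first in the input
tied <- c("B", "A", "A", "B")
tied_ux     <- unique(tied)                                   # "B" "A"
tied_result <- tied_ux[which.max(tabulate(match(tied, tied_ux)))]  # "B"
```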

In the stackoverflow package, it’s been consolidated even further, with an optional ux argument for cases when the unique values are known ahead of time, such as demographics fields:

Mode <- function(x, ux=unique(x)) ux[which.max(tabulate(match(x, ux)))]

Stylistically, it’s a bit dense, but I like that it can do so much in a single line. Also note that it’s named capital-M Mode, because lower-case mode is already used in base R to get or set the storage mode (type) of an object.
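A short sketch of both points, assuming a hypothetical demographic field called region whose possible values (region_levels) are known in advance. Neither name comes from the package; they are only here for illustration.

```r
Mode <- function(x, ux = unique(x)) ux[which.max(tabulate(match(x, ux)))]

# Hypothetical field with a known set of categories
region_levels <- c("North", "South", "East", "West")
region        <- c("South", "West", "South", "East")

default_mode  <- Mode(region)                      # "South", ux computed on the fly
supplied_mode <- Mode(region, ux = region_levels)  # "South", ux passed in

# Base R's lower-case mode() answers a different question entirely
storage <- mode(region)  # "character"
```

Passing ux skips the call to unique(), which can help when the same known set of values is reused across many calls.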

 
