NA Values are not Excluded when Using cor Function in R (Example)
In this tutorial you’ll learn how to exclude NA values when using the cor function in the R programming language.
Let’s get straight to the example.
First, we will need to create an example data matrix with 10 random NAs:
# Create a 10 x 3 matrix of standard normal draws (seed fixed for reproducibility)
set.seed(999)
sample_data <- matrix(rnorm(30), nrow = 10, ncol = 3)

# Replace 10 randomly chosen cells with NA
na_values <- sample(1:length(sample_data), 10)
sample_data[na_values] <- NA

# Label the columns and rows
colnames(sample_data) <- paste("col", 1:ncol(sample_data), sep = "")
rownames(sample_data) <- paste("row", 1:nrow(sample_data), sep = "")

sample_data
#             col1       col2        col3
# row1          NA -0.7602105 -0.02891409
# row2   0.3729390 -1.2028067          NA
# row3          NA  0.7081885  0.46330763
# row4          NA         NA  0.63862129
# row5   0.5190691         NA -0.13322764
# row6   1.0478328 -0.8561538  1.04789141
# row7          NA  0.1950530          NA
# row8  -1.4076432  0.4192383  1.64057417
# row9          NA  0.2887847  0.14849188
# row10         NA  1.4041693 -1.03728957
As you can see, our matrix contains ten rows and three numeric columns, with 10 NA values in total. Only rows 6 and 8 are complete in all three columns.
Let’s try to make a correlation matrix using the cor() function:
cor(sample_data)
#      col1 col2 col3
# col1    1   NA   NA
# col2   NA    1   NA
# col3   NA   NA    1
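By default, the cor() function doesn’t exclude NA values: with the default setting use = “everything”, every correlation that involves a column containing an NA is returned as NA, so we are not getting meaningful results. Let’s see what we can do about it.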
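Before changing how cor() handles NA values, it can help to check how much data is actually missing. The following is a minimal sketch that assumes a small hypothetical matrix m (not the tutorial's sample_data), so that the numbers are easy to verify by hand:

```r
# Hypothetical toy matrix, used for illustration only
m <- cbind(a = c(1, 2, NA, 4, 5),
           b = c(2, 1, 4, NA, 6),
           c = c(5, 3, 4, 2, 1))

# Number of NA values per column
na_per_col <- colSums(is.na(m))
na_per_col

# Number of rows without any NA
n_complete <- sum(complete.cases(m))
n_complete

# Default behavior: every off-diagonal entry involves a column with an NA
default_cor <- cor(m)
default_cor
```

Here, columns a and b each contain one NA, three rows are complete, and all off-diagonal correlations are NA because the default setting propagates missing values.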
Example: Excluding NA Values in cor Function Using “use=” Argument
To get meaningful values in a correlation matrix with NA values, we can apply the “use=” argument inside the cor() function. This is an optional character string that gives us a method for computing covariances in the presence of NA values:
cor(sample_data, use="pairwise.complete.obs")
#            col1       col2       col3
# col1  1.0000000 -0.8899394 -0.6067244
# col2 -0.8899394  1.0000000 -0.4271438
# col3 -0.6067244 -0.4271438  1.0000000
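In this specific example, we have used the option “pairwise.complete.obs”, which computes each correlation from all rows in which both columns of that pair are observed. The other options are “everything” (the default), “all.obs”, “complete.obs”, and “na.or.complete”. Have a look at the help documentation of the cor function (?cor) to get further information on these methods.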
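Because “pairwise.complete.obs” uses a different set of rows for each pair of columns, it is worth checking how many observations stand behind each correlation. The sketch below assumes a small hypothetical matrix m (not the tutorial's sample_data); the names obs and n_pairs are illustrative only:

```r
# Hypothetical toy matrix, used for illustration only
m <- cbind(a = c(1, 2, NA, 4, 5),
           b = c(2, 1, 4, NA, 6),
           c = c(5, 3, 4, 2, 1))

# Pairwise correlations: each pair uses the rows where both columns are observed
pairwise_cor <- cor(m, use = "pairwise.complete.obs")
pairwise_cor

# 1 where a value is observed, 0 where it is NA
obs <- 1 - is.na(m)

# Entry [i, j] counts the rows where columns i and j are both observed
n_pairs <- crossprod(obs)
n_pairs
```

In this toy example, the correlation between a and b is based on three rows, while the correlations that involve c are based on four rows each.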
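Please note that the specification use = “pairwise.complete.obs” can lead to bias, and hence to misleading results: each correlation may be based on a different subset of rows, and the resulting matrix is not guaranteed to be positive semi-definite. In our example, the correlation between col1 and col2 rests on only three rows. You can check how listwise deletion for missing data works before using this specification.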
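For comparison, listwise deletion drops every row that contains at least one NA, so all correlations are based on the same rows. The following sketch again assumes a small hypothetical matrix m (not the tutorial's sample_data) and shows that removing incomplete rows with na.omit() gives the same result as use = "complete.obs":

```r
# Hypothetical toy matrix, used for illustration only
m <- cbind(a = c(1, 2, NA, 4, 5),
           b = c(2, 1, 4, NA, 6),
           c = c(5, 3, 4, 2, 1))

# Listwise deletion: keep only the rows without any NA
m_complete <- na.omit(m)
nrow(m_complete)

# Correlations after manual listwise deletion
cor_listwise <- cor(m_complete)

# The same approach via the use argument
cor_builtin <- cor(m, use = "complete.obs")

# Both approaches should agree
all.equal(cor_listwise, cor_builtin)
```

The price of this consistency is a smaller sample: only three of the five rows remain in the toy data.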
Alternatively, you might impute your missing values to create a data frame that contains only non-NA values before calculating the correlations. You can find more on missing data imputation techniques here.
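As a rough illustration of the imputation idea, the sketch below replaces each NA with the mean of its column before calling cor(). This assumes a small hypothetical matrix m (not the tutorial's sample_data), and mean imputation is only one simple technique chosen here for brevity; it tends to distort correlations, so more careful imputation methods are usually preferable in practice:

```r
# Hypothetical toy matrix, used for illustration only
m <- cbind(a = c(1, 2, NA, 4, 5),
           b = c(2, 1, 4, NA, 6),
           c = c(5, 3, 4, 2, 1))

# Simple mean imputation, column by column (an illustrative choice)
imputed <- m
for (j in seq_len(ncol(imputed))) {
  missing <- is.na(imputed[, j])
  imputed[missing, j] <- mean(imputed[, j], na.rm = TRUE)
}

# No NA values remain, so the default cor() call returns numeric results
anyNA(imputed)
imputed_cor <- cor(imputed)
imputed_cor
```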
Video, Further Resources & Summary
You could also have a look at some of the other tutorials on Statistics Globe:
- R NA – What are
- Remove NA Columns from xts Time Series in R
- Remove Columns with Duplicate Names from Data Frame in R
- NA Omit in R | 3 Example Codes for na.omit
This post has shown how to exclude NA values when using the cor() function in R. In case you have further questions, you may leave a comment below.
This page was created in collaboration with Paula Villasante Soriano. Please have a look at Paula’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.