Difference Between Base R subset & dplyr filter in R

In this tutorial, we will explore the differences between the base R subset() function and the filter() function of the dplyr package.

The dplyr package is a powerful and widely-used tool for data manipulation in R.

Both the subset() and filter() functions select the rows of a data frame that meet specified conditions.

We will discuss each function in detail, provide examples, and describe the output of each code snippet. Finally, we will conclude with a comparison of these two functions.

The table of contents is shown below:

1) Introduction to the Subset Function

2) Introduction to the Filter Function

3) Conclusion

4) Further Resources

Let’s dive into it!

Introduction to the Subset Function

The subset() function is part of base R and returns the rows (and, optionally, the columns) of a data frame that meet specified conditions. The syntax of the subset() function for data frames is as follows:

subset(x, subset, select, drop = FALSE)

Where:

x: The data frame to be subsetted.
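subset: A logical expression indicating the rows to keep. Rows for which it evaluates to NA are dropped.
select: An expression indicating the columns to keep. All columns are kept by default.
drop: Passed on to the [ indexing operator. If TRUE, the result will be coerced to the lowest possible dimension.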

Let’s take a look at an example of the subset() function in action.

First, let’s load the mtcars dataset:

data(mtcars)

Then, let’s subset the data to include only cars with 6 cylinders and horsepower greater than 110:

selected_data <- subset(mtcars, cyl == 6 & hp > 110)

The resulting ‘selected_data’ data frame will contain only the rows from the ‘mtcars’ dataset where the ‘cyl’ column value is 6 and the ‘hp’ column value is greater than 110.
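As a quick check of this result, and to illustrate the select argument described above, the following sketch counts the returned rows and then repeats the call while keeping only three columns. The object name selected_cols and the choice of columns are purely illustrative.

```r
# Load the built-in mtcars dataset
data(mtcars)

# Rows with 6 cylinders and more than 110 horsepower
selected_data <- subset(mtcars, cyl == 6 & hp > 110)
nrow(selected_data)       # number of matching cars
rownames(selected_data)   # names of the matching cars

# Same row condition, but keep only the mpg, cyl, and hp columns
selected_cols <- subset(mtcars, cyl == 6 & hp > 110, select = c(mpg, cyl, hp))
names(selected_cols)
```

Note that the column names in select are written without quotes, in the same way as the column names in the row condition.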

Now that we have discussed the subset() function, let’s move on to the filter() function in the dplyr package.

Introduction to the Filter Function

The filter() function is part of the dplyr package and is also used for subsetting data frames based on specified conditions. The syntax of the filter() function is as follows:

filter(.data, ..., .preserve = FALSE)

Where:

.data: The data frame to be filtered.
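...: One or more conditions for selecting rows. Multiple conditions are combined with a logical AND.
.preserve: Relevant for grouped data frames. If FALSE (the default), the grouping structure is recalculated based on the filtered data; if TRUE, the grouping is kept as is.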

Before using the filter() function, you must install and load the dplyr package:

# Install dplyr package if not already installed
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
 
# Load dplyr package
library(dplyr)

Let’s take a look at an example of the filter() function in action.

# Load the mtcars dataset
data(mtcars)
 
# Filter the data to include only cars with 6 cylinders and horsepower greater than 110
selected_data <- mtcars %>% filter(cyl == 6 & hp > 110)

The resulting ‘selected_data’ data frame will contain only the rows from the ‘mtcars’ dataset where the ‘cyl’ column value is 6 and the ‘hp’ column value is greater than 110, similar to the subset() function.
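The filter() function also accepts several conditions as separate arguments, which it combines with a logical AND. The sketch below compares both ways of writing the condition and checks the result against subset(); the object names with_and, with_comma, and base_result are only chosen for illustration.

```r
library(dplyr)

# Conditions joined with & in a single expression
with_and <- filter(mtcars, cyl == 6 & hp > 110)

# The same conditions passed as separate arguments
with_comma <- filter(mtcars, cyl == 6, hp > 110)

# Both calls should select the same rows
identical(with_and, with_comma)

# Compare one column against the base R result
base_result <- subset(mtcars, cyl == 6 & hp > 110)
all.equal(base_result$hp, with_comma$hp)
```

Which style to use is mostly a matter of taste; the comma form tends to be easier to read when many conditions are involved.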

With both the subset() and filter() functions explained, let’s compare them in the conclusion.

Conclusion

In summary, both the subset() and filter() functions are used for subsetting data frames based on specified conditions. The main differences between them are:

Package: subset() is part of base R, while filter() is part of the dplyr package.
Syntax: The filter() function accepts several conditions separated by commas and is often chained with other dplyr functions through the pipe operator (%>%), while subset() takes a single logical expression plus an optional select argument for choosing columns.
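Grouped data: The filter() function works on grouped data frames, and its .preserve argument controls whether the grouping structure is kept as is or recalculated after filtering. The subset() function has no similar option. Both functions return the remaining rows in their original order.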
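The effect of the .preserve argument can be explored with a grouped data frame. The following is a minimal sketch, assuming a grouping by the cyl column and an illustrative threshold of 200 horsepower; the object names are hypothetical.

```r
library(dplyr)

# Group the cars by number of cylinders (three groups: 4, 6, and 8)
grouped <- group_by(mtcars, cyl)

# Default: the grouping structure is recalculated after filtering
recalculated <- filter(grouped, hp > 200)

# .preserve = TRUE: the original grouping structure is kept as is
preserved <- filter(grouped, hp > 200, .preserve = TRUE)

# Compare the number of groups in both results
n_groups(recalculated)
n_groups(preserved)

# The selected rows appear in the same order in both results
identical(recalculated$hp, preserved$hp)
```

Since only 8-cylinder cars exceed this threshold in mtcars, the two results contain the same rows and differ only in how many groups they report.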

Overall, while the subset() function is more accessible since it is part of base R, the filter() function is more flexible and can be easily combined with other dplyr functions, making it the preferred choice for many R users.
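To illustrate how filter() combines with other dplyr functions, here is a short sketch of a pipeline. The chosen columns and the sort order are assumptions made for the example, not part of the comparison above.

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl == 6, hp > 110) %>%   # keep the matching rows
  select(mpg, hp) %>%              # keep two columns
  arrange(desc(mpg))               # sort by miles per gallon, highest first

result
```

Each step receives the output of the previous one, so the pipeline can be read from top to bottom.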

Further Resources

In case you are eager to learn more about how to filter and subset your data in R programming, you may have a look at the following list of tutorials: