Split Data into Train & Test Sets in R (Example)

This article explains how to divide a data frame into training and testing data sets in the R programming language.

Table of contents:

1) Creation of Example Data
2) Example: Splitting Data into Train & Test Data Sets Using sample() Function
3) Video & Further Resources

Let’s dive right into the example!

Creation of Example Data

As a first step, we’ll have to define some example data:

set.seed(92734)                                    # Create example data
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))
head(data)                                         # First rows of example data
#           x1         x2
# 1  0.1016225  1.2073856
# 2 -0.8834578 -1.9778300
# 3 -1.2039263 -0.9865854
# 4  1.4898048  0.4344165
# 5  0.2844304  0.6180946
# 6  0.3927014  2.3363394

The RStudio console output above shows the structure of our example data: it consists of 1000 rows and two numeric columns, x1 and x2.
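
If you prefer to confirm this structure in code rather than by eye, the following short sketch recreates the example data with the same seed and checks its dimensions and column classes. It uses only base R functions.

```r
set.seed(92734)                                    # Recreate example data
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))

dim(data)                                          # Number of rows and columns
# 1000    2

sapply(data, class)                                # Class of each column
#        x1        x2 
# "numeric" "numeric"
```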

Let’s split these data!

Example: Splitting Data into Train & Test Data Sets Using sample() Function

In this example, I’ll illustrate how to use the sample() function to divide a data frame into training and test data in R.

First, we have to create a dummy variable that indicates whether each row is assigned to the training or the testing data set.

At this point, we also specify the share of rows that should be assigned to each data set (i.e. 70% training data and 30% testing data).

split_dummy <- sample(c(rep(0, 0.7 * nrow(data)),  # Create dummy for splitting
                        rep(1, 0.3 * nrow(data))))
split_dummy                                        # Print dummy
# 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 ...

Let’s double-check the frequencies of our dummy:

table(split_dummy)                                 # Table of dummy
#   0   1 
# 700 300

As you can see, the dummy indicates that 700 observations will be assigned to the training data (i.e. 0) and 300 cases will be assigned to the testing data (i.e. 1).
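
Note that the rep() calls in this example work out cleanly because 0.7 * 1000 and 0.3 * 1000 are whole numbers. The R documentation states that non-integer values of the times argument are truncated towards zero, so with other row counts the dummy may end up shorter than the data. The following is a minimal sketch of one possible workaround, assuming a hypothetical data frame data_odd with 1001 rows (data_odd, n_train, and split_dummy_odd are made-up names that are not part of the original example): the training size is rounded once, and all remaining rows go to the test data.

```r
set.seed(92734)                                    # Seed for reproducibility
data_odd <- data.frame(x1 = rnorm(1001),           # Hypothetical data with 1001 rows
                       x2 = rnorm(1001))

n <- nrow(data_odd)                                # Total number of rows
n_train <- round(0.7 * n)                          # Round training size once
split_dummy_odd <- sample(rep(c(0, 1),             # Remaining rows go to test data
                              times = c(n_train, n - n_train)))

length(split_dummy_odd) == n                       # Dummy covers every row
# TRUE
```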

Now, we can create a train data set as shown below:

data_train <- data[split_dummy == 0, ]             # Create train data

Let’s have a look at the first rows of our training data:

head(data_train)                                   # First rows of train data
#           x1         x2
# 1  0.1016225  1.2073856
# 3 -1.2039263 -0.9865854
# 4  1.4898048  0.4344165
# 5  0.2844304  0.6180946
# 6  0.3927014  2.3363394
# 7 -2.1504326 -3.2133342

As you can see in the previous RStudio console output, the rows 1, 3, 4, 5, 6, and 7 were assigned to the training data.

We can do the same to define our test data:

data_test <- data[split_dummy == 1, ]              # Create test data

Let’s also print the head of this data set:

head(data_test)                                    # First rows of test data
#            x1          x2
# 2  -0.8834578 -1.97783004
# 19  1.5163172  1.01214201
# 22  0.8645558  0.13248672
# 26 -1.0282859  1.95519235
# 30  0.7787774 -1.67456341
# 33  0.0673346  0.08527785
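
Before working with the two data sets, it can be worth confirming that every row ended up in exactly one of them. The check below is a small sketch that repeats the steps of this example and then compares row counts and row names. It is an added sanity check, not part of the original code.

```r
set.seed(92734)                                    # Repeat the steps of the example
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))
split_dummy <- sample(c(rep(0, 0.7 * nrow(data)),
                        rep(1, 0.3 * nrow(data))))
data_train <- data[split_dummy == 0, ]
data_test <- data[split_dummy == 1, ]

nrow(data_train) + nrow(data_test) == nrow(data)   # All rows are used
# TRUE

length(intersect(rownames(data_train),             # Number of rows in both sets
                 rownames(data_test)))
# 0
```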

Looks good! Now, you can use these data sets to run statistical methods such as machine learning algorithms or A/B tests.
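
A common alternative to the dummy approach is to sample row indices directly. The sketch below assumes the same example data and the same 70/30 split; the object names train_index, data_train_alt, and data_test_alt are made up for this illustration. Because the random draws differ from the dummy approach, the selected rows will not match the output shown in this article, and the training rows appear in sampled order rather than in their original order.

```r
set.seed(92734)                                    # Recreate example data
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))

train_index <- sample(seq_len(nrow(data)),         # Draw row indices without replacement
                      size = round(0.7 * nrow(data)))
data_train_alt <- data[train_index, ]              # Sampled rows form the train data
data_test_alt <- data[-train_index, ]              # All other rows form the test data

c(nrow(data_train_alt), nrow(data_test_alt))       # Sizes of both data sets
# 700 300
```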

Video & Further Resources

Do you need further explanations of the R code in this article? Then you might want to watch the following video on my YouTube channel. In the video, I explain the examples of this tutorial in RStudio.

The YouTube video will be added soon.

Furthermore, you may want to read the related articles on my website.

In summary: At this point, you should have learned how to split data into train and test sets in R. Note that you may use a similar approach to create a validation set as well. Please let me know in the comments below if you have further questions or comments.
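
As a sketch of the validation set idea mentioned in the summary, the dummy can simply take three values instead of two. The 60/20/20 proportions and the object names split_dummy3, data_train3, data_valid3, and data_test3 are assumptions chosen for this illustration, not part of the example above.

```r
set.seed(92734)                                    # Recreate example data
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))

n <- nrow(data)                                    # Total number of rows
n_train <- round(0.6 * n)                          # Assumed 60% training data
n_valid <- round(0.2 * n)                          # Assumed 20% validation data
n_test <- n - n_train - n_valid                    # Remaining rows for testing

split_dummy3 <- sample(rep(0:2,                    # 0 = train, 1 = validation, 2 = test
                           times = c(n_train, n_valid, n_test)))

data_train3 <- data[split_dummy3 == 0, ]           # Create train data
data_valid3 <- data[split_dummy3 == 1, ]           # Create validation data
data_test3 <- data[split_dummy3 == 2, ]            # Create test data

table(split_dummy3)                                # Table of three-level dummy
#   0   1   2 
# 600 200 200
```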

2 Comments

Scott
December 20, 2021 9:51 pm

When I run this script, split_dummy # Print dummy
# 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 … and not as you have indicated,
split_dummy # Print dummy
# 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 …

therefore output of head(data_train) and head(data_test) won’t be the same as yours. I am getting the right ratio (700:300) but not the same order in the split_dummy sequence.

- Joachim
  December 21, 2021 9:10 am
  
  Hey Scott,
  
  Thanks a lot for the hint, it seems like I have messed something up with the seed.
  
  I have just fixed it.
  
  Thanks again,
  Joachim
  