Split Data into Train & Test Sets in R (Example)
This article explains how to divide a data frame into training and testing data sets in the R programming language.
Let’s dive right into the example!
Creation of Example Data
As a first step, we’ll have to define some example data:
set.seed(92734)                                    # Create example data
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))
head(data)                                         # First rows of example data
#           x1         x2
# 1  0.1016225  1.2073856
# 2 -0.8834578 -1.9778300
# 3 -1.2039263 -0.9865854
# 4  1.4898048  0.4344165
# 5  0.2844304  0.6180946
# 6  0.3927014  2.3363394
The previous RStudio console output shows the structure of our example data: it consists of two numeric columns, x1 and x2, and 1000 rows.
Let’s split these data!
Example: Splitting Data into Train & Test Data Sets Using sample() Function
In this example, I’ll illustrate how to use the sample function to divide a data frame into training and test data in R.
First, we have to create a dummy variable that indicates whether a row is assigned to the training or testing data set.
At this point, we also specify the share of rows that should be assigned to each data set (i.e. 70% training data and 30% testing data).
split_dummy <- sample(c(rep(0, 0.7 * nrow(data)),  # Create dummy for splitting
                        rep(1, 0.3 * nrow(data))))
split_dummy                                        # Print dummy
# 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 ...
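The group sizes above work out to whole numbers because the example data has exactly 1000 rows. With other row counts, 0.7 * nrow(data) can be fractional, and rep() would truncate it, so some rows could end up without a label. The following is a minimal sketch of one way to handle this, assuming a hypothetical data frame data_odd with 1003 rows. The names data_odd, n_train, n_test, and split_dummy_odd, as well as the seed value 123, are illustrative and not part of the original example. Setting the seed directly before sample() also makes the shuffle repeatable.

```r
set.seed(92734)                                    # Seed for the hypothetical data
data_odd <- data.frame(x1 = rnorm(1003),           # Hypothetical data with 1003 rows
                       x2 = rnorm(1003))

n_train <- round(0.7 * nrow(data_odd))             # Number of training rows
n_test <- nrow(data_odd) - n_train                 # All remaining rows go to the test set

set.seed(123)                                      # Arbitrary seed, set right before sampling
split_dummy_odd <- sample(c(rep(0, n_train),       # Shuffled 0/1 indicator
                            rep(1, n_test)))

length(split_dummy_odd) == nrow(data_odd)          # TRUE if every row has a label
```

Computing the test size as the remainder, rather than as a second rounded product, guarantees that both counts add up to the number of rows.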
Let’s double-check the frequencies of our dummy:
table(split_dummy)                                 # Table of dummy
#   0   1
# 700 300
As you can see, the dummy indicates that 700 observations will be assigned to the training data (i.e. 0) and 300 cases will be assigned to the testing data (i.e. 1).
Now, we can create a train data set as shown below:
data_train <- data[split_dummy == 0, ] # Create train data
Let’s have a look at the first rows of our training data:
head(data_train)                                   # First rows of train data
#           x1         x2
# 1  0.1016225  1.2073856
# 3 -1.2039263 -0.9865854
# 4  1.4898048  0.4344165
# 5  0.2844304  0.6180946
# 6  0.3927014  2.3363394
# 7 -2.1504326 -3.2133342
As you can see in the previous RStudio console output, the rows 1, 3, 4, 5, 6, and 7 were assigned to the training data.
We can do the same to define our test data:
data_test <- data[split_dummy == 1, ] # Create test data
Let’s also print the head of this data set:
head(data_test)                                    # First rows of test data
#            x1          x2
# 2  -0.8834578 -1.97783004
# 19  1.5163172  1.01214201
# 22  0.8645558  0.13248672
# 26 -1.0282859  1.95519235
# 30  0.7787774 -1.67456341
# 33  0.0673346  0.08527785
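Before using the two data sets, it can be worth checking that no row appears in both and that no row was lost. The following sketch recreates the objects of this example and compares the row names, which keep the original row numbers after subsetting. The helper objects rows_train and rows_test are illustrative names and not part of the original example.

```r
set.seed(92734)                                    # Recreate example data
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))
split_dummy <- sample(c(rep(0, 0.7 * nrow(data)),  # Recreate dummy for splitting
                        rep(1, 0.3 * nrow(data))))
data_train <- data[split_dummy == 0, ]             # Train data
data_test <- data[split_dummy == 1, ]              # Test data

rows_train <- rownames(data_train)                 # Original row numbers in train data
rows_test <- rownames(data_test)                   # Original row numbers in test data

length(intersect(rows_train, rows_test)) == 0      # TRUE if no row is in both sets
nrow(data_train) + nrow(data_test) == nrow(data)   # TRUE if every row was assigned
```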
Looks good! Now, you can use these data sets to apply statistical methods such as machine learning algorithms or A/B tests.
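As a rough illustration of how the two data sets might be used, the sketch below fits a linear regression on the training data and evaluates it on the test data. The choice of lm() with x2 as the outcome and the root mean squared error as the metric are assumptions made for this illustration. The names mod, pred_test, and rmse_test are hypothetical. Because x1 and x2 are independent random draws, the model is not expected to predict well; the point is only the workflow.

```r
set.seed(92734)                                    # Recreate example data
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))
split_dummy <- sample(c(rep(0, 0.7 * nrow(data)),  # Recreate dummy for splitting
                        rep(1, 0.3 * nrow(data))))
data_train <- data[split_dummy == 0, ]             # Train data
data_test <- data[split_dummy == 1, ]              # Test data

mod <- lm(x2 ~ x1, data = data_train)              # Fit model on train data only
pred_test <- predict(mod, newdata = data_test)     # Predict x2 for the test rows
rmse_test <- sqrt(mean((data_test$x2 - pred_test)^2))  # Prediction error on test data
rmse_test                                          # Print test error
```

Keeping the test rows out of the model fitting step is what makes the test error an estimate of performance on unseen data.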
Video & Further Resources
Do you need further explanations on the R code of this article? Then you might want to watch the following video on my YouTube channel. In the video, I explain the examples of this tutorial in RStudio.
The YouTube video will be added soon.
Furthermore, you may want to read the related articles on my website.
- sample Function in R
- Split Data Frame into List of Data Frames Based On ID Column
- Split Data Frame Variable into Multiple Columns
- Introduction to R
In summary: At this point you should have learned how to split data into train and test sets in R. Note that you may use a similar approach to create a validation set as well. Please let me know in the comments below if you have further questions and/or comments.
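The validation set mentioned above could be created by extending the dummy to three values. The following is a minimal sketch, assuming a 60% / 20% / 20% split. The ratio and the names n_train, n_valid, n_test, split_dummy3, data_train3, data_valid3, and data_test3 are illustrative and not part of the original example.

```r
set.seed(92734)                                    # Recreate example data
data <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000))

n <- nrow(data)                                    # Total number of rows
n_train <- round(0.6 * n)                          # Rows for train data
n_valid <- round(0.2 * n)                          # Rows for validation data
n_test <- n - n_train - n_valid                    # Remaining rows for test data

split_dummy3 <- sample(c(rep(0, n_train),          # Shuffled indicator with three values
                         rep(1, n_valid),
                         rep(2, n_test)))

data_train3 <- data[split_dummy3 == 0, ]           # Create train data
data_valid3 <- data[split_dummy3 == 1, ]           # Create validation data
data_test3 <- data[split_dummy3 == 2, ]            # Create test data

table(split_dummy3)                                # Frequencies of the three groups
```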
2 Comments
When I run this script, split_dummy # Print dummy
# 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 … and not as you have indicated,
split_dummy # Print dummy
# 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 …
therefore output of head(data_train) and head(data_test) won’t be the same as yours. I am getting the right ratio (700:300) but not the same order in the split_dummy sequence.
Hey Scott,
Thanks a lot for the hint, it seems like I have messed something up with the seed.
I have just fixed it.
Thanks again,
Joachim