Module 8 – Working with Rows
In Module 8, you’ll learn how to effectively manage rows within your data sets. This part covers essential techniques for filtering rows based on specific conditions, sorting them in a particular order, and incorporating new rows into your tibbles. By mastering these skills, you can refine your data analysis and ensure that your data sets are organized according to your analytical requirements.
Video Lecture
Exercises
The following exercises are based on the CSV file “My-Programming-Languages”. Please download it here and import it into R. You might watch the lecture on Importing & Exporting Data Using dplyr & readr to get further instructions on how to import external data as well as some information on the content of this data set.
- Filter rows to show only data where
R
‘s popularity is greater than 1.0. - Sort the tibble in descending order based on
Python
‘s popularity. - Find the month where
Matlab
had its highest popularity. - Identify which language has the greatest absolute increase in popularity from the first to the last date in the data set.
- Randomly sample 5% of rows and calculate the average popularity of
R
andPython
for this sample. Don’t forget to set a random seed before. - Add a new row predicting that by January 2024,
R
has become the dominant programming language, universally adopted by 100% of the programming community.
The solutions to these exercises can be found at the bottom of this page.
R Code of This Lecture
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages my_tib <- tibble(x1 = c(1, 3, 5, 3, 3, 2, 4, 2), # Define variables in tibble() x2 = 11:18, x3 = c("a", "b", "a", "c", "b", "b", "a", "c"), x4 = "x") my_tib # Print tibble tib_row_filter <- my_tib %>% # Select rows conditionally filter(x1 == 3) # Apply filter() function tib_row_filter # Print new tibble tib_row_filter2 <- my_tib %>% # Select rows conditionally filter(x1 == 3 | x3 != "a") # Multiple filter() conditions tib_row_filter2 # Print new tibble tib_row_slice <- my_tib %>% # Select rows based on index slice(c(4, 6, 7)) # Apply slice() function tib_row_slice # Print new tibble tib_row_head <- my_tib %>% # Select top rows slice_head(n = 4) # Apply slice_head() function tib_row_head # Print new tibble tib_row_tail <- my_tib %>% # Select bottom rows slice_tail(n = 4) # Apply slice_tail() function tib_row_tail # Print new tibble tib_row_min <- my_tib %>% # Rows with lowest values slice_min(x1, n = 3) # Apply slice_min() function tib_row_min # Print new tibble tib_row_max <- my_tib %>% # Rows with highest values slice_max(x1, n = 3) # Apply slice_max() function tib_row_max # Print new tibble set.seed(3532355) # Ensure reproducibility tib_row_sample <- my_tib %>% # Randomly sample n rows sample_n(3) # Apply sample_n() function tib_row_sample # Print new tibble tib_row_sample2 <- my_tib %>% # Sample percentage of rows sample_frac(0.5) # Apply sample_frac() function tib_row_sample2 # Print new tibble tib_row_arrange <- my_tib %>% # Reorder rows arrange(x1) # Apply arrange() function tib_row_arrange # Print new tibble tib_row_arrange2 <- my_tib %>% # Select rows arrange(desc(x1)) # arrange() & desc() functions tib_row_arrange2 # Print new tibble new_row <- tibble(x1 = 11, # Create tibble with new row x2 = 22, x3 = "aa", x4 = "bb") new_row # Print new row tib_row_add <- my_tib %>% # Add one row bind_rows(new_row) # Apply bind_rows() function tib_row_add # Print new tibble new_row2 <- tibble(x1 = c(11, 111), # Create tibble with new rows x2 = c(22, 222), x3 = c("aa", "aaa"), x4 = c("bb", "bbb")) new_row2 # Print new rows tib_row_add2 <- my_tib %>% # Add one row bind_rows(new_row2) # Apply bind_rows() function tib_row_add2 # Print new tibble tib_row_multi <- my_tib %>% # Apply multiple operations filter(x1 != 1) %>% # Filter rows slice_head(n = 5) %>% # Extract top rows arrange(x3) # Reorder alphabetically tib_row_multi # Print new tibble
Exercise Solutions
Before we can jump into the exercises, we once again have to load the data set that we have created in the module Importing & Exporting Data Using dplyr & readr.
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/" # Specify directory path my_languages_tib <- read_csv(str_c(my_path, # Import CSV file "My-Programming-Languages.csv")) my_languages_tib # Print tibble
Now, we can move on to the exercises.
Exercise 1: Filter rows to show only data where R
‘s popularity is greater than 1.0.
r_popularity_tib <- my_languages_tib %>% # Filter rows with R popularity > 1.0 filter(R > 1.0) r_popularity_tib # Print to console
In the above solution, we used the filter()
function to filter the rows of the tibble, where R’s popularity is greater than 1.0
Exercise 2: Sort the tibble in descending order based on Python
‘s popularity.
sort_descending_tib <- my_languages_tib %>% # Sort rows in descending order of Python's popularity arrange(desc(Python)) sort_descending_tib # Print to console
Here we have sorted the tibble in descending order based on Python
‘s popularity scores using the arrange()
and desc()
functions.
Exercise 3: Find the month where Matlab
had its highest popularity.
matlab_month_tib <- my_languages_tib %>% # Find row (month) where Matlab is the most popular slice_max(Matlab) matlab_month_tib # Print to console
In the solution above, we used the slice_max()
function to find the year in which Matlab
had its highest popularity, which is the year 2010.
Exercise 4: Identify which language has the greatest absolute increase in popularity from the first to the last date in the data set.
start_popularity_tib <- my_languages_tib %>% # Find the first date row slice_head(n = 1) start_popularity_tib # Print to console end_popularity_tib <- my_languages_tib %>% # Find the last date row slice_tail(n = 1) end_popularity_tib # Print to console and compare with the previous
Here we first used the slice_head()
function to get the first row of the tibble and then used the slice_tail()
function to get the last row of the tibble.
That way, we can identify the programming language that has the greatest increase in popularity over time, and that is the Python programming language.
The solution above is quite simple and often already sufficient. However, when dealing with larger data sets, we might prefer a more automated code without manual comparison. We can use the following pipe to return only the programming language with the highest increase.
max_increase_language <- my_languages_tib %>% # Create tibble for output pivot_longer(cols = - Date) %>% # Transform to long data group_by(name) %>% # Group by languages summarize(Change = last(value) - first(value)) %>% # Calculate change ungroup() %>% # Ungroup data arrange(desc(Change)) %>% # Order from highest to lowest slice(1) # Keep only highest value max_increase_language # Print final result
This pipe might be a bit difficult to understand at this point of the course. However, we will talk about those functions and methodologies in more detail in the following modules.
Exercise 5: Randomly sample 5% of rows and calculate the average popularity of R
and Python
for this sample. Don’t forget to set a random seed before.
set.seed(12345) # Set random seed five_percent_rows_tib <- my_languages_tib %>% # Find the average Python and R popularity in sampled rows (5%) sample_frac(0.05) %>% mutate(Avg_pop_python = mean(Python), Avg_pop_r = mean(R)) five_percent_rows_tib # Print to console
Here, we first set a random seed to ensure that our result is reproducible, then we use the slice_frac()
function to randomly select 5% of the number of rows in the tibble. Then, we use the mutate()
function to create new columns in the tibble, which contain the average popularity for Python and R. With the pipe operator, we are able to run more than one function at the same time.
Exercise 6: Add a new row predicting that by January 2024, R
has become the dominant programming language, universally adopted by 100% of the programming community.
new_tib_row <- tibble( # Create a new row where R has 100% popularity Date = "January 2024", "C/C++" = 0, Java = 0, Julia = 0, Matlab = 0, Python = 0, R = 100, Scala = 0 ) r_dom_tib <- my_languages_tib %>% # Add a new row where R has 100% popularity bind_rows(new_tib_row) r_dom_tib # Print to console
In the above solution, we first create a new tibble that contains the row data that we want to add to the original tibble. After creating the new tibble, we use the bind_row()
function to bind it to the original tibble and save the data as r_dom_tib
. Now, when you print the tibble to the console, you will see the new row added to the bottom of the tibble.
Solutions to these exercises were created in collaboration with Ifeanyi Idiaye and Cansu Kebabci. Thanks to them for their contribution!
Further Resources
- dplyr Documentation – Row-wise operations
- Statistics Globe Article – Select Data Frame Rows based on Values in Vector in R
- Statistics Globe Article – Remove Rows with NA Using dplyr Package in R
- Statistics Globe Article – Filter Data Frame Rows Based On Range of Numbers in R
.
You can access the course overview page, timetable, and table of contents by clicking here.