Module 8 – Working with Rows

In Module 8, you’ll learn how to effectively manage rows within your data sets. This part covers essential techniques for filtering rows based on specific conditions, sorting them in a particular order, and incorporating new rows into your tibbles. By mastering these skills, you can refine your data analysis and ensure that your data sets are organized according to your analytical requirements.

Video Lecture

Exercises

The following exercises are based on the CSV file “My-Programming-Languages”. Please download it here and import it into R. You might watch the lecture on Importing & Exporting Data Using dplyr & readr to get further instructions on how to import external data as well as some information on the content of this data set.

Filter rows to show only data where R‘s popularity is greater than 1.0.
Sort the tibble in descending order based on Python‘s popularity.
Find the month where Matlab had its highest popularity.
Identify which language has the greatest absolute increase in popularity from the first to the last date in the data set.
Randomly sample 5% of rows and calculate the average popularity of R and Python for this sample. Don’t forget to set a random seed before.
Add a new row predicting that by January 2024, R has become the dominant programming language, universally adopted by 100% of the programming community.

The solutions to these exercises can be found at the bottom of this page.

R Code of This Lecture

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_tib <- tibble(x1 = c(1, 3, 5, 3, 3, 2, 4, 2),  # Define variables in tibble()
                 x2 = 11:18,
                 x3 = c("a", "b", "a", "c", "b", "b", "a", "c"),
                 x4 = "x")
my_tib                                            # Print tibble
 
tib_row_filter <- my_tib %>%                      # Select rows conditionally
  filter(x1 == 3)                                 # Apply filter() function
tib_row_filter                                    # Print new tibble
 
tib_row_filter2 <- my_tib %>%                     # Select rows conditionally
  filter(x1 == 3 | x3 != "a")                     # Multiple filter() conditions
tib_row_filter2                                   # Print new tibble
 
tib_row_slice <- my_tib %>%                       # Select rows based on index
  slice(c(4, 6, 7))                               # Apply slice() function
tib_row_slice                                     # Print new tibble
 
tib_row_head <- my_tib %>%                        # Select top rows
  slice_head(n = 4)                               # Apply slice_head() function
tib_row_head                                      # Print new tibble
 
tib_row_tail <- my_tib %>%                        # Select bottom rows
  slice_tail(n = 4)                               # Apply slice_tail() function
tib_row_tail                                      # Print new tibble
 
tib_row_min <- my_tib %>%                         # Rows with lowest values
  slice_min(x1, n = 3)                            # Apply slice_min() function
tib_row_min                                       # Print new tibble
 
tib_row_max <- my_tib %>%                         # Rows with highest values
  slice_max(x1, n = 3)                            # Apply slice_max() function
tib_row_max                                       # Print new tibble
 
set.seed(3532355)                                 # Ensure reproducibility
 
tib_row_sample <- my_tib %>%                      # Randomly sample n rows
  sample_n(3)                                     # Apply sample_n() function
tib_row_sample                                    # Print new tibble
 
tib_row_sample2 <- my_tib %>%                     # Sample percentage of rows
  sample_frac(0.5)                                # Apply sample_frac() function
tib_row_sample2                                   # Print new tibble
 
tib_row_arrange <- my_tib %>%                     # Reorder rows
  arrange(x1)                                     # Apply arrange() function
tib_row_arrange                                   # Print new tibble
 
tib_row_arrange2 <- my_tib %>%                    # Select rows
  arrange(desc(x1))                               # arrange() & desc() functions
tib_row_arrange2                                  # Print new tibble
 
new_row <- tibble(x1 = 11,                        # Create tibble with new row
                  x2 = 22,
                  x3 = "aa",
                  x4 = "bb")
new_row                                           # Print new row
 
tib_row_add <- my_tib %>%                         # Add one row
  bind_rows(new_row)                              # Apply bind_rows() function
tib_row_add                                       # Print new tibble
 
new_row2 <- tibble(x1 = c(11, 111),               # Create tibble with new rows
                   x2 = c(22, 222),
                   x3 = c("aa", "aaa"),
                   x4 = c("bb", "bbb"))
new_row2                                          # Print new rows
 
tib_row_add2 <- my_tib %>%                        # Add one row
  bind_rows(new_row2)                             # Apply bind_rows() function
tib_row_add2                                      # Print new tibble
 
tib_row_multi <- my_tib %>%                       # Apply multiple operations
  filter(x1 != 1) %>%                             # Filter rows
  slice_head(n = 5) %>%                           # Extract top rows
  arrange(x3)                                     # Reorder alphabetically
tib_row_multi                                     # Print new tibble

Exercise Solutions

Before we can jump into the exercises, we once again have to load the data set that we have created in the module Importing & Exporting Data Using dplyr & readr.

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
my_languages_tib <- read_csv(str_c(my_path,       # Import CSV file
                                   "My-Programming-Languages.csv"))
my_languages_tib                                  # Print tibble

Now, we can move on to the exercises.

Exercise 1: Filter rows to show only data where R‘s popularity is greater than 1.0.

r_popularity_tib <- my_languages_tib %>%          # Filter rows with R popularity > 1.0
  filter(R > 1.0)
 
r_popularity_tib		                  # Print to console

In the above solution, we used the filter() function to filter the rows of the tibble, where R’s popularity is greater than 1.0

Exercise 2: Sort the tibble in descending order based on Python‘s popularity.

sort_descending_tib <- my_languages_tib %>%       # Sort rows in descending order of Python's popularity
  arrange(desc(Python))
 
sort_descending_tib		                  # Print to console

Here we have sorted the tibble in descending order based on Python‘s popularity scores using the arrange() and desc() functions.

Exercise 3: Find the month where Matlab had its highest popularity.

matlab_month_tib <- my_languages_tib %>%          # Find row (month) where Matlab is the most popular
  slice_max(Matlab)
 
matlab_month_tib			          # Print to console

In the solution above, we used the slice_max() function to find the year in which Matlab had its highest popularity, which is the year 2010.

Exercise 4: Identify which language has the greatest absolute increase in popularity from the first to the last date in the data set.

start_popularity_tib <- my_languages_tib %>%      # Find the first date row
  slice_head(n = 1)
 
start_popularity_tib		                  # Print to console
 
end_popularity_tib <- my_languages_tib %>%        # Find the last date row
  slice_tail(n = 1)
 
end_popularity_tib		                  # Print to console and compare with the previous

Here we first used the slice_head() function to get the first row of the tibble and then used the slice_tail() function to get the last row of the tibble.
That way, we can identify the programming language that has the greatest increase in popularity over time, and that is the Python programming language.

The solution above is quite simple and often already sufficient. However, when dealing with larger data sets, we might prefer a more automated code without manual comparison. We can use the following pipe to return only the programming language with the highest increase.

max_increase_language <- my_languages_tib %>%     # Create tibble for output
  pivot_longer(cols = - Date) %>%                 # Transform to long data
  group_by(name) %>%                              # Group by languages
  summarize(Change = last(value) - first(value)) %>%  # Calculate change
  ungroup() %>%                                   # Ungroup data
  arrange(desc(Change)) %>%                       # Order from highest to lowest
  slice(1)                                        # Keep only highest value
 
max_increase_language                             # Print final result

This pipe might be a bit difficult to understand at this point of the course. However, we will talk about those functions and methodologies in more detail in the following modules.

Exercise 5: Randomly sample 5% of rows and calculate the average popularity of R and Python for this sample. Don’t forget to set a random seed before.

set.seed(12345)                                   # Set random seed
 
five_percent_rows_tib <- my_languages_tib %>%     # Find the average Python and R popularity in sampled rows (5%)
  sample_frac(0.05) %>% 
  mutate(Avg_pop_python = mean(Python),
         Avg_pop_r = mean(R))
 
five_percent_rows_tib		                  # Print to console

Here, we first set a random seed to ensure that our result is reproducible, then we use the slice_frac() function to randomly select 5% of the number of rows in the tibble. Then, we use the mutate() function to create new columns in the tibble, which contain the average popularity for Python and R. With the pipe operator, we are able to run more than one function at the same time.

Exercise 6: Add a new row predicting that by January 2024, R has become the dominant programming language, universally adopted by 100% of the programming community.

new_tib_row <- tibble(                            # Create a new row where R has 100% popularity
  Date = "January 2024",
  "C/C++" = 0,
  Java = 0,
  Julia = 0,
  Matlab = 0,
  Python = 0,
  R = 100,
  Scala = 0
)
 
r_dom_tib <- my_languages_tib %>%                 # Add a new row where R has 100% popularity        
  bind_rows(new_tib_row)
 
r_dom_tib			                  # Print to console

In the above solution, we first create a new tibble that contains the row data that we want to add to the original tibble. After creating the new tibble, we use the bind_row() function to bind it to the original tibble and save the data as r_dom_tib. Now, when you print the tibble to the console, you will see the new row added to the bottom of the tibble.

Solutions to these exercises were created in collaboration with Ifeanyi Idiaye and Cansu Kebabci. Thanks to them for their contribution!

Further Resources

You can access the course overview page, timetable, and table of contents by clicking here.