Module 17 – Handling Character Strings Using dplyr & stringr

Mastery of string manipulation is essential for anyone working with text data in R! Module 17 focuses on handling character strings in R using dplyr and stringr, crucial packages in the tidyverse. The video lecture in this module presents detailed instructions on manipulating and processing character strings in data cleaning and preparation. The R code examples illustrate a range of string operations such as concatenation, pattern detection, string replacement, length calculation, and conversion between lower and upper case. Practical exercises are included at the end of the module to reinforce these techniques.

Video Lecture

Exercises

In this and the next module, we’ll use the same raw data from the National Centers for Environmental Information that we’ve already used in Module 16 (data attribution). However, since we’ll focus on different topics, we’ll prepare the exercise data differently (see below).

In case you’d like to skip the data preparation for these exercises, you may download the final data set here.

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
tib_la <- read_csv(str_c(my_path,                 # Import Los Angeles CSV file
                         "72287493134.csv"))
tib_la                                            # Print tibble
 
tib_ny <- read_csv(str_c(my_path,                 # Import New York CSV file
                         "72504094702.csv"))
tib_ny                                            # Print tibble
 
tib_la_sub <- tib_la %>%                          # Create LA tibble subset
  select(DATE,                                    # Select columns
         NAME,
         REPORT_TYPE,
         HourlyDryBulbTemperature) %>% 
  rename(date = DATE,                             # Rename columns
         loc = NAME,
         type = REPORT_TYPE,
         temp = HourlyDryBulbTemperature)
tib_la_sub                                        # Print LA tibble subset
 
tib_ny_sub <- tib_ny %>%                          # Create NY tibble subset
  select(DATE,                                    # Select columns
         NAME,
         REPORT_TYPE,
         HourlyDryBulbTemperature) %>% 
  rename(date = DATE,                             # Rename columns
         loc = NAME,
         type = REPORT_TYPE,
         temp = HourlyDryBulbTemperature)
tib_ny_sub                                        # Print NY tibble subset
 
tib_br <- bind_rows(tib_la_sub,                   # Combine LA & NY data by rows
                      tib_ny_sub) %>%
  filter(row_number() %% 10 == 0)                 # Reduce size of data
tib_br                                            # Print final data set
 
tib_br %>%                                        # Export CSV file
  write_csv(str_c(my_path, "temp-data-bind-rows.csv"))

Data preparation done! Let’s move on to the exercises:

Import the temperature data set temp-data-bind-rows.csv into R and store it in a tibble named temp_data2.
Create a new column in temp_data2 using str_detect() to identify rows where the loc column contains “DOWNTOWN”.
Use the new column to calculate the percentage of rows containing the word “DOWNTOWN”.
Utilize str_replace() to change “LOS ANGELES” to “LA” in the loc column of temp_data2.
Concatenate the loc and type columns using str_c(), adding a separator, and save this in a new column.
Add the suffix “Degrees Fahrenheit” to the values in the temp column.
Apply str_length() to determine the length of the strings in the newly concatenated column in temp_data2.
Convert the loc column to uppercase using str_to_upper(), and the type column to lowercase using str_to_lower().
Extract the first 5 characters from the loc column using str_sub() and store them in a new column.

The solutions to these exercises can be found at the bottom of this page.

R Code of This Lecture

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
set.seed(172905)                                  # Ensure reproducibility
 
tib_str <- tibble(x = replicate(                  # Create tibble
  10,                                             # Replicate 10 times
  str_c(                                          # Combine characters
    sample(                                       # Apply sample()
      c(letters, LETTERS),                        # Sample from all letters
      sample(1:10, 1)),                           # Sample up to 10 letters
    collapse = "")))                              # Combine without space
tib_str                                           # Print tibble
 
tib_str <- tib_str %>%                            # Concatenate strings
  mutate(str_c = str_c("ID_", x))                 # Apply str_c()
tib_str                                           # Print updated tibble
 
tib_str <- tib_str %>%                            # Detect a pattern
  mutate(str_detect = str_detect(x, "e"))         # Apply str_detect()
tib_str                                           # Print updated tibble
 
tib_str <- tib_str %>%                            # Replace a pattern
  mutate(str_replace = str_replace(x, "e", "99")) # Apply str_replace()
tib_str                                           # Print updated tibble
 
tib_str <- tib_str %>%                            # Get length of strings
  mutate(str_length = str_length(x))              # Apply str_length()
tib_str                                           # Print updated tibble
 
tib_str <- tib_str %>%                            # Convert to lowercase
  mutate(str_to_lower = str_to_lower(x))          # Apply str_to_lower()
tib_str                                           # Print updated tibble
 
tib_str <- tib_str %>%                            # Convert to uppercase
  mutate(str_to_upper = str_to_upper(x))          # Apply str_to_upper()
tib_str                                           # Print updated tibble
 
tib_str <- tib_str %>%                            # All functions in one pipe
  mutate(str_c = str_c("ID_", x)) %>%             # Apply str_c()
  mutate(str_detect = str_detect(x, "e")) %>%     # Apply str_detect()
  mutate(str_replace = str_replace(x, "e", "99")) %>%  # Apply str_replace()
  mutate(str_length = str_length(x)) %>%          # Apply str_length()
  mutate(str_to_lower = str_to_lower(x)) %>%      # Apply str_to_lower()
  mutate(str_to_upper = str_to_upper(x))          # Apply str_to_upper()
tib_str                                           # Print updated tibble
 
tib_str <- tib_str %>%                            # All functions in mutate()
  mutate(str_c = str_c("ID_", x),                 # Apply str_c()
         str_detect = str_detect(x, "e"),         # Apply str_detect()
         str_replace = str_replace(x, "e", "99"), # Apply str_replace()
         str_length = str_length(x),              # Apply str_length()
         str_to_lower = str_to_lower(x),          # Apply str_to_lower()
         str_to_upper = str_to_upper(x))          # Apply str_to_upper()
tib_str                                           # Print updated tibble

Exercise Solutions

Below, you can find our solutions for the exercises of this module. Before beginning the exercises, we will install and load the tidyverse package.

install.packages("tidyverse")	                                        # Install tidyverse packages
library(tidyverse)	                                                # Load tidyverse packages

With the tidyverse packages installed and loaded, we can now proceed to the exercise solutions.

Exercise 1: Import the temperature data set temp-data-bind-rows.csv into R and store it in a tibble named temp_data2.

my_path = "path to CSV file on your computer"		                # Specify directory path
 
temp_data2 <- read_csv(str_c(my_path,"temp-data-bind-rows.csv"))	# Import temperature CSV file
 
temp_data2	                                                        # Print tibble

Above, we imported the CSV file using read_csv() in combination with str_c().

Exercise 2: Create a new column in temp_data2 using str_detect() to identify rows where the loc column contains “DOWNTOWN”.

temp_data2 <- temp_data2 %>%		                                # Detect a pattern
  mutate(str_detect = str_detect(loc,"DOWNTOWN"))	                # Apply str_detect()
 
temp_data2	                                                        # Print updated tibble

In the above example, we used the str_detect() function in combination with the mutate() function to create a new column str_detect in the tibble which identifies the rows in loc that contain the word “DOWNTOWN”. If “DOWNTOWN” exists in the row, then it returns TRUE in the column.

Exercise 3: Use the new column to calculate the percentage of rows containing the word “DOWNTOWN”.

percentage_rows <- temp_data2 %>%                                       # Define tibble
  filter(str_detect == "TRUE") %>% 	                                # Filter tibble by logical constant
  count(str_detect) %>%		                                        # Count number of rows containing logical constant
  mutate(percentage = (n/nrow(temp_data2) * 100))	                # Calculate percentage
 
percentage_rows		                                                # Print tibble

Here we first used the filter() function to filter the tibble using the str_detect column, then we used the count() function to count the number of rows containing TRUE values.

After that, we created a new column using the mutate() function, where we divided the count by the number of rows in the tibble to get the percentage of rows containing the word “DOWNTOWN”.

Exercise 4: Utilize str_replace() to change “LOS ANGELES” to “LA” in the loc column of temp_data2.

temp_data2 <- temp_data2 %>%		                               # Replace a pattern 
  mutate(loc = str_replace(loc,"LOS ANGELES","LA"))	               # Apply str_replace()
 
temp_data2	                                                       # Print updated tibble

In the example above, we use the str_replace() function in combination with the mutate() function to find and replace “LOS ANGELES” with “LA” in the column loc.

Exercise 5: Concatenate the loc and type columns using str_c(), adding a separator, and save this in a new column.

temp_data2 <- temp_data2 %>%		                               # Concatenate strings 
  mutate(loc_type = str_c(loc,type,sep = "+"))		               # Apply str_c()
 
temp_data2	                                                       # Print updated tibble

Above, we used the str_c() function in combination with the mutate() function to concatenate the rows of the loc and type columns, separated by the + sign, and save that in a new column called loc_type. Note: Using another separator would also be fine.

Exercise 6: Add the suffix “Degrees Fahrenheit” to the values in the temp column.

temp_data2 <- temp_data2 %>%		                               # Concatenate strings 
  mutate(temp = str_c("Degrees Fahrenheit ",temp))	               # Apply str_c() function
 
temp_data2	                                                       # Print updated tibble

Here we used the str_c() function in combination with the mutate() function to create a new column temp wherein we added the prefix “Degrees Fahrenheit” to the rows in the temp column.

Exercise 7: Apply str_length() to determine the length of the strings in the newly concatenated column in temp_data2.

temp_data2 <- temp_data2 %>%		                               # Get length of strings 
  mutate(str_length = str_length(loc_type))		               # Apply str_length() function
 
temp_data2	                                                       # Print updated tibble

In the above example, we used the str_length() function in combination with the mutate() function to find the length of the strings in the rows of the column loc_type.

Then we stored the results in a new column “str_length”.

Exercise 8: Convert the loc column to uppercase using str_to_upper(), and the type column to lowercase using str_to_lower().

temp_data2 <- temp_data2 %>%		                               # Convert case
  mutate(loc = str_to_upper(loc),	                               # Apply str_to_upper() function
         type = str_to_lower(type))	                               # Apply str_to_lower() function
 
temp_data2	                                                       # Print updated tibble

In the code above, we changed the case of the strings in both the loc and type columns to uppercase and lowercase respectively using the str_to_uppercase() and str_to_lowercase() functions in combination with the mutate() function.

Exercise 9: Extract the first 5 characters from the loc column using str_sub() and store them in a new column.

temp_data2 <- temp_data2 %>%		                               # Extract first 5 characters 
  mutate(loc_chr = str_sub(loc,1,6))	                               # Apply str_sub() function
 
temp_data2	                                                       # Print updated tibble

Here we used the str_sub() function in combination with the mutate() function to extract the first 5 characters in the loc column and store it in a new column loc_chr. The first integer in the str_sub() function refers to the first character position, the second integer refers to the last character position. Please be aware that there are blank characters " " within the string.

Solutions to these exercises were created in collaboration with Ifeanyi Idiaye and Cansu Kebabci. Thanks to them for their contribution!

Further Resources

You can access the course overview page, timetable, and table of contents by clicking here.