Module 17 – Handling Character Strings Using dplyr & stringr
Mastery of string manipulation is essential for anyone working with text data in R! Module 17 focuses on handling character strings in R using dplyr and stringr, crucial packages in the tidyverse. The video lecture in this module presents detailed instructions on manipulating and processing character strings in data cleaning and preparation. The R code examples illustrate a range of string operations such as concatenation, pattern detection, string replacement, length calculation, and conversion between lower and upper case. Practical exercises are included at the end of the module to reinforce these techniques.
Video Lecture
Exercises
In this and the next module, we’ll use the same raw data from the National Centers for Environmental Information that we’ve already used in Module 16 (data attribution). However, since we’ll focus on different topics, we’ll prepare the exercise data differently (see below).
In case you’d like to skip the data preparation for these exercises, you may download the final data set here.
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/" # Specify directory path tib_la <- read_csv(str_c(my_path, # Import Los Angeles CSV file "72287493134.csv")) tib_la # Print tibble tib_ny <- read_csv(str_c(my_path, # Import New York CSV file "72504094702.csv")) tib_ny # Print tibble tib_la_sub <- tib_la %>% # Create LA tibble subset select(DATE, # Select columns NAME, REPORT_TYPE, HourlyDryBulbTemperature) %>% rename(date = DATE, # Rename columns loc = NAME, type = REPORT_TYPE, temp = HourlyDryBulbTemperature) tib_la_sub # Print LA tibble subset tib_ny_sub <- tib_ny %>% # Create NY tibble subset select(DATE, # Select columns NAME, REPORT_TYPE, HourlyDryBulbTemperature) %>% rename(date = DATE, # Rename columns loc = NAME, type = REPORT_TYPE, temp = HourlyDryBulbTemperature) tib_ny_sub # Print NY tibble subset tib_br <- bind_rows(tib_la_sub, # Combine LA & NY data by rows tib_ny_sub) %>% filter(row_number() %% 10 == 0) # Reduce size of data tib_br # Print final data set tib_br %>% # Export CSV file write_csv(str_c(my_path, "temp-data-bind-rows.csv"))
Data preparation done! Let’s move on to the exercises:
- Import the temperature data set
temp-data-bind-rows.csv
into R and store it in a tibble namedtemp_data2
. - Create a new column in
temp_data2
usingstr_detect()
to identify rows where theloc
column contains “DOWNTOWN”. - Use the new column to calculate the percentage of rows containing the word “DOWNTOWN”.
- Utilize
str_replace()
to change “LOS ANGELES” to “LA” in theloc
column oftemp_data2
. - Concatenate the
loc
andtype
columns usingstr_c()
, adding a separator, and save this in a new column. - Add the suffix “Degrees Fahrenheit” to the values in the
temp
column. - Apply
str_length()
to determine the length of the strings in the newly concatenated column intemp_data2
. - Convert the
loc
column to uppercase usingstr_to_upper()
, and thetype
column to lowercase usingstr_to_lower()
. - Extract the first 5 characters from the
loc
column usingstr_sub()
and store them in a new column.
The solutions to these exercises can be found at the bottom of this page.
R Code of This Lecture
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages set.seed(172905) # Ensure reproducibility tib_str <- tibble(x = replicate( # Create tibble 10, # Replicate 10 times str_c( # Combine characters sample( # Apply sample() c(letters, LETTERS), # Sample from all letters sample(1:10, 1)), # Sample up to 10 letters collapse = ""))) # Combine without space tib_str # Print tibble tib_str <- tib_str %>% # Concatenate strings mutate(str_c = str_c("ID_", x)) # Apply str_c() tib_str # Print updated tibble tib_str <- tib_str %>% # Detect a pattern mutate(str_detect = str_detect(x, "e")) # Apply str_detect() tib_str # Print updated tibble tib_str <- tib_str %>% # Replace a pattern mutate(str_replace = str_replace(x, "e", "99")) # Apply str_replace() tib_str # Print updated tibble tib_str <- tib_str %>% # Get length of strings mutate(str_length = str_length(x)) # Apply str_length() tib_str # Print updated tibble tib_str <- tib_str %>% # Convert to lowercase mutate(str_to_lower = str_to_lower(x)) # Apply str_to_lower() tib_str # Print updated tibble tib_str <- tib_str %>% # Convert to uppercase mutate(str_to_upper = str_to_upper(x)) # Apply str_to_upper() tib_str # Print updated tibble tib_str <- tib_str %>% # All functions in one pipe mutate(str_c = str_c("ID_", x)) %>% # Apply str_c() mutate(str_detect = str_detect(x, "e")) %>% # Apply str_detect() mutate(str_replace = str_replace(x, "e", "99")) %>% # Apply str_replace() mutate(str_length = str_length(x)) %>% # Apply str_length() mutate(str_to_lower = str_to_lower(x)) %>% # Apply str_to_lower() mutate(str_to_upper = str_to_upper(x)) # Apply str_to_upper() tib_str # Print updated tibble tib_str <- tib_str %>% # All functions in mutate() mutate(str_c = str_c("ID_", x), # Apply str_c() str_detect = str_detect(x, "e"), # Apply str_detect() str_replace = str_replace(x, "e", "99"), # Apply str_replace() str_length = str_length(x), # Apply str_length() str_to_lower = str_to_lower(x), # Apply str_to_lower() str_to_upper = str_to_upper(x)) # Apply str_to_upper() tib_str # Print updated tibble
Exercise Solutions
Below, you can find our solutions for the exercises of this module. Before beginning the exercises, we will install and load the tidyverse
package.
install.packages("tidyverse") # Install tidyverse packages library(tidyverse) # Load tidyverse packages
With the tidyverse
packages installed and loaded, we can now proceed to the exercise solutions.
Exercise 1: Import the temperature data set temp-data-bind-rows.csv
into R and store it in a tibble named temp_data2
.
my_path = "path to CSV file on your computer" # Specify directory path temp_data2 <- read_csv(str_c(my_path,"temp-data-bind-rows.csv")) # Import temperature CSV file temp_data2 # Print tibble
Above, we imported the CSV file using read_csv()
in combination with str_c()
.
Exercise 2: Create a new column in temp_data2
using str_detect()
to identify rows where the loc
column contains “DOWNTOWN”.
temp_data2 <- temp_data2 %>% # Detect a pattern mutate(str_detect = str_detect(loc,"DOWNTOWN")) # Apply str_detect() temp_data2 # Print updated tibble
In the above example, we used the str_detect()
function in combination with the mutate()
function to create a new column str_detect
in the tibble which identifies the rows in loc
that contain the word “DOWNTOWN”. If “DOWNTOWN” exists in the row, then it returns TRUE
in the column.
Exercise 3: Use the new column to calculate the percentage of rows containing the word “DOWNTOWN”.
percentage_rows <- temp_data2 %>% # Define tibble filter(str_detect == "TRUE") %>% # Filter tibble by logical constant count(str_detect) %>% # Count number of rows containing logical constant mutate(percentage = (n/nrow(temp_data2) * 100)) # Calculate percentage percentage_rows # Print tibble
Here we first used the filter()
function to filter the tibble using the str_detect
column, then we used the count()
function to count the number of rows containing TRUE
values.
After that, we created a new column using the mutate()
function, where we divided the count by the number of rows in the tibble to get the percentage of rows containing the word “DOWNTOWN”.
Exercise 4: Utilize str_replace()
to change “LOS ANGELES” to “LA” in the loc
column of temp_data2
.
temp_data2 <- temp_data2 %>% # Replace a pattern mutate(loc = str_replace(loc,"LOS ANGELES","LA")) # Apply str_replace() temp_data2 # Print updated tibble
In the example above, we use the str_replace()
function in combination with the mutate()
function to find and replace “LOS ANGELES” with “LA” in the column loc
.
Exercise 5: Concatenate the loc
and type
columns using str_c()
, adding a separator, and save this in a new column.
temp_data2 <- temp_data2 %>% # Concatenate strings mutate(loc_type = str_c(loc,type,sep = "+")) # Apply str_c() temp_data2 # Print updated tibble
Above, we used the str_c()
function in combination with the mutate()
function to concatenate the rows of the loc
and type
columns, separated by the +
sign, and save that in a new column called loc_type
. Note: Using another separator would also be fine.
Exercise 6: Add the suffix “Degrees Fahrenheit” to the values in the temp
column.
temp_data2 <- temp_data2 %>% # Concatenate strings mutate(temp = str_c("Degrees Fahrenheit ",temp)) # Apply str_c() function temp_data2 # Print updated tibble
Here we used the str_c()
function in combination with the mutate()
function to create a new column temp
wherein we added the prefix “Degrees Fahrenheit” to the rows in the temp
column.
Exercise 7: Apply str_length()
to determine the length of the strings in the newly concatenated column in temp_data2
.
temp_data2 <- temp_data2 %>% # Get length of strings mutate(str_length = str_length(loc_type)) # Apply str_length() function temp_data2 # Print updated tibble
In the above example, we used the str_length()
function in combination with the mutate()
function to find the length of the strings in the rows of the column loc_type
.
Then we stored the results in a new column “str_length”.
Exercise 8: Convert the loc
column to uppercase using str_to_upper()
, and the type
column to lowercase using str_to_lower()
.
temp_data2 <- temp_data2 %>% # Convert case mutate(loc = str_to_upper(loc), # Apply str_to_upper() function type = str_to_lower(type)) # Apply str_to_lower() function temp_data2 # Print updated tibble
In the code above, we changed the case of the strings in both the loc
and type
columns to uppercase and lowercase respectively using the str_to_uppercase()
and str_to_lowercase()
functions in combination with the mutate()
function.
Exercise 9: Extract the first 5 characters from the loc
column using str_sub()
and store them in a new column.
temp_data2 <- temp_data2 %>% # Extract first 5 characters mutate(loc_chr = str_sub(loc,1,6)) # Apply str_sub() function temp_data2 # Print updated tibble
Here we used the str_sub()
function in combination with the mutate()
function to extract the first 5 characters in the loc
column and store it in a new column loc_chr
. The first integer in the str_sub()
function refers to the first character position, the second integer refers to the last character position. Please be aware that there are blank characters " "
within the string.
Solutions to these exercises were created in collaboration with Ifeanyi Idiaye and Cansu Kebabci. Thanks to them for their contribution!
Further Resources
- Statistics Globe Article – Test if Character is in String in R
- Statistics Globe Article – Select Rows with Partial String Match in R
- Statistics Globe Article – Find Elements in String Vector that Contain Certain Character in R
- Statistics Globe Article – Find Character Pattern in Data Frame Column in R
- Lennert F. (2022) – String manipulation with stringr and regular expressions / An(other) introduction to R
- stringr Documentation – stringr
.
You can access the course overview page, timetable, and table of contents by clicking here.