Module 20 – Advanced Data Manipulation Project – Pt. 2

In Module 20, you’ll continue to build on the skills and techniques you’ve developed in the course. This module further immerses you in a realistic data analysis project, enhancing your ability to handle complex data manipulation tasks using the tidyverse package in R.

The module’s focus remains on the exercises. The video lecture provides a foundational overview, but the main emphasis is on you applying your skills to independently analyze and manipulate the data. This part of the project involves importing and inspecting multiple CSV files corresponding to different quarters of a year, each containing unique data sets.

As with the previous module, you’re encouraged to explore and find solutions on your own, simulating a real-world work environment where detailed instructions aren’t always available. This approach helps in refining your problem-solving and research skills, which are vital in any data analysis role.

However, remember that support is still available. If you encounter challenges or have questions, don’t hesitate to ask for help in the LinkedIn chat group.

Video Lecture

Exercises

As in the Pt. 1 of the Advanced Data Manipulation Project, some parts of the following exercises might go beyond the previously discussed topics of the course. Please let us know in the LinkedIn group in case you get stuck.

You may download the data sets used in this module here (see data attribution below):

Here are the exercises:

Import the four quarterly data sets allhlcn221.csv, allhlcn222.csv, allhlcn223.csv, and allhlcn224.csv.
Keep only the rows where Area is equal to “U.S. TOTAL”, and Industry is unequal to “10 Total, all industries”, “101 Goods-producing”, and “102 Service-providing”.
Merge the quarterly data into a single data set using a full join.
Rename the monthly employment columns to month abbreviations (e.g. “Jan” = “January Employment”).
Keep only Industry and the month columns.
Transform the data to long format. The long data should contain the columns Industry, name, and value.
Convert the name column to the factor class and specify the ordering of the factor levels. Hint: You may use the factor() function and the levels argument.
Draw a ggplot2 line plot with the month names on the x-axis and the employment values on the y-axis. Group and color the lines by Industry. Set the linewidth argument within the geom_line() function to 2.
Which industry employs the most people?
What do you notice regarding the “Leisure and hospitality” industry? Hint: Check the variance of the line.
Draw a ggplot2 boxplot with the employment values on the y-axis and a fill color based on Industry. Remove the x-axis text and ticks.
Is your observation regarding the “Leisure and hospitality” Industry confirmed?
Group your data based on Industry and calculate the grouped variance. Sort the output descendingly.

The solutions to these exercises can be found in a video at the bottom of this page.

R Code of This Lecture

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
tib_q1 <- read_csv2(str_c(my_path,                # Import Q1 CSV file
                          "allhlcn221.csv"))
tib_q1                                            # Print Q1 tibble
 
tib_q2 <- read_csv2(str_c(my_path,                # Import Q2 CSV file
                          "allhlcn222.csv"))
tib_q2                                            # Print Q2 tibble
 
tib_q3 <- read_csv2(str_c(my_path,                # Import Q3 CSV file
                          "allhlcn223.csv"))
tib_q3                                            # Print Q3 tibble
 
tib_q4 <- read_csv2(str_c(my_path,                # Import Q4 CSV file
                          "allhlcn224.csv"))
tib_q4                                            # Print Q4 tibble
 
View(tib_q1)                                      # Inspect entire Q1 data
 
View(tib_q2)                                      # Inspect entire Q2 data

Exercise Solutions Video

Exercise Solutions Code

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
tib_q1 <- read_csv2(str_c(my_path,                # Import Q1 CSV file
                          "allhlcn221.csv"))
tib_q1                                            # Print Q1 tibble
 
tib_q2 <- read_csv2(str_c(my_path,                # Import Q2 CSV file
                          "allhlcn222.csv"))
tib_q2                                            # Print Q2 tibble
 
tib_q3 <- read_csv2(str_c(my_path,                # Import Q3 CSV file
                          "allhlcn223.csv"))
tib_q3                                            # Print Q3 tibble
 
tib_q4 <- read_csv2(str_c(my_path,                # Import Q4 CSV file
                          "allhlcn224.csv"))
tib_q4                                            # Print Q4 tibble
 
tib_q1 <- tib_q1 %>%                              # Filter rows of Q1 tibble
  filter(Area == "U.S. TOTAL",
         Industry != "10 Total, all industries",
         Industry != "101 Goods-producing",
         Industry != "102 Service-providing")
tib_q1                                            # Print updated Q1 tibble
 
tib_q2 <- tib_q2 %>%                              # Filter rows of Q1 tibble
  filter(Area == "U.S. TOTAL",
         Industry != "10 Total, all industries",
         Industry != "101 Goods-producing",
         Industry != "102 Service-providing")
tib_q2                                            # Print updated Q1 tibble
 
tib_q3 <- tib_q3 %>%                              # Filter rows of Q1 tibble
  filter(Area == "U.S. TOTAL",
         Industry != "10 Total, all industries",
         Industry != "101 Goods-producing",
         Industry != "102 Service-providing")
tib_q3                                            # Print updated Q1 tibble
 
tib_q4 <- tib_q4 %>%                              # Filter rows of Q1 tibble
  filter(Area == "U.S. TOTAL",
         Industry != "10 Total, all industries",
         Industry != "101 Goods-producing",
         Industry != "102 Service-providing")
tib_q4                                            # Print updated Q1 tibble
 
tib_all <- tib_q1 %>%                             # Join quarterly tibbles
  full_join(tib_q2, by = "Industry") %>% 
  full_join(tib_q3, by = "Industry") %>% 
  full_join(tib_q4, by = "Industry")
tib_all                                           # Print combined tibble
 
tib_all <- tib_all %>%                            # Rename month columns
  rename("Jan" = "January Employment",
         "Feb" = "February Employment",
         "Mar" = "March Employment",
         "Apr" = "April Employment",
         "May" = "May Employment",
         "Jun" = "June Employment",
         "Jul" = "July Employment",
         "Aug" = "August Employment",
         "Sep" = "September Employment",
         "Oct" = "October Employment",
         "Nov" = "November Employment",
         "Dec" = "December Employment")
tib_all                                           # Print updated tibble
 
tib_all <- tib_all %>%                            # Keep only certain columns
  select(Industry,
         Jan, Feb, Mar, Apr, May, Jun,
         Jul, Aug, Sep, Oct, Nov, Dec)
tib_all                                           # Print updated tibble
 
tib_all_long <- tib_all %>%                       # Transform to long format
  pivot_longer(colnames(tib_all)[- 1])
tib_all_long                                      # Print long tibble
 
tib_all_long <- tib_all_long %>%                  # Convert to factor & set ordering
  mutate(name = factor(name, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", 
                                        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
tib_all_long                                      # Print updated tibble
 
tib_all_long %>%                                  # Draw ggplot2 line plot
  ggplot(aes(x = name,
             y = value,
             group = Industry,
             col = Industry)) +
  geom_line(linewidth = 2)
 
tib_all_long %>%                                  # Draw ggplot2 boxplot
  ggplot(aes(y = value,
             fill = Industry)) +
  geom_boxplot() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
 
tib_all_long %>%                                  # Calculate grouped variance
  group_by(Industry) %>% 
  summarize(Industry_Variance = var(value)) %>% 
  arrange(desc(Industry_Variance))

Data Attribution

This module utilizes data obtained from the U.S. Bureau of Labor Statistics (BLS). The data sets allhlcn221.csv, allhlcn222.csv, allhlcn223.csv, and allhlcn224.csv, central to our exercises, are sourced from BLS’s extensive data repositories, specifically tailored to provide insights into employment statistics. We acknowledge the U.S. Bureau of Labor Statistics as the primary source of these data sets, which significantly contribute to the educational value of our course. For more detailed data and additional resources, please visit the U.S. Bureau of Labor Statistics website.

You can access the course overview page, timetable, and table of contents by clicking here.