Module 19 – Advanced Data Manipulation Project – Pt. 1

In Module 19, we focus primarily on the exercises, crafted to mimic real-life data analysis scenarios. While the video lecture serves as a starting point, it provides only foundational guidance, encouraging you to independently explore and tackle the complex tasks at hand using the tidyverse packages in R.

This approach is designed to simulate a realistic work environment, where detailed instructions are often scarce, and your problem-solving skills are crucial. You’re encouraged to actively search for solutions online, leveraging the vast array of resources available to enhance your research and troubleshooting abilities.

Remember, this module is about applying what you’ve learned in a hands-on manner, but you’re not alone in this journey. If you find yourself stuck, feel free to ask questions in the LinkedIn chat group. This project provides an opportunity to solidify your data manipulation skills and prepare yourself for the kinds of challenges you’ll encounter in real-life.

Video Lecture

Exercises

You may download the data set used in this module here (see data attribution below).

Please note, some parts of the following exercises go beyond the previously discussed topics of the course. This is on purpose to mimic a real-life scenario. Please try to find solutions online. However, in case you get stuck, please let us know in the LinkedIn group.

Import the data set allhlcn22.csv.
Keep only rows on state level. Hint: The column name Area Type must be enclosed in backticks (i.e. `Area Type`).
Remove rows where the Annual Average Employment is 0.
Remove summarizing Industry rows (i.e. “10 Total, all industries”, “101 Goods-producing”, “102 Service-providing”).
Keep only the columns St Name, Industry, Annual Average Employment, Annual Total Wages, Annual Average Pay.
Rename the following columns:

St Name to State
Annual Average Employment to Employment
Annual Total Wages to Wages_Total
Annual Average Pay to Pay_Average

Export this data set to an external CSV file.
Import this data set to a new RStudio session. Hint: Your new data set should have 549 rows and 5 columns.
Group your data by Industry and calculate the sums of the grouped Wages_Total and Employment.
Create a new variable called Pay_Industry_Average by calculating grouped Wages_Total / grouped Employment.
Order the grouped data descendingly based on this new variable. Which Industry has the highest and which the lowest Pay_Industry_Average?
Draw a ggplot2 barplot with the following features:

Industry should be on the x-axis.
Pay_Industry_Average should be represented by the y-axis.
The bars should be ordered decreasingly.
The x-axis text should be vertically aligned.
The x-axis label should be Industry.
The y-axis label should be Average Yearly Pay.

Estimate a linear regression model with the formula Pay_Average ~ Industry + Employment and print its summary. Hint: You may use the lm() and summary() functions.
Estimate a linear regression model with the extended formula Pay_Average ~ Industry + State + Employment and print its summary. What’s the difference of this and the previous regression model regarding the predictor variable Employment?

The solutions to these exercises can be found in a video at the bottom of this page.

R Code of This Lecture

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
tib_emp <- read_csv2(str_c(my_path,               # Import CSV file
                           "allhlcn22.csv"))
tib_emp                                           # Print tibble
 
View(tib_emp)                                     # View entire tibble
 
tib_emp %>%                                       # Calculate Annual Average Pay 
  select("Annual Total Wages") %>%
  slice(1) %>%
  pull() /
  tib_emp %>%
  select("Annual Average Employment") %>%
  slice(1) %>%
  pull()
 
tib_emp %>%                                       # Compare calculation with data
  select("Annual Total Wages") %>%
  slice(1) %>%
  pull() ==
  tib_emp %>%
  select("Annual Total Wages") %>%
  slice(2:5) %>%
  pull() %>% 
  sum()
 
tib_emp %>%                                       # Compare sub-categories
  select("Annual Total Wages") %>%
  slice(6) %>%
  pull() ==
  tib_emp %>%
  select("Annual Total Wages") %>%
  slice(7:9) %>%
  pull() %>% 
  sum()

Exercise Solutions Video

Exercise Solutions Code

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
tib_emp <- read_csv2(str_c(my_path,               # Import CSV file
                           "allhlcn22.csv"))
tib_emp                                           # Print tibble
 
tib_emp <- tib_emp %>%                            # Keep only State rows
  filter(`Area Type` == "State")
tib_emp                                           # Print updated tibble
 
tib_emp <- tib_emp %>%                            # Remove 0 Annual Average Employment
  filter(`Annual Average Employment` != 0)
tib_emp                                           # Print updated tibble
 
tib_emp %>%                                       # Vector of industries
  pull(Industry) %>% 
  unique()
 
tib_emp <- tib_emp %>%                            # Remove summarized industry values
  filter(Industry != "10 Total, all industries",
         Industry != "101 Goods-producing",
         Industry != "102 Service-providing")
tib_emp                                           # Print updated tibble
 
tib_emp <- tib_emp %>%                            # Alternative code
  filter(! Industry %in% c("10 Total, all industries",
                           "101 Goods-producing",
                           "102 Service-providing"))
tib_emp                                           # Print updated tibble
 
tib_emp %>%                                       # Vector of updated industries
  pull(Industry) %>% 
  unique()
 
tib_emp <- tib_emp %>%                            # Remove certain columns
  select("St Name",
         "Industry",
         "Annual Average Employment",
         "Annual Total Wages",
         "Annual Average Pay")
tib_emp                                           # Print updated tibble
 
tib_emp <- tib_emp %>%                            # Rename columns
  rename(State = "St Name",
         Employment = "Annual Average Employment",
         Wages_Total = "Annual Total Wages",
         Pay_Average = "Annual Average Pay")
tib_emp                                           # Print updated tibble
 
write_csv(tib_emp, str_c(my_path,                 # Export CSV file
                         "project_data_all.csv"))
 
tib_emp_imp <- read_csv(str_c(my_path,            # Import CSV file
                              "project_data_all.csv"))
tib_emp_imp                                       # Print tibble
 
tib_emp_grouped <- tib_emp_imp %>%                # Group data by Industry
  group_by(Industry) %>% 
  summarize(Wages_Total_Sum = sum(Wages_Total),
            Employment_Sum = sum(Employment))
tib_emp_grouped                                   # Print grouped tibble
 
tib_emp_grouped2 <- tib_emp_grouped %>%           # Calculate average Pay_Industry
  mutate(Pay_Industry_Average = Wages_Total_Sum / Employment_Sum)
tib_emp_grouped2                                  # Print updated tibble
 
tib_emp_grouped3 <- tib_emp_grouped2 %>%          # Order grouped tibble
  arrange(desc(Pay_Industry_Average))
tib_emp_grouped3                                  # Print updated tibble
 
tib_emp_orig <- read_csv2(str_c(my_path,          # Import original CSV file
                                "allhlcn22.csv"))
tib_emp_orig                                      # Print original tibble
 
View(tib_emp_orig)                                # View original tibble
 
tib_emp_grouped3 %>%                              # Draw ggplot2 barplot
  ggplot(aes(x = reorder(Industry,                # Reorder bars
                         Pay_Industry_Average,
                         decreasing = TRUE),
             y = Pay_Industry_Average)) +
  geom_col() +                                    # Specify plot type
  theme(axis.text.x = element_text(angle = 90,    # Vertical x-axis text
                                   hjust = 1,
                                   vjust = 0.3)) +
  xlab("Industry") +                              # Modify x-axis label
  ylab(("Average Yearly Pay"))                    # Modify y-axis label
 
lm(Pay_Average ~ Industry + Employment,           # Estimate regression model
   tib_emp_imp) %>% 
  summary()
 
lm(Pay_Average ~ Industry + State + Employment,   # Extended regression model
   tib_emp_imp) %>% 
  summary()

Data Attribution

This module utilizes data obtained from the U.S. Bureau of Labor Statistics (BLS). The data set allhlcn22.csv, central to our exercises, is sourced from BLS’s extensive data repositories, specifically tailored to provide insights into employment statistics. We acknowledge the U.S. Bureau of Labor Statistics as the primary source of this data, which significantly contributes to the educational value of our course. For more detailed data and additional resources, please visit the U.S. Bureau of Labor Statistics website.

You can access the course overview page, timetable, and table of contents by clicking here.