Module 19 – Advanced Data Manipulation Project – Pt. 1
In Module 19, we focus primarily on the exercises, crafted to mimic real-life data analysis scenarios. While the video lecture serves as a starting point, it provides only foundational guidance, encouraging you to independently explore and tackle the complex tasks at hand using the tidyverse packages in R.
This approach is designed to simulate a realistic work environment, where detailed instructions are often scarce, and your problem-solving skills are crucial. You’re encouraged to actively search for solutions online, leveraging the vast array of resources available to enhance your research and troubleshooting abilities.
Remember, this module is about applying what you’ve learned in a hands-on manner, but you’re not alone in this journey. If you find yourself stuck, feel free to ask questions in the LinkedIn chat group. This project provides an opportunity to solidify your data manipulation skills and prepare yourself for the kinds of challenges you’ll encounter in real-life.
Video Lecture
Exercises
You may download the data set used in this module here (see data attribution below).
Please note, some parts of the following exercises go beyond the previously discussed topics of the course. This is on purpose to mimic a real-life scenario. Please try to find solutions online. However, in case you get stuck, please let us know in the LinkedIn group.
- Import the data set
allhlcn22.csv
. - Keep only rows on state level. Hint: The column name
Area Type
must be enclosed in backticks (i.e. `Area Type`). - Remove rows where the
Annual Average Employment
is 0. - Remove summarizing
Industry
rows (i.e. “10 Total, all industries”, “101 Goods-producing”, “102 Service-providing”). - Keep only the columns
St Name
,Industry
,Annual Average Employment
,Annual Total Wages
,Annual Average Pay
. - Rename the following columns:
St Name
toState
Annual Average Employment
toEmployment
Annual Total Wages
toWages_Total
Annual Average Pay
toPay_Average
- Export this data set to an external CSV file.
- Import this data set to a new RStudio session. Hint: Your new data set should have 549 rows and 5 columns.
- Group your data by
Industry
and calculate the sums of the groupedWages_Total
andEmployment
. - Create a new variable called
Pay_Industry_Average
by calculating groupedWages_Total
/ groupedEmployment
. - Order the grouped data descendingly based on this new variable. Which
Industry
has the highest and which the lowestPay_Industry_Average
? - Draw a ggplot2 barplot with the following features:
Industry
should be on the x-axis.Pay_Industry_Average
should be represented by the y-axis.- The bars should be ordered decreasingly.
- The x-axis text should be vertically aligned.
- The x-axis label should be
Industry
. - The y-axis label should be
Average Yearly Pay
. - Estimate a linear regression model with the formula
Pay_Average ~ Industry + Employment
and print its summary. Hint: You may use thelm()
andsummary()
functions. - Estimate a linear regression model with the extended formula
Pay_Average ~ Industry + State + Employment
and print its summary. What’s the difference of this and the previous regression model regarding the predictor variableEmployment
?
The solutions to these exercises can be found in a video at the bottom of this page.
R Code of This Lecture
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/" # Specify directory path tib_emp <- read_csv2(str_c(my_path, # Import CSV file "allhlcn22.csv")) tib_emp # Print tibble View(tib_emp) # View entire tibble tib_emp %>% # Calculate Annual Average Pay select("Annual Total Wages") %>% slice(1) %>% pull() / tib_emp %>% select("Annual Average Employment") %>% slice(1) %>% pull() tib_emp %>% # Compare calculation with data select("Annual Total Wages") %>% slice(1) %>% pull() == tib_emp %>% select("Annual Total Wages") %>% slice(2:5) %>% pull() %>% sum() tib_emp %>% # Compare sub-categories select("Annual Total Wages") %>% slice(6) %>% pull() == tib_emp %>% select("Annual Total Wages") %>% slice(7:9) %>% pull() %>% sum()
Exercise Solutions Video
Exercise Solutions Code
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/" # Specify directory path tib_emp <- read_csv2(str_c(my_path, # Import CSV file "allhlcn22.csv")) tib_emp # Print tibble tib_emp <- tib_emp %>% # Keep only State rows filter(`Area Type` == "State") tib_emp # Print updated tibble tib_emp <- tib_emp %>% # Remove 0 Annual Average Employment filter(`Annual Average Employment` != 0) tib_emp # Print updated tibble tib_emp %>% # Vector of industries pull(Industry) %>% unique() tib_emp <- tib_emp %>% # Remove summarized industry values filter(Industry != "10 Total, all industries", Industry != "101 Goods-producing", Industry != "102 Service-providing") tib_emp # Print updated tibble tib_emp <- tib_emp %>% # Alternative code filter(! Industry %in% c("10 Total, all industries", "101 Goods-producing", "102 Service-providing")) tib_emp # Print updated tibble tib_emp %>% # Vector of updated industries pull(Industry) %>% unique() tib_emp <- tib_emp %>% # Remove certain columns select("St Name", "Industry", "Annual Average Employment", "Annual Total Wages", "Annual Average Pay") tib_emp # Print updated tibble tib_emp <- tib_emp %>% # Rename columns rename(State = "St Name", Employment = "Annual Average Employment", Wages_Total = "Annual Total Wages", Pay_Average = "Annual Average Pay") tib_emp # Print updated tibble write_csv(tib_emp, str_c(my_path, # Export CSV file "project_data_all.csv")) tib_emp_imp <- read_csv(str_c(my_path, # Import CSV file "project_data_all.csv")) tib_emp_imp # Print tibble tib_emp_grouped <- tib_emp_imp %>% # Group data by Industry group_by(Industry) %>% summarize(Wages_Total_Sum = sum(Wages_Total), Employment_Sum = sum(Employment)) tib_emp_grouped # Print grouped tibble tib_emp_grouped2 <- tib_emp_grouped %>% # Calculate average Pay_Industry mutate(Pay_Industry_Average = Wages_Total_Sum / Employment_Sum) tib_emp_grouped2 # Print updated tibble tib_emp_grouped3 <- tib_emp_grouped2 %>% # Order grouped tibble arrange(desc(Pay_Industry_Average)) tib_emp_grouped3 # Print updated tibble tib_emp_orig <- read_csv2(str_c(my_path, # Import original CSV file "allhlcn22.csv")) tib_emp_orig # Print original tibble View(tib_emp_orig) # View original tibble tib_emp_grouped3 %>% # Draw ggplot2 barplot ggplot(aes(x = reorder(Industry, # Reorder bars Pay_Industry_Average, decreasing = TRUE), y = Pay_Industry_Average)) + geom_col() + # Specify plot type theme(axis.text.x = element_text(angle = 90, # Vertical x-axis text hjust = 1, vjust = 0.3)) + xlab("Industry") + # Modify x-axis label ylab(("Average Yearly Pay")) # Modify y-axis label lm(Pay_Average ~ Industry + Employment, # Estimate regression model tib_emp_imp) %>% summary() lm(Pay_Average ~ Industry + State + Employment, # Extended regression model tib_emp_imp) %>% summary()
Data Attribution
This module utilizes data obtained from the U.S. Bureau of Labor Statistics (BLS). The data set allhlcn22.csv
, central to our exercises, is sourced from BLS’s extensive data repositories, specifically tailored to provide insights into employment statistics. We acknowledge the U.S. Bureau of Labor Statistics as the primary source of this data, which significantly contributes to the educational value of our course. For more detailed data and additional resources, please visit the U.S. Bureau of Labor Statistics website.
.
You can access the course overview page, timetable, and table of contents by clicking here.