Module 20 – Advanced Data Manipulation Project – Pt. 2
In Module 20, you’ll continue to build on the skills and techniques you’ve developed in the course. This module further immerses you in a realistic data analysis project, enhancing your ability to handle complex data manipulation tasks using the tidyverse package in R.
The module’s focus remains on the exercises. The video lecture provides a foundational overview, but the main emphasis is on you applying your skills to independently analyze and manipulate the data. This part of the project involves importing and inspecting multiple CSV files corresponding to different quarters of a year, each containing unique data sets.
As with the previous module, you’re encouraged to explore and find solutions on your own, simulating a real-world work environment where detailed instructions aren’t always available. This approach helps in refining your problem-solving and research skills, which are vital in any data analysis role.
However, remember that support is still available. If you encounter challenges or have questions, don’t hesitate to ask for help in the LinkedIn chat group.
Video Lecture
Exercises
As in the Pt. 1 of the Advanced Data Manipulation Project, some parts of the following exercises might go beyond the previously discussed topics of the course. Please let us know in the LinkedIn group in case you get stuck.
You may download the data sets used in this module here (see data attribution below):
Here are the exercises:
- Import the four quarterly data sets
allhlcn221.csv
,allhlcn222.csv
,allhlcn223.csv
, andallhlcn224.csv
. - Keep only the rows where
Area
is equal to “U.S. TOTAL”, andIndustry
is unequal to “10 Total, all industries”, “101 Goods-producing”, and “102 Service-providing”. - Merge the quarterly data into a single data set using a full join.
- Rename the monthly employment columns to month abbreviations (e.g. “Jan” = “January Employment”).
- Keep only
Industry
and the month columns. - Transform the data to long format. The long data should contain the columns
Industry
,name
, andvalue
. - Convert the
name
column to the factor class and specify the ordering of the factor levels. Hint: You may use thefactor()
function and thelevels
argument. - Draw a ggplot2 line plot with the month names on the x-axis and the employment values on the y-axis. Group and color the lines by
Industry
. Set thelinewidth
argument within thegeom_line()
function to 2. - Which industry employs the most people?
- What do you notice regarding the “Leisure and hospitality” industry? Hint: Check the variance of the line.
- Draw a ggplot2 boxplot with the employment values on the y-axis and a fill color based on
Industry
. Remove the x-axis text and ticks. - Is your observation regarding the “Leisure and hospitality”
Industry
confirmed? - Group your data based on
Industry
and calculate the grouped variance. Sort the output descendingly.
The solutions to these exercises can be found in a video at the bottom of this page.
R Code of This Lecture
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/" # Specify directory path tib_q1 <- read_csv2(str_c(my_path, # Import Q1 CSV file "allhlcn221.csv")) tib_q1 # Print Q1 tibble tib_q2 <- read_csv2(str_c(my_path, # Import Q2 CSV file "allhlcn222.csv")) tib_q2 # Print Q2 tibble tib_q3 <- read_csv2(str_c(my_path, # Import Q3 CSV file "allhlcn223.csv")) tib_q3 # Print Q3 tibble tib_q4 <- read_csv2(str_c(my_path, # Import Q4 CSV file "allhlcn224.csv")) tib_q4 # Print Q4 tibble View(tib_q1) # Inspect entire Q1 data View(tib_q2) # Inspect entire Q2 data
Exercise Solutions Video
Exercise Solutions Code
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/" # Specify directory path tib_q1 <- read_csv2(str_c(my_path, # Import Q1 CSV file "allhlcn221.csv")) tib_q1 # Print Q1 tibble tib_q2 <- read_csv2(str_c(my_path, # Import Q2 CSV file "allhlcn222.csv")) tib_q2 # Print Q2 tibble tib_q3 <- read_csv2(str_c(my_path, # Import Q3 CSV file "allhlcn223.csv")) tib_q3 # Print Q3 tibble tib_q4 <- read_csv2(str_c(my_path, # Import Q4 CSV file "allhlcn224.csv")) tib_q4 # Print Q4 tibble tib_q1 <- tib_q1 %>% # Filter rows of Q1 tibble filter(Area == "U.S. TOTAL", Industry != "10 Total, all industries", Industry != "101 Goods-producing", Industry != "102 Service-providing") tib_q1 # Print updated Q1 tibble tib_q2 <- tib_q2 %>% # Filter rows of Q1 tibble filter(Area == "U.S. TOTAL", Industry != "10 Total, all industries", Industry != "101 Goods-producing", Industry != "102 Service-providing") tib_q2 # Print updated Q1 tibble tib_q3 <- tib_q3 %>% # Filter rows of Q1 tibble filter(Area == "U.S. TOTAL", Industry != "10 Total, all industries", Industry != "101 Goods-producing", Industry != "102 Service-providing") tib_q3 # Print updated Q1 tibble tib_q4 <- tib_q4 %>% # Filter rows of Q1 tibble filter(Area == "U.S. TOTAL", Industry != "10 Total, all industries", Industry != "101 Goods-producing", Industry != "102 Service-providing") tib_q4 # Print updated Q1 tibble tib_all <- tib_q1 %>% # Join quarterly tibbles full_join(tib_q2, by = "Industry") %>% full_join(tib_q3, by = "Industry") %>% full_join(tib_q4, by = "Industry") tib_all # Print combined tibble tib_all <- tib_all %>% # Rename month columns rename("Jan" = "January Employment", "Feb" = "February Employment", "Mar" = "March Employment", "Apr" = "April Employment", "May" = "May Employment", "Jun" = "June Employment", "Jul" = "July Employment", "Aug" = "August Employment", "Sep" = "September Employment", "Oct" = "October Employment", "Nov" = "November Employment", "Dec" = "December Employment") tib_all # Print updated tibble tib_all <- tib_all %>% # Keep only certain columns select(Industry, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec) tib_all # Print updated tibble tib_all_long <- tib_all %>% # Transform to long format pivot_longer(colnames(tib_all)[- 1]) tib_all_long # Print long tibble tib_all_long <- tib_all_long %>% # Convert to factor & set ordering mutate(name = factor(name, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))) tib_all_long # Print updated tibble tib_all_long %>% # Draw ggplot2 line plot ggplot(aes(x = name, y = value, group = Industry, col = Industry)) + geom_line(linewidth = 2) tib_all_long %>% # Draw ggplot2 boxplot ggplot(aes(y = value, fill = Industry)) + geom_boxplot() + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) tib_all_long %>% # Calculate grouped variance group_by(Industry) %>% summarize(Industry_Variance = var(value)) %>% arrange(desc(Industry_Variance))
Data Attribution
This module utilizes data obtained from the U.S. Bureau of Labor Statistics (BLS). The data sets allhlcn221.csv
, allhlcn222.csv
, allhlcn223.csv
, and allhlcn224.csv
, central to our exercises, are sourced from BLS’s extensive data repositories, specifically tailored to provide insights into employment statistics. We acknowledge the U.S. Bureau of Labor Statistics as the primary source of these data sets, which significantly contribute to the educational value of our course. For more detailed data and additional resources, please visit the U.S. Bureau of Labor Statistics website.
.
You can access the course overview page, timetable, and table of contents by clicking here.