Module 5 – Sampling Methods [Course Preview]

Welcome to Module 5, where we explore Sampling Methods. In this module, we will discuss the key differences between populations and samples, and delve into various sampling techniques such as simple random sampling, systematic sampling, stratified sampling, and cluster sampling.

We will also demonstrate how to apply these sampling methods in R, providing practical examples to reinforce your understanding of each approach.
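
To make the four techniques named above more concrete, the following base R lines sketch each of them on a small toy population. The data frame `pop`, its columns, the seed, and the sample sizes are hypothetical and chosen only for illustration; the lecture code further below uses dplyr and the starwars data instead.

```r
set.seed(1)                                   # Arbitrary seed for reproducibility

# Hypothetical toy population: 20 units in 4 groups of 5 units each
pop <- data.frame(id    = 1:20,
                  group = rep(c("a", "b", "c", "d"), each = 5))

# Simple random sampling: draw 4 rows, all rows equally likely
srs <- pop[sample(nrow(pop), 4), ]

# Systematic sampling: random start among the first k rows, then every k-th row
k     <- 5
start <- sample(1:k, 1)
syst  <- pop[seq(start, nrow(pop), by = k), ]

# Stratified sampling: draw 2 units within every group
strat_rows <- unlist(lapply(split(pop$id, pop$group), sample, size = 2))
strat      <- pop[strat_rows, ]

# Cluster sampling: draw 2 whole groups and keep all of their units
clust <- pop[pop$group %in% sample(unique(pop$group), 2), ]
```

Note the difference between the last two steps: the stratified sample contains a few units from every group, whereas the cluster sample contains every unit from a few groups.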

Video Lectures

Presentation of This Lecture

You may download the presentation of this lecture by clicking here.

R Code

Please find the R code of this lecture below.

# install.packages("dplyr")                               # Install & load dplyr
library(dplyr)
 
data(starwars)                                            # Load starwars data set
head(starwars)                                            # Print head of starwars
 
set.seed(984362)                                          # Set random seed
 
 
starwars_srswor <- starwars %>%                           # SRS without replacement
  sample_n(size = 5)
starwars_srswor                                           # Print sampled data
 
starwars_srswr <- starwars %>%                            # SRS with replacement
  sample_n(size = 5,
           replace = TRUE)
starwars_srswr                                            # Print sampled data
 
starwars_srswr2 <- starwars %>%                           # Larger sample than population
  sample_n(size = 200,
           replace = TRUE)
starwars_srswr2                                           # Print sampled data
 
mean(starwars$height, na.rm = TRUE)                       # Means of population & sample
mean(starwars_srswr2$height, na.rm = TRUE)
 
 
starwars_syst <- starwars %>%                             # Systematic sampling
  slice(seq(sample(1:5, 1),                               # Random starting point
            nrow(starwars),                               # Total number of rows
            by = 5))                                      # Sampling interval
starwars_syst                                             # Print sampled data
 
 
table(starwars$sex)                                       # Frequency table of sex
 
starwars_strat <- starwars %>%                            # Stratified sampling
  group_by(sex) %>%
  sample_frac(size = 0.3)
starwars_strat                                            # Print sampled data
 
table(starwars_strat$sex)                                 # Frequency table of sex
 
 
starwars_clust <- starwars %>%                            # Cluster sampling
  filter(homeworld %in% sample(unique(na.omit(homeworld)), # Do not treat NA as a cluster
                               size = 10))
starwars_clust                                            # Print sampled data
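
As a side note, `sample_n()` and `sample_frac()` are marked as superseded in dplyr 1.0.0 and later, and the dplyr documentation points to `slice_sample()` instead. The following is a minimal sketch of how some of the steps above might look with `slice_sample()`, assuming dplyr 1.0.0 or newer is installed. The object names with the prefix `sw_` are made up for this illustration, and the drawn rows will generally differ from those of the lecture code.

```r
library(dplyr)                                # Assumes dplyr >= 1.0.0

set.seed(984362)                              # Same seed value as above

sw_srswor <- starwars %>%                     # SRS without replacement
  slice_sample(n = 5)

sw_srswr <- starwars %>%                      # SRS with replacement
  slice_sample(n = 200,
               replace = TRUE)

sw_strat <- starwars %>%                      # Stratified sampling
  group_by(sex) %>%
  slice_sample(prop = 0.3) %>%                # Per-group size is rounded down
  ungroup()

table(sw_strat$sex)                           # Frequency table of sex
```

Because `slice_sample()` rounds the number of rows per group towards zero, very small strata can end up with no sampled rows, so the group sizes may differ slightly from those returned by `sample_frac()`.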

Exercises

Below are the exercises for this module. We’ll start with theoretical questions, followed by R programming exercises.

Theory Exercises

Here are the theoretical exercises. Keep in mind that each multiple-choice question has only one correct answer.

What is the primary purpose of sampling in statistics?

A) To examine every member of the population.
B) To make inferences about the population without analyzing every individual.
C) To reduce data variability.
D) To eliminate bias in the data collection process.

Which sampling method ensures that every member of the population, and every possible sample of a given size, has an equal chance of being selected?

A) Stratified sampling
B) Systematic sampling
C) Cluster sampling
D) Simple random sampling

In systematic sampling, how are individuals selected?

A) Based on random selection from each subgroup.
B) By selecting every k-th individual from a list.
C) By randomly selecting entire clusters of the population.
D) By selecting the most accessible individuals.

Which of the following is a key disadvantage of cluster sampling?

A) It is cost-effective for geographically dispersed populations.
B) It may increase sampling variability if the clusters differ strongly from one another.
C) It requires detailed knowledge of the population to divide it into strata.
D) It requires a complete list of the population.

Which sampling method divides the population into homogeneous subgroups and selects samples from each?

A) Systematic sampling
B) Cluster sampling
C) Stratified sampling
D) Simple random sampling

R Exercises

In the R exercises of this module, we will rely on a synthetic data set that we have created ourselves. Find the code for the data creation below:

set.seed(35867)                                            # Set seed for reproducibility
 
my_synt <- data.frame(ID = 1:200,                          # Create synthetic data set
  Age = sample(18:65, 200, replace = TRUE),
  Gender = sample(c("Male", "Female", "Other"), 200, replace = TRUE),
  Income = sample(20000:100000, 200, replace = TRUE),
  Department = sample(c("Sales", "HR", "IT", "Finance"), 200, replace = TRUE))
 
head(my_synt)                                              # Print first 6 rows
#   ID Age Gender Income Department
# 1  1  54  Other  37145         HR
# 2  2  45  Other  20272    Finance
# 3  3  50 Female  83288      Sales
# 4  4  35  Other  30455      Sales
# 5  5  65   Male  80422         IT
# 6  6  56   Male  31273         IT

Our data contains the following columns:

ID: A unique identifier for each individual (from 1 to 200).
Age: The age of the individual, randomly sampled between 18 and 65 years.
Gender: The gender of the individual, randomly assigned as “Male,” “Female,” or “Other.”
Income: The annual income of the individual, randomly sampled between 20,000 and 100,000 units.
Department: The department to which the individual belongs, randomly chosen from “Sales,” “HR,” “IT,” and “Finance.”

We will use this data set to practice the sampling techniques from the lecture.
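
Before drawing any samples, it can help to check the structure of the data and the sizes of the groups, because the group sizes determine how many rows a stratified or cluster sample can contain. The lines below are a minimal sketch that recreates the data with the code shown above; the object names `gender_counts`, `dept_counts`, and `income_mean` are made up for this illustration.

```r
set.seed(35867)                               # Same seed as in the data creation

my_synt <- data.frame(ID = 1:200,             # Recreate the synthetic data set
  Age = sample(18:65, 200, replace = TRUE),
  Gender = sample(c("Male", "Female", "Other"), 200, replace = TRUE),
  Income = sample(20000:100000, 200, replace = TRUE),
  Department = sample(c("Sales", "HR", "IT", "Finance"), 200, replace = TRUE))

str(my_synt)                                  # Column types and dimensions

gender_counts <- table(my_synt$Gender)        # Group sizes of potential strata
gender_counts

dept_counts <- table(my_synt$Department)      # Group sizes of potential clusters
dept_counts

income_mean <- mean(my_synt$Income)           # Population mean of Income
income_mean
```

The exact counts depend on the random draws, so they may vary between R versions even with the same seed.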

Now, let’s move on to the exercises. Please find the R exercises for this module listed below.

Perform simple random sampling without replacement, selecting 15 rows from your data set.
Perform simple random sampling with replacement, selecting 25 rows from your data set.
Calculate the mean income of all individuals in your data set, and compare it with the mean income of a random sample of size 50 taken with replacement.
Implement systematic sampling on your data set. Select a random starting point and a sampling interval of 7.
Group your data by the Gender column and perform stratified sampling, selecting 40% of the observations from each gender group.
Perform cluster sampling on your data set. Randomly select 3 departments and filter for all observations that belong to those departments.

Exercise Solutions

The solutions to the exercises will be published on the Monday following the week in which this module was conducted.

Further Resources

You can access the course overview page, timetable, and table of contents by clicking here.