Module 5 – Sampling Methods [Course Preview]
Welcome to Module 5, where we explore Sampling Methods. In this module, we will discuss the key differences between populations and samples, and delve into various sampling techniques such as simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
We will also demonstrate how to apply these sampling methods in R, providing practical examples to reinforce your understanding of each approach.
Video Lectures
Presentation of This Lecture
You may download the presentation of this lecture by clicking here.
R Code
Please find the R code of this lecture below.
# install.packages("dplyr")                     # Install & load dplyr
library(dplyr)

data(starwars)                                  # Load starwars data set
head(starwars)                                  # Print head of starwars

set.seed(984362)                                # Set random seed

starwars_srswor <- starwars %>%                 # SRS without replacement
  sample_n(size = 5)
starwars_srswor                                 # Print sampled data

starwars_srswr <- starwars %>%                  # SRS with replacement
  sample_n(size = 5, replace = TRUE)
starwars_srswr                                  # Print sampled data

starwars_srswr2 <- starwars %>%                 # Larger sample than population
  sample_n(size = 200, replace = TRUE)
starwars_srswr2                                 # Print sampled data

mean(starwars$height, na.rm = TRUE)             # Means of population & sample
mean(starwars_srswr2$height, na.rm = TRUE)

starwars_syst <- starwars %>%                   # Systematic sampling
  slice(seq(sample(1:5, 1),                     # Random starting point
            nrow(starwars),                     # Total number of rows
            by = 5))                            # Sampling interval
starwars_syst                                   # Print sampled data

table(starwars$sex)                             # Frequency table of sex

starwars_strat <- starwars %>%                  # Stratified sampling
  group_by(sex) %>%
  sample_frac(size = 0.3)
starwars_strat                                  # Print sampled data

table(starwars_strat$sex)                       # Frequency table of sex

starwars_clust <- starwars %>%                  # Cluster sampling
  filter(homeworld %in% sample(unique(homeworld), size = 10))
starwars_clust                                  # Print sampled data
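In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(). The following is a minimal sketch of how the simple random and stratified samples from the lecture code might be written with slice_sample(). The object names ending in _v2 are made up for this illustration, and the stratum sizes may differ slightly from those of sample_frac(), because the two functions may treat fractional group sizes differently.

```r
library(dplyr)                                  # Assumes dplyr >= 1.0.0 is installed

data(starwars)                                  # Load starwars data set
set.seed(984362)                                # Set random seed

starwars_srswor_v2 <- starwars %>%              # SRS without replacement
  slice_sample(n = 5)

starwars_srswr_v2 <- starwars %>%               # SRS with replacement
  slice_sample(n = 5, replace = TRUE)

starwars_strat_v2 <- starwars %>%               # Stratified sampling by sex
  group_by(sex) %>%
  slice_sample(prop = 0.3) %>%
  ungroup()                                     # Drop grouping after sampling

table(starwars_strat_v2$sex)                    # Frequency table of sex
```

Both spellings can be used for the exercises; slice_sample() is simply the variant that newer dplyr documentation recommends.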
Exercises
Below are the exercises for this module. We’ll start with theoretical questions, followed by R programming exercises.
Theory Exercises
Here are the theoretical exercises. Keep in mind that each multiple-choice question has only one correct answer.
- What is the primary purpose of sampling in statistics?
  - A) To examine every member of the population.
  - B) To make inferences about the population without analyzing every individual.
  - C) To reduce data variability.
  - D) To eliminate bias in the data collection process.
- Which sampling method ensures every member of the population has an equal chance of being selected?
  - A) Stratified sampling
  - B) Systematic sampling
  - C) Cluster sampling
  - D) Simple random sampling
- In systematic sampling, how are individuals selected?
  - A) Based on random selection from each subgroup.
  - B) By selecting every k-th individual from a list.
  - C) By randomly selecting entire clusters of the population.
  - D) By selecting the most accessible individuals.
- Which of the following is a key disadvantage of cluster sampling?
  - A) It is cost-effective for geographically dispersed populations.
  - B) It may introduce higher variability if clusters are not homogeneous.
  - C) It requires detailed knowledge of the population to divide into strata.
  - D) It requires a complete list of the population.
- Which sampling method divides the population into homogeneous subgroups and selects samples from each?
  - A) Systematic sampling
  - B) Cluster sampling
  - C) Stratified sampling
  - D) Simple random sampling
R Exercises
In the R exercises of this module, we will rely on a synthetic data set that we have created ourselves. Find the code for the data creation below:
set.seed(35867)                                 # Set seed for reproducibility

my_synt <- data.frame(ID = 1:200,               # Create a self-defined data
                      Age = sample(18:65, 200, replace = TRUE),
                      Gender = sample(c("Male", "Female", "Other"), 200, replace = TRUE),
                      Income = sample(20000:100000, 200, replace = TRUE),
                      Department = sample(c("Sales", "HR", "IT", "Finance"), 200, replace = TRUE))

head(my_synt)                                   # Print first 6 rows
#   ID Age Gender Income Department
# 1  1  54  Other  37145         HR
# 2  2  45  Other  20272    Finance
# 3  3  50 Female  83288      Sales
# 4  4  35  Other  30455      Sales
# 5  5  65   Male  80422         IT
# 6  6  56   Male  31273         IT
Our data contains the following columns:
- ID: A unique identifier for each individual (from 1 to 200).
- Age: The age of the individual, randomly sampled between 18 and 65 years.
- Gender: The gender of the individual, randomly assigned as “Male,” “Female,” or “Other.”
- Income: The annual income of the individual, randomly sampled between 20,000 and 100,000 units.
- Department: The department to which the individual belongs, randomly chosen from “Sales,” “HR,” “IT,” and “Finance.”
This data will be used to demonstrate various sampling techniques throughout the module.
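Before drawing any samples, it can be useful to look at the structure of the data and at the sizes of the groups that the exercises later use as strata (Gender) and clusters (Department). The following is a minimal sketch using base R only. It repeats the data creation code from above so that it runs on its own, and the particular checks are just one reasonable selection.

```r
set.seed(35867)                                 # Same seed as in the data creation code

my_synt <- data.frame(ID = 1:200,
                      Age = sample(18:65, 200, replace = TRUE),
                      Gender = sample(c("Male", "Female", "Other"), 200, replace = TRUE),
                      Income = sample(20000:100000, 200, replace = TRUE),
                      Department = sample(c("Sales", "HR", "IT", "Finance"), 200, replace = TRUE))

str(my_synt)                                    # Column types and dimensions
summary(my_synt$Income)                         # Range and quartiles of Income
table(my_synt$Gender)                           # Group sizes by Gender
table(my_synt$Department)                       # Group sizes by Department
anyDuplicated(my_synt$ID)                       # 0 means that all IDs are unique
```

Because the group labels are drawn at random, the groups are not expected to be exactly equal in size, which is worth keeping in mind when comparing stratum or cluster sizes later on.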
Now, let’s move on to the exercises. Please find the R exercises for this module listed below.
- Perform simple random sampling without replacement, selecting 15 rows from your data set.
- Perform simple random sampling with replacement, selecting 25 rows from your data set.
- Calculate the mean income of all individuals in your data set, and compare it with the mean income of a random sample of size 50 taken with replacement.
- Implement systematic sampling on your data set. Select a random starting point and a sampling interval of 7.
- Group your data by the Gender column and perform stratified sampling, selecting 40% of the observations from each gender group.
- Perform cluster sampling on your data set. Randomly select 3 departments and filter for all observations that belong to those departments.
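While working on these exercises, it may help to sanity-check each sample you draw. The sketch below is not part of the course material: check_sample() is a hypothetical helper, and the toy data frame is made up for illustration. It assumes the data has a unique ID column, as my_synt does, and reports the sample size, whether any row was drawn more than once, and whether all sampled IDs exist in the population.

```r
# Hypothetical helper (an assumption, not course code): summarise a sample
# drawn from a data frame that has a unique ID column.
check_sample <- function(sample_df, population_df, id_col = "ID") {
  ids <- sample_df[[id_col]]
  list(
    n_rows            = nrow(sample_df),                      # Sample size
    has_duplicates    = anyDuplicated(ids) > 0,               # Only possible with replacement
    all_in_population = all(ids %in% population_df[[id_col]]) # IDs must come from the population
  )
}

# Toy population, for illustration only
toy <- data.frame(ID = 1:20,
                  Group = rep(c("a", "b"), each = 10))

set.seed(1)                                                   # Set random seed
toy_wor <- toy[sample(nrow(toy), 5), ]                        # Without replacement
toy_wr  <- toy[sample(nrow(toy), 30, replace = TRUE), ]       # With replacement

check_sample(toy_wor, toy)                                    # No duplicates expected
check_sample(toy_wr, toy)                                     # 30 draws from 20 rows must repeat
```

The same helper could be applied to samples drawn from my_synt, for example to confirm that a sample taken without replacement contains no repeated IDs.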
Exercise Solutions
The solutions to the exercises will be published on the Monday following the week in which this module was conducted.
Further Resources
- Sampling by Wikipedia
- Stratified sampling by Wikipedia
- Cluster sampling by Wikipedia
- Sampling Theory by Statistics Globe
- sample() Function in R by Statistics Globe
- Random Numbers in R by Statistics Globe
You can access the course overview page, timetable, and table of contents by clicking here.





