Small Area Estimation in R (Example) | Unit-Level and Area-Level Model


Here, we show you how to calculate the two best known small area estimation (SAE) models in the R programming language. We calculate the basic Battese-Harter-Fuller (BHF) model, which is a unit-level model, and the Fay-Herriot (FH) model, which is an area-level model. For that, we use the functions in the sae package and the applications shown in the associated R Journal article.

This is our structure:

If you don’t know what terms like Battese-Harter-Fuller model mean, take a look at our article about Small Area Estimation Techniques first. When you are ready: Let’s do some small area estimation!


Battese-Harter-Fuller (BHF) Model

We illustrate the BHF model using county-level data on corn and soy bean production from the sae package and Example 4 in the associated R Journal article. It is also the application used in the original article by Battese, Harter, and Fuller (1988), after which the BHF model is named.

Let’s load the package.

# Install the package if needed
if (!require('sae', quietly = TRUE)) { install.packages('sae') } 
library('sae') # Load package 'sae'

We use two datasets for the model. Datasets cornsoybean and cornsoybeanmeans contain survey and satellite information on the corn and soy bean production in 37 sampled segments of 12 counties in the U.S. state Iowa.

With the data, we want to estimate the county means of corn crop hectares. In small area terminology: The 12 counties are the domains of interest, the county means of corn crop hectares are the parameters of interest.

To calculate the BHF model we need unit-level survey information on our variable of interest. In this application: The number of corn hectares per sampled segment.

Furthermore, we need auxiliary information. We are interested in domain means. The mean is a linear function in its input data. For linear functions, we need to know the domain sizes. In this application that means: We need to know how many segments there are per county. For all auxiliary variables, we need to have access to the domain means and unit-level information for each unit sampled in the survey.

We start with the survey data.

?cornsoybean # Documentation 
cornsoybean <- cornsoybean[-33, ] # Exclude an outlier 
str(cornsoybean) # Structure 
# 'data.frame':	36 obs. of  5 variables:
# $ County     : int  1 2 3 4 4 5 5 5 6 6 ...
# $ CornHec    : num  165.8 96.3 76.1 185.3 116.4 ...
# $ SoyBeansHec: num  8.09 106.03 103.6 6.47 63.82 ...
# $ CornPix    : int  374 209 253 432 367 361 288 369 206 316 ...
# $ SoyBeansPix: int  55 218 250 96 178 137 206 165 218 221 ...

The unit-level survey sample data cornsoybean consists of 37 observations. As in the original application, observation 33 is identified as an outlier and excluded. Each of the remaining 36 lines of data contains information on a sampled segment of a County. From the survey, we have information on variables CornHec and SoyBeansHec, the number of hectares with corn and soy beans in the sampled segments.

cornsoybean is a unit-level dataset as the segments are the sampling units of the survey. Variables CornPix and SoyBeansPix denote the number of corn and soy bean pixels per segment, gathered from additional satellite data.

To calculate the BHF model, we additionally need to know the number of segments per domain, it corresponds to the domain sizes. Furthermore, we need the domain means of the auxiliary variables. The information is available in dataset cornsoybeanmeans, gathered from satellite data.

?cornsoybeanmeans     # Documentation 
str(cornsoybeanmeans) # Structure 
# 'data.frame':	12 obs. of  6 variables:
# $ CountyIndex          : int  1 2 3 4 5 6 7 8 9 10 ...
# $ CountyName           : Factor w/ 12 levels "CerroGordo","Franklin",..: 1 3 11 6 2 8 10 12 9 4 ...
# $ SampSegments         : int  1 1 1 2 3 3 3 3 4 5 ...
# $ PopnSegments         : num  545 566 394 424 564 570 402 567 687 569 ...
# $ MeanCornPixPerSeg    : num  295 300 290 291 318 ...
# $ MeanSoyBeansPixPerSeg: num  190 197 205 220 188 ...

From cornsoybeanmeans we have the mean number of corn pixels MeanCornPixPerSeg and soy bean pixels MeanSoyBeansPixPerSeg per segment and the total number of segments PopnSegments for each of the 12 counties.

We have all the data we need. Let’s calculate the BHF model.

set.seed(123) # Seed for bootstrap iterations
?pbmseBHF # Documentation
mod_BHF <- pbmseBHF(formula  = CornHec ~ CornPix + SoyBeansPix,
                    dom      = County, 
                    meanxpop = cornsoybeanmeans[, c("CountyIndex", 
                    popnsize = cornsoybeanmeans[, c("CountyIndex", 
                    method   = "REML", 
                    data     = cornsoybean,
                    B        = 200)
# formula:  CornHec is the dependent variable, 
#           CornPix and SoyBeansPix are the auxiliary variables.
# dom:      Variable County identifies the domains of interest.
# meanxpop: County means of corn and soybean pixels per segment.
# popnsize: Domain sizes, the number of segments per county.
# method:   Method used for estimating the model parameters, 
#           REML is restricted maximum likelihood.
# B:        Number of bootstrap replicates. Default is 50.

The BHF model returns a list. With $est$eblup, we can extract the Empirical Best Linear Unbiased Predictors (EBLUPs) of the model. With $mse, we can extract the estimated Mean Squared Error (MSE) of the EBLUPs. The MSE is estimated using a parametric bootstrap procedure. To make the bootstrap results reproducible, we set a seed.

The MSE is an absolute measure. Often, relative measures give us a better picture of precision. We therefore calculate the coefficient of variation (CV), CV = root MSE of estimated value / estimated value * 100.

BHF_CV <- 100 * sqrt(mod_BHF$mse$mse) / mod_BHF$est$eblup$eblup

Finally, the model returns the following results.

data.frame(County_name  = cornsoybeanmeans$CountyName,
           Sample_Size  = mod_BHF$est$eblup$sampsize,
           BHF_EBLUP    = round(mod_BHF$est$eblup$eblup, digits = 2), 
           BHF_CV       = round(BHF_CV, digits = 2))
#    County_name Sample_Size BHF_EBLUP BHF_CV
# 1   CerroGordo           1    122.20   8.07
# 2     Hamilton           1    126.23   7.83
# 3        Worth           1    106.66   9.33
# 4     Humboldt           2    108.42   7.60
# 5     Franklin           3    144.31   4.88
# 6   Pocahontas           3    112.16   6.02
# 7    Winnebago           3    112.78   5.95
# 8       Wright           3    122.00   5.70
# 9      Webster           4    115.34   4.81
# 10     Hancock           5    124.41   4.50
# 11     Kossuth           5    106.89   4.53
# 12      Hardin           5    143.03   3.50

Congratulations, we have just calculated a BHF model! With the BHF model we predicted the mean number of corn hectares in 12 counties of U.S. state Iowa using satellite data as auxiliary information, see column BHF_EBLUP. The predictions range between 106.66 and 144.31. The corresponding coefficient of variation (CV) shows us the estimated relative dispersion of these predictions.

Fay-Herriot (FH) Model

Detailed unit-level survey information carries the risk that data users could identify individuals in the data. We call that disclosure risk. Therefore, especially when we are interested in detailed regional information, we usually just get access to aggregated survey information. In that case, we cannot calculate the Battese-Harter-Fuller model. But, we can calculate the Fay-Herriot (FH) model.

We illustrate the FH model using a dataset on consumer expenditures of fresh whole milk from the sae package and Example 1 in the associated R Journal article. The data is also used in the articles Arora and Lahiri (1997) and You and Chapman (2006).

If not done yet, load the package.

# Install the package if needed
if (!require('sae', quietly = TRUE)) { install.packages('sae') } 
library('sae') # Load package 'sae'

We take a look at dataset milk.

?milk     # Documentation 
str(milk) # Structure 
# 'data.frame':	43 obs. of  6 variables:
# $ SmallArea: int  1 2 3 4 5 6 7 8 9 10 ...
# $ ni       : int  191 633 597 221 195 191 183 188 204 188 ...
# $ yi       : num  1.099 1.075 1.105 0.628 0.753 ...
# $ SD       : num  0.163 0.08 0.083 0.109 0.119 0.141 0.202 0.127 0.168 0.178 ...
# $ CV       : num  0.148 0.074 0.075 0.174 0.158 0.144 0.161 0.116 0.12 0.131 ...
# $ MajorArea: int  1 1 1 1 1 1 1 2 2 2 ...

Data milk is based on survey information from the U.S. Bureau of Labor Statistics. They estimated weekly consumer expenditures on different goods, like fresh whole milk in 1989, by means of a survey sample. Based on the survey sample, they calculated direct estimates on average weekly fresh whole milk consumption for 43 publication areas across the U.S..

With str(milk), we see that the data consists of 43 observations, one for each publication area. Variable SmallArea is the identifier of the publication areas. ni is the survey sample size in each area, yi are survey direct estimates on average weekly fresh whole milk consumption in 1989. The direct estimates are design-unbiased, the variance is a measure of their precision. Variable SD the standard deviation (SD) of the direct estimates. The standard deviation is the square root of the variance.

Furthermore, we see the coefficient of variation (CV) of the estimates. We can also calculate it by the following line of code.

milk$SD / milk$yi

When taking a look at the range of the CVs, we see that the direct estimates are quite volatile.

range(milk$CV * 100)
# [1] 7.4 34.1

Therefore, it is quite reasonable that the publication areas are called SmallArea in data milk.

Let’s see whether we can get more precise estimates by applying model-based small area models. The survey data is only available at the area-level in the form of direct estimates and corresponding variances. We can therefore only calculate an area-level small area model like the Fay-Herriot model.

For that, we need auxiliary information. More precisely, we need information at the level of the publication areas. In this example, we only consider one auxiliary variable. MajorArea is a variable clustering several publication areas to a major area. You and Chapman (2006) discuss how to form these major areas.

We have all the data we need. Let’s calculate the FH model.

?mseFH # Documentation
mod_FH <- mseFH(formula   = milk$yi ~ as.factor(milk$MajorArea), 
                vardir    = milk$SD^2,
                method    = "REML",
                MAXITER   = 100,
                PRECISION = 10^(-4),
                B         = 0)
# formula:   Direct survey estimates milk$yi are the dependent variable, 
#            as.factor(milk$MajorArea) is the auxiliary variable.
# vardir:    Variance of the direct estimates.
# method:    Method used for estimating the model parameters, 
#            REML is restricted maximum likelihood.
# MAXITER:   Maximum number of iterations for the Fisher-Scoring algorithm.
# PRECISION: Convergence tolerance limit for the Fisher-scoring algorithm. 
# B:         Number of bootstrap replicates for goodness-of-fit measures.

The model output is a list. $est$eblup gives us the Empirical Best Linear Unbiased Predictors (EBLUPs) of the model. The estimated Mean Squared Error (MSE) of the EBLUPs can be seen using $mse. In the FH model, we can approximate the MSE analytically and do not need a bootstrap procedure.

We can calculate the coefficient of variation (CV), CV = root MSE of estimated value / estimated value * 100, of the EBLUPs.

FH_CV <- 100 * sqrt(mod_FH$mse) / mod_FH$est$eblup

We take a look at the model results for the first 6 publication areas.

head(data.frame(Publ_Area    = milk$SmallArea,
                Sample_Size  = milk$ni,
                Dir          = round(milk$yi, digits = 2), 
                Dir_CV       = round(milk$CV * 100, digits = 2), 
                FH_EBLUP     = round(mod_FH$est$eblup, digits = 2), 
                FH_CV        = round(FH_CV, digits = 2)))
#   Publ_Area Sample_Size  Dir Dir_CV FH_EBLUP FH_CV
# 1         1         191 1.10   14.8     1.02 11.35
# 2         2         633 1.07    7.4     1.05  7.00
# 3         3         597 1.10    7.5     1.07  7.07
# 4         4         221 0.63   17.4     0.76 12.15
# 5         5         195 0.75   15.8     0.85 11.57
# 6         6         191 0.98   14.4     0.97 11.09

This looks quite nice. The CV of the EBLUPs is lower than the CV of the direct estimators. It ranges from 7-17.5%. Good job!

# 6.996924 17.491779

To summarize, we calculated a FH model to predict the average weekly fresh whole milk consumption for 43 publication areas across the U.S. The direct estimates of the U.S. Bureau of Labor Statistics for these areas were associated with high coefficients of variation. With the FH model, we got more precise estimates.


Video & Further Resources

We saw two exemplary applications of the Battese-Harter-Fuller and Fay-Herriot model in R.

There are many more topics on small area estimation! You might want to know how to compare different candidate models and validate how well a chosen model fits your data. Check out the books Small Area Estimation and A Course on Small Area Estimation and Mixed Models.

The BHF and FH models are two standard SAE models. For certain application, you might want to include additional features into the modelling process. Maybe you want to robustify the estimation, consider measurement errors in the auxiliary information, or are interested in multivariate modelling.

Luckily, there is an ever-growing number of R packages designed for different small area model types, available on CRAN: BayesSAE, geoSAE, hbsae, JoSAE, maSAE, mcmcsae, msae, msaeDB, msaeRB, msaenet, NSAE, robustsae, rsae, sae, saeBest, saeeb, saeHB, saekernel, saeME, saemix, saeRobust, saery, saeSim, SAEval, zipsae.

If you want to see more applications of small area models for poverty mapping, health, and income estimation, take a look at the following YouTube video from the UNStats channel.


Please accept YouTube cookies to play this video. By accepting you will be accessing content from YouTube, a service provided by an external third party.

YouTube Content Consent Button Thumbnail

YouTube privacy policy

If you accept this notice, your choice will be saved and the page will refresh.


You may also be interested in the following articles on Statistics Globe:


In this article, we have shown an application of the Battese-Harter-Fuller model and the Fay-Herriot model in the R programming language. Use the comment section to get in touch for any questions.


Anna-Lena Wölwer Survey Statistician & R Programmer

This page was created in collaboration with Anna-Lena Wölwer. Have a look at Anna-Lena’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.


Subscribe to the Statistics Globe Newsletter

Get regular updates on the latest tutorials, offers & news at Statistics Globe.
I hate spam & you may opt out anytime: Privacy Policy.

Leave a Reply

Your email address will not be published.

Fill out this field
Fill out this field
Please enter a valid email address.