# Small Area Estimation in R (Example) | Unit-Level and Area-Level Model

Here, we show you how to calculate the two best-known **small area estimation (SAE) models** in the R programming language: the basic **Battese-Harter-Fuller (BHF) model**, which is a **unit-level model**, and the **Fay-Herriot (FH) model**, which is an **area-level model**. For that, we use the functions of the sae package and the applications shown in the associated R Journal article.

If you don’t know what terms like **Battese-Harter-Fuller model** mean, take a look at our article about Small Area Estimation Techniques first. When you are ready: Let’s do some small area estimation!

## Battese-Harter-Fuller (BHF) Model

We illustrate the BHF model using county-level data on corn and soy bean production from the sae package and Example 4 in the associated R Journal article. It is also the application used in the original article by Battese, Harter, and Fuller (1988), after which the BHF model is named.

Let’s load the package.

```r
# Install the package if needed
if (!require('sae', quietly = TRUE)) {
  install.packages('sae')
}
library('sae') # Load package 'sae'
```

We use two datasets for the model. Datasets *cornsoybean* and *cornsoybeanmeans* contain survey and satellite information on the corn and soy bean production in 37 sampled segments of 12 counties in the U.S. state of Iowa.

With the data, we want to estimate the **county means of corn crop hectares**. In small area terminology: The 12 counties are the **domains of interest**, the county means of corn crop hectares are the **parameters of interest**.

To calculate the BHF model we need **unit-level survey information** on our variable of interest. In this application: The number of corn hectares per sampled segment.

Furthermore, we need auxiliary information. We are interested in domain means, and the **mean** is a **linear function** of its input data. For linear functions, we need to know the domain sizes; in this application, that means we need to know how many segments there are per county. For all auxiliary variables, we need access to both the domain means and the unit-level values for each unit sampled in the survey.
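To make these requirements concrete, here is a small self-contained sketch (the data frame `pop` and its variables are invented for this illustration and are not part of the sae package) of how domain sizes and domain means could be derived from a unit-level population frame:

```r
# Toy population frame, invented for illustration: one row per segment
pop <- data.frame(county  = c(1, 1, 1, 2, 2, 3),
                  cornpix = c(300, 320, 310, 290, 305, 315))

# Domain sizes: the number of segments per county
dom_sizes <- as.data.frame(table(county = pop$county))

# Domain means of the auxiliary variable per county
dom_means <- aggregate(cornpix ~ county, data = pop, FUN = mean)
```

In the application below, this aggregated information is already provided by the *cornsoybeanmeans* dataset.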

We start with the survey data.

```r
data("cornsoybean")
?cornsoybean                      # Documentation
cornsoybean <- cornsoybean[-33, ] # Exclude an outlier
str(cornsoybean)                  # Structure
# 'data.frame': 36 obs. of 5 variables:
#  $ County     : int 1 2 3 4 4 5 5 5 6 6 ...
#  $ CornHec    : num 165.8 96.3 76.1 185.3 116.4 ...
#  $ SoyBeansHec: num 8.09 106.03 103.6 6.47 63.82 ...
#  $ CornPix    : int 374 209 253 432 367 361 288 369 206 316 ...
#  $ SoyBeansPix: int 55 218 250 96 178 137 206 165 218 221 ...
```

The unit-level survey sample data *cornsoybean* consists of 37 observations. As in the original application, observation 33 is identified as an outlier and excluded. Each of the remaining 36 lines of data contains information on a sampled segment of a *County*. From the survey, we have information on variables *CornHec* and *SoyBeansHec*, the number of hectares with corn and soy beans in the sampled segments.

*cornsoybean* is a **unit-level dataset** as the segments are the sampling units of the survey. Variables *CornPix* and *SoyBeansPix* denote the number of corn and soy bean pixels per segment, gathered from additional satellite data.

To calculate the BHF model, we additionally need to know the number of segments per domain; these are the domain sizes. Furthermore, we need the domain means of the **auxiliary variables**. This information is available in dataset *cornsoybeanmeans*, gathered from satellite data.

```r
data("cornsoybeanmeans")
?cornsoybeanmeans     # Documentation
str(cornsoybeanmeans) # Structure
# 'data.frame': 12 obs. of 6 variables:
#  $ CountyIndex          : int 1 2 3 4 5 6 7 8 9 10 ...
#  $ CountyName           : Factor w/ 12 levels "CerroGordo","Franklin",..: 1 3 11 6 2 8 10 12 9 4 ...
#  $ SampSegments         : int 1 1 1 2 3 3 3 3 4 5 ...
#  $ PopnSegments         : num 545 566 394 424 564 570 402 567 687 569 ...
#  $ MeanCornPixPerSeg    : num 295 300 290 291 318 ...
#  $ MeanSoyBeansPixPerSeg: num 190 197 205 220 188 ...
```

From *cornsoybeanmeans* we have the mean number of corn pixels *MeanCornPixPerSeg* and soy bean pixels *MeanSoyBeansPixPerSeg* per segment and the total number of segments *PopnSegments* for each of the 12 counties.

We have all the data we need. Let’s calculate the BHF model.

```r
set.seed(123) # Seed for bootstrap iterations
?pbmseBHF     # Documentation
mod_BHF <- pbmseBHF(formula  = CornHec ~ CornPix + SoyBeansPix,
                    dom      = County,
                    meanxpop = cornsoybeanmeans[, c("CountyIndex",
                                                    "MeanCornPixPerSeg",
                                                    "MeanSoyBeansPixPerSeg")],
                    popnsize = cornsoybeanmeans[, c("CountyIndex",
                                                    "PopnSegments")],
                    method   = "REML",
                    data     = cornsoybean,
                    B        = 200)
# formula:  CornHec is the dependent variable,
#           CornPix and SoyBeansPix are the auxiliary variables.
# dom:      Variable County identifies the domains of interest.
# meanxpop: County means of corn and soybean pixels per segment.
# popnsize: Domain sizes, the number of segments per county.
# method:   Method used for estimating the model parameters,
#           REML is restricted maximum likelihood.
# B:        Number of bootstrap replicates. Default is 50.
```

The BHF model returns a list. With *$est$eblup*, we can extract the **Empirical Best Linear Unbiased Predictors (EBLUPs)** of the model. With *$mse*, we can extract the estimated **Mean Squared Error (MSE)** of the EBLUPs. The MSE is estimated using a **parametric bootstrap procedure**. To make the bootstrap results reproducible, we set a seed.

The MSE is an absolute measure. Often, relative measures give us a better picture of precision. We therefore calculate the **coefficient of variation (CV)**: CV = 100 * root MSE / estimated value.

```r
BHF_CV <- 100 * sqrt(mod_BHF$mse$mse) / mod_BHF$est$eblup$eblup
```

Finally, the model returns the following results.

```r
data.frame(County_name = cornsoybeanmeans$CountyName,
           Sample_Size = mod_BHF$est$eblup$sampsize,
           BHF_EBLUP   = round(mod_BHF$est$eblup$eblup, digits = 2),
           BHF_CV      = round(BHF_CV, digits = 2))
#    County_name Sample_Size BHF_EBLUP BHF_CV
# 1   CerroGordo           1    122.20   8.07
# 2     Hamilton           1    126.23   7.83
# 3        Worth           1    106.66   9.33
# 4     Humboldt           2    108.42   7.60
# 5     Franklin           3    144.31   4.88
# 6   Pocahontas           3    112.16   6.02
# 7    Winnebago           3    112.78   5.95
# 8       Wright           3    122.00   5.70
# 9      Webster           4    115.34   4.81
# 10     Hancock           5    124.41   4.50
# 11     Kossuth           5    106.89   4.53
# 12      Hardin           5    143.03   3.50
```

Congratulations, we have just calculated a BHF model! With the BHF model, we predicted the mean number of corn hectares in 12 counties of the U.S. state of Iowa using satellite data as auxiliary information, see column *BHF_EBLUP*. The predictions range between 106.66 and 144.31. The corresponding coefficient of variation (CV) shows us the estimated relative dispersion of these predictions.
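One pattern worth noting in these results: the CV tends to shrink as the domain sample size grows. A quick sketch, using only the sample sizes and CV values printed in the table above, makes this visible:

```r
# Sample sizes and CVs copied from the BHF result table above
samp_size <- c(1, 1, 1, 2, 3, 3, 3, 3, 4, 5, 5, 5)
bhf_cv    <- c(8.07, 7.83, 9.33, 7.60, 4.88, 6.02,
               5.95, 5.70, 4.81, 4.50, 4.53, 3.50)

# Clearly negative correlation: larger samples, smaller relative error
cor(samp_size, bhf_cv)
```

This is what we expect: domains with more sampled segments provide more information, so their predictions are estimated more precisely.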

## Fay-Herriot (FH) Model

Detailed unit-level survey information carries the risk that data users could identify individuals in the data. We call that **disclosure risk**. Therefore, especially when we are interested in detailed regional information, we usually just get access to **aggregated survey information**. In that case, we cannot calculate the Battese-Harter-Fuller model. But, we can calculate the Fay-Herriot (FH) model.

We illustrate the FH model using a dataset on consumer expenditures of fresh whole milk from the sae package and Example 1 in the associated R Journal article. The data is also used in the articles Arora and Lahiri (1997) and You and Chapman (2006).

If you have not done so yet, load the package.

```r
# Install the package if needed
if (!require('sae', quietly = TRUE)) {
  install.packages('sae')
}
library('sae') # Load package 'sae'
```

We take a look at dataset *milk*.

```r
data("milk")
?milk     # Documentation
str(milk) # Structure
# 'data.frame': 43 obs. of 6 variables:
#  $ SmallArea: int 1 2 3 4 5 6 7 8 9 10 ...
#  $ ni       : int 191 633 597 221 195 191 183 188 204 188 ...
#  $ yi       : num 1.099 1.075 1.105 0.628 0.753 ...
#  $ SD       : num 0.163 0.08 0.083 0.109 0.119 0.141 0.202 0.127 0.168 0.178 ...
#  $ CV       : num 0.148 0.074 0.075 0.174 0.158 0.144 0.161 0.116 0.12 0.131 ...
#  $ MajorArea: int 1 1 1 1 1 1 1 2 2 2 ...
```

Dataset *milk* is based on survey information from the U.S. Bureau of Labor Statistics, which estimated weekly consumer expenditures on different goods, such as fresh whole milk, in 1989 by means of a survey sample. Based on the survey sample, they calculated **direct estimates** of average weekly fresh whole milk consumption for 43 publication areas across the U.S.

With *str(milk)*, we see that the data consists of 43 observations, one for each publication area. Variable *SmallArea* is the identifier of the publication areas. *ni* is the survey sample size in each area, and *yi* contains the survey direct estimates of average weekly fresh whole milk consumption in 1989. The direct estimates are design-unbiased; their variance is a measure of their precision. Variable *SD* contains the **standard deviation (SD)** of the direct estimates, which is the square root of the variance.

Furthermore, we see the **coefficient of variation (CV)** of the estimates. We can also calculate it by the following line of code.

```r
milk$SD / milk$yi
```

When taking a look at the range of the CVs, we see that the direct estimates are quite volatile.

```r
range(milk$CV * 100)
# [1]  7.4 34.1
```

Therefore, it is quite fitting that the publication areas are called *SmallArea* in the *milk* data.

Let’s see whether we can get more precise estimates by applying model-based small area models. The survey data is only available at the area-level in the form of direct estimates and corresponding variances. We can therefore only calculate an **area-level small area model** like the Fay-Herriot model.
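As a reminder of what the FH model assumes, here is a self-contained sketch (all parameter values are invented for illustration and are independent of the milk data) of its data-generating process: each area's direct estimate equals a regression prediction from the auxiliary variable, plus an area-level random effect, plus sampling error with known variance.

```r
set.seed(1)
D      <- 5                             # number of areas
x      <- c(1.0, 1.2, 0.8, 1.1, 0.9)   # auxiliary variable per area
beta   <- 2                             # regression coefficient (invented)
refvar <- 0.25                          # variance of the area random effects
vardir <- rep(0.1, D)                   # known sampling variances per area

u <- rnorm(D, mean = 0, sd = sqrt(refvar))  # area-level random effects
e <- rnorm(D, mean = 0, sd = sqrt(vardir))  # sampling errors
y <- beta * x + u + e                       # simulated direct estimates
```

The FH model then works in the opposite direction: given the observed `y`, the auxiliary data, and the known sampling variances, it estimates `beta` and `refvar` and predicts the area means.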

For that, we need **auxiliary information** at the level of the publication areas. In this example, we only consider one auxiliary variable: *MajorArea*, which clusters several publication areas into a major area. You and Chapman (2006) discuss how to form these major areas.
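Since *MajorArea* enters the model formula through *as.factor()*, R expands it into dummy variables. A minimal sketch with an invented toy factor (independent of the milk data) shows the design matrix R builds in that case:

```r
# A toy factor with three levels, invented for illustration
major <- as.factor(c(1, 1, 2, 2, 3))

# model.matrix() shows the design matrix R builds from such a formula:
# an intercept plus one dummy column per non-reference level
model.matrix(~ major)
```

So a factor with k levels contributes k - 1 dummy columns in addition to the intercept.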

We have all the data we need. Let’s calculate the FH model.

```r
?mseFH # Documentation
mod_FH <- mseFH(formula   = milk$yi ~ as.factor(milk$MajorArea),
                vardir    = milk$SD^2,
                method    = "REML",
                MAXITER   = 100,
                PRECISION = 10^(-4),
                B         = 0)
# formula:   Direct survey estimates milk$yi are the dependent variable,
#            as.factor(milk$MajorArea) is the auxiliary variable.
# vardir:    Variance of the direct estimates.
# method:    Method used for estimating the model parameters,
#            REML is restricted maximum likelihood.
# MAXITER:   Maximum number of iterations of the Fisher-scoring algorithm.
# PRECISION: Convergence tolerance limit of the Fisher-scoring algorithm.
# B:         Number of bootstrap replicates for goodness-of-fit measures.
```

The model output is a list. *$est$eblup* gives us the **Empirical Best Linear Unbiased Predictors (EBLUPs)** of the model. The estimated **Mean Squared Error (MSE)** of the EBLUPs can be accessed via *$mse*. In the FH model, we can approximate the MSE analytically and do not need a bootstrap procedure.

We can again calculate the **coefficient of variation (CV)** of the EBLUPs: CV = 100 * root MSE / estimated value.

```r
FH_CV <- 100 * sqrt(mod_FH$mse) / mod_FH$est$eblup
```

We take a look at the model results for the first 6 publication areas.

```r
head(data.frame(Publ_Area   = milk$SmallArea,
                Sample_Size = milk$ni,
                Dir         = round(milk$yi, digits = 2),
                Dir_CV      = round(milk$CV * 100, digits = 2),
                FH_EBLUP    = round(mod_FH$est$eblup, digits = 2),
                FH_CV       = round(FH_CV, digits = 2)))
#   Publ_Area Sample_Size  Dir Dir_CV FH_EBLUP FH_CV
# 1         1         191 1.10   14.8     1.02 11.35
# 2         2         633 1.07    7.4     1.05  7.00
# 3         3         597 1.10    7.5     1.07  7.07
# 4         4         221 0.63   17.4     0.76 12.15
# 5         5         195 0.75   15.8     0.85 11.57
# 6         6         191 0.98   14.4     0.97 11.09
```

This looks quite nice. The CV of the EBLUPs is lower than the CV of the direct estimates, ranging from about 7% to 17.5%. Good job!

```r
range(FH_CV)
# [1]  6.996924 17.491779
```

To summarize, we calculated an FH model to predict the average weekly fresh whole milk consumption for 43 publication areas across the U.S. The direct estimates of the U.S. Bureau of Labor Statistics for these areas were associated with high coefficients of variation. With the FH model, we got more precise estimates.
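As a quick cross-check, using only the six pairs of values printed in the result table above (not a full re-computation), the FH CVs are below the direct CVs in every shown area:

```r
# Values for the first six areas, copied from the result table above
dir_cv <- c(14.8, 7.4, 7.5, 17.4, 15.8, 14.4)
fh_cv  <- c(11.35, 7.00, 7.07, 12.15, 11.57, 11.09)

all(fh_cv < dir_cv)
# [1] TRUE
```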

## Video & Further Resources

We saw two exemplary applications of the Battese-Harter-Fuller and Fay-Herriot model in R.

There are many more topics on small area estimation! You might want to know how to **compare different candidate models** and **validate how well a chosen model fits your data**. Check out the books Small Area Estimation and A Course on Small Area Estimation and Mixed Models.

The BHF and FH models are two standard SAE models. For certain applications, you might want to include additional features in the modelling process. Maybe you want to **robustify the estimation**, consider **measurement errors in the auxiliary information**, or are interested in **multivariate modelling**.

Luckily, there is an ever-growing number of **R packages designed for different small area model types**, available on CRAN: BayesSAE, geoSAE, hbsae, JoSAE, maSAE, mcmcsae, msae, msaeDB, msaeRB, msaenet, NSAE, robustsae, rsae, sae, saeBest, saeeb, saeHB, saekernel, saeME, saemix, saeRobust, saery, saeSim, SAEval, zipsae.

If you want to see more applications of small area models for poverty mapping, health, and income estimation, take a look at the following YouTube video from the UNStats channel.


You may also be interested in the following articles on Statistics Globe:

- Small Area Estimation Techniques
- Fay-Herriot Small Area Estimator
- Battese-Harter-Fuller Small Area Estimator
- Introduction to R Programming for Beginners

In this article, we have shown an application of the Battese-Harter-Fuller model and the Fay-Herriot model in the R programming language. Use the comment section to get in touch for any questions.

This page was created in collaboration with Anna-Lena Wölwer. Have a look at Anna-Lena’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.


## 7 Comments

This information is very helpful!

I was wondering how the following value is calculated from modelling the “milk” data?

mod_FH$est$fit$refvar

I tried to reproduce this value by the following steps:

1. Apply the model intercept and beta coefficients to the data and calculate the fitted value

2. Get residual = fitted value - yi

3. Get random effect = residual - SD/sqrt(ni), where the latter is the sampling error (standard error) of the estimate

4. Calculate the variance of random effect

I was hoping that the result from Step 4 reproduces mod_FH$est$fit$refvar, they are not the same. Can you please check what I did wrong here. Thank you!

My R codes:

```r
##### 1 model coefficient
coeff <- as.data.frame(FH$est$fit$estcoef)
coeff <- tibble::rownames_to_column(coeff, "Parameter")
coeff <- coeff %>%
  mutate(MajorArea = as.numeric(stri_sub(Parameter, -1)))

##### 2 merge with model data
milk <- left_join(milk, coeff[, c(2, 6)])
milk <- milk %>%
  mutate(beta = ifelse(is.na(beta), 0, beta))

##### 3 calculate fitted value (model estimate, not composite estimate yet)
milk <- milk %>%
  mutate(intercept = subset(coeff[, 2], coeff$Parameter == "X(Intercept)"),
         estvalue  = intercept + beta)

##### 4 model residual
milk$residual  <- milk$yi - milk$estvalue
milk$randomeff <- milk$residual - milk$SD / sqrt(milk$ni)
milk %>% summarise(estvariance = var(randomeff))
# estvariance
# [1] 0.03117389

FH$est$fit$refvar
# [1] 0.01855022

# I expected that the two values above are equal. But they are not. Why?
```

Hey Stanley,

Thank you for the kind comment, glad you find the article helpful!

I have forwarded your question to Anna-Lena Wölwer. She has more experience with Small Area estimation techniques.

Regards,

Joachim

Hi Joachim,

Thank you for your reply! Looking forward to hearing back from your team.

Stanley

Dear Stanley,

Thanks for your reply. I am not 100% sure what exactly you calculated. Make sure not to mix up residuals and random effects.

When you want to calculate the estimated random effects from the model: Take a look at the formulas of the FH model. In the following, I show you two ways to calculate the random effects from the code of the blog post.

Hope this helps!

Best

Anna-Lena

Hi Anna-Lena,

Thank you for your kind advice. I was trying to reproduce the value from mod_FH$est$fit$refvar, that is, random effect variance. Your Xbeta is the same as what I calculated as the vector estvalue. So if I subtract Xbeta from eblup, would I get random effects? If I further take the variance of (eblup-Xbeta), would I get the same value as mod_FH$est$fit$refvar? I tried, but they are still not the same.

Any of your feedback will be appreciated!

Stanley

Hi Stanley,

These two are not the same. You can see that from a small experiment using the code presented on Fay-Herriot Small Area Estimator (Example in R) | Area-Level Model.

To be more precise, here is a small code example with an increasing number of areas, where we generate data according to the FH model:

Best

Anna-Lena

Thank you so much Anna-Lena! I will save the codes for my deep dive analysis.

Best

Stanley