Small Area Estimation Techniques | Analysis of Income, Poverty & Health

 

When working with small-scale income, health, or poverty measures, you’ve probably come across Small Area Estimation (SAE) techniques. SAE techniques impact our everyday life as they often serve as the data basis for political decision-making. For example, the World Bank uses SAE techniques for poverty mapping, and the U.S. Census Bureau uses them for its Small Area Income and Poverty Estimates (SAIPE). In this post, we give you an overview of SAE techniques. If you are new to survey statistics, do not worry, we also summarize some survey basics before turning to small area estimation.

We cover the following topics: survey basics, small area estimation, the Fay-Herriot model, the Battese-Harter-Fuller model, further notes on small area models, and further resources.

 

Anna-Lena Wölwer, Survey Statistician & R Programmer

This page was created in collaboration with Anna-Lena Wölwer. Please have a look at Anna-Lena’s author page to get more information about her academic background and the other articles she has written for Statistics Globe.

 

Survey Basics

To understand small area estimation, we need to have an idea of survey estimation and understand the terms domain, direct estimator, indirect estimator, mean squared error (MSE), and variance. If you know these terms already, you can go straight to the next section.

What is a survey sample?

For political decisions, we need reliable data. That’s why official statistical agencies conduct nationwide surveys like a census or yearly household surveys. In a survey, a random sample is drawn from the target population, e.g. the citizens of a country, and interviewed. With this sample, we can estimate quantities of the population such as the total number of persons in a specific age class, or socio-economic indicators for poverty or health.
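As a small illustration, here is a minimal base R sketch (all numbers are made up) of drawing a simple random sample from a synthetic population and estimating a population total from it:

# Synthetic population of N persons; the 0/1 variable indicates membership
# in a specific age class (made-up example data)
set.seed(1)
N <- 1000000
in_age_class <- rbinom(N, size = 1, prob = 0.18)

# Draw a simple random sample of n persons and estimate the population total
n <- 10000
s <- sample.int(N, size = n)
total_estimate <- N * mean(in_age_class[s])

total_estimate          # estimated number of persons in the age class
sum(in_age_class)       # true total (only known here because we simulated the population)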

What is a domain or an area?

Estimates for the full population are all well and good, but mostly we are interested in much more detailed domain-specific information. A domain is a population sub-group, also called area or sub-population. Domains can be defined by regional, temporal, or demographic aspects, as well as combinations of these three. For example, we can define our domains of interest as the cross-combinations of 5 states X 5 age classes X 12 months (a total of 300 domains). The states are regional information, the months are temporal information, and the age classes are demographic information. With a survey, we want to get estimates for various variables like poverty, living conditions, or employment at various domain levels, for example for states, counties, and school districts.
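As a quick illustration of such cross-combinations, the following base R snippet builds the 300 domains from the example above (the state and age class labels are made up):

# All cross-combinations of 5 states, 5 age classes, and 12 months
domains <- expand.grid(
  state     = paste0("state_", 1:5),
  age_class = paste0("age_class_", 1:5),
  month     = month.name
)

nrow(domains)   # 5 * 5 * 12 = 300 domains
head(domains)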

What is a direct estimator?

There are different ways in which we can use sample information for domain estimation. A direct estimator for a domain A considers only the sample information from domain A for the estimation. For example, if you wanted to estimate the number of employed persons in a state, you would only consider sample information from that state. Pretty straightforward, right?
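For illustration, here is a minimal base R sketch of a direct estimator: the domain-wise sample mean, computed from a made-up, equal-probability sample. With unequal inclusion probabilities, you would use the survey weights instead of a plain mean.

# Made-up sample: 500 interviewed persons with their state and employment status
set.seed(123)
survey_sample <- data.frame(
  state    = sample(paste0("state_", 1:5), size = 500, replace = TRUE),
  employed = rbinom(500, size = 1, prob = 0.6)
)

# Direct estimate of the employment rate per state:
# each state's estimate uses only the sample observations from that state
direct_estimates <- aggregate(employed ~ state, data = survey_sample, FUN = mean)
direct_estimates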

What is an indirect estimator?

You probably already guessed it: When there are direct estimators, there are also indirect estimators. To produce an estimate for domain A, an indirect estimator uses not only sample information from domain A, but also information from other domains. The idea is to borrow strength by combining the information of different domains and thereby increase the effective sample size of the estimation. Indirect estimators use implicit or explicit models to formulate a link between the information of different domains. Indirect estimators are the key methods used for small area estimation.

What is auxiliary information in survey estimates?

If you are familiar with data modelling, you have probably heard the terms auxiliary information or covariates many times. Auxiliary information is additional information which we can use in the estimation process of a survey. This could be: sample information from other domains, information from a different survey, information from the last Census, or register information like tax records. Both direct and indirect estimators can make use of auxiliary information.

How precise is an estimator?

A random sample is a random sample! That means, for each random draw we get a different sample and thus different estimates. The mean squared error (MSE) of an estimator represents its dispersion among the different possible samples and helps us to see how much we can trust individual estimates. For the MSE, we have the following relationship: \(MSE = Variance + Bias^2\). An estimator with non-zero bias is called biased. For an unbiased estimator, the MSE equals its variance. If a country consists of 1 million persons, then an estimator based on random samples of 10,000 persons will certainly have a lower MSE than the same estimator based on random samples of 100 persons. Only with the MSE or variance can we say how reliable certain survey estimates are.
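The following base R simulation (with made-up numbers) illustrates this decomposition for the sample mean: the simulated MSE is approximately the variance plus the squared bias, and it shrinks as the sample size grows.

# Approximate variance, squared bias, and MSE of the sample mean by simulation
set.seed(42)
population <- rnorm(1000000, mean = 170, sd = 10)   # e.g., body heights in cm
true_mean  <- mean(population)

mse_by_simulation <- function(n, reps = 2000) {
  estimates <- replicate(reps, mean(sample(population, size = n)))
  bias      <- mean(estimates) - true_mean
  c(variance = var(estimates),
    bias_sq  = bias^2,
    mse      = mean((estimates - true_mean)^2))
}

rbind(n_100 = mse_by_simulation(100), n_10000 = mse_by_simulation(10000))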

Direct versus indirect estimators

Indirect estimators seem nice: they use more information and models. So why don’t we always use indirect instead of direct estimators? Well, both have advantages and disadvantages. Direct estimators are (asymptotically) unbiased with respect to the sampling process. For domains with large sample sizes, they yield reliable estimates without the need for any modelling process. For domains with small sample sizes, however, they can exhibit large variances. Many indirect estimators, at least the small area models we cover here, are model-based. Model-based estimators are generally not unbiased with respect to the sampling process. On the other hand, their MSE does not depend as strongly on the domain sample sizes as that of direct estimators. Therefore, for domains with small sample sizes, indirect estimators can give much more precise estimates than direct estimators. But be careful here! The features of model-based estimators strongly depend on how well the chosen model fits the data. If the model is poorly chosen, the estimators can be severely biased and have higher MSEs than direct estimators!

If you want to know more about the theory of survey statistics, we recommend the book Sampling: Design and Analysis for a general introduction and Model-Assisted Survey Sampling if you want to become a real survey nerd.

 

Small Area Estimation

Now that we have brushed up on our survey knowledge, we finally get to Small Area Estimation.

What is a small area?

One could guess that a small area is just a domain which is particularly small. Well, not necessarily. Surveys are designed to produce precise direct estimates for chosen key domains of interest. For other domains, the sample sizes can be so small that direct estimators have a high variance. For example, direct estimates calculated from a survey can be accurate for individual states but inaccurate for the much smaller counties. Domains for which direct estimators are not accurate enough are called small areas or small domains. Whether a domain is considered small therefore does not depend on the size of the domain itself, but on the sample size in the domain. The question is: Does the sample size in the domain result in precise direct estimators? If not, we call it a small domain or small area.

How to get accurate estimates in small areas?

To get accurate estimates for small areas or small domains, there are two possibilities: (1) We could change the sampling design and increase the sample sizes in the domains. This would decrease the variance of the direct estimators. However, surveys have cost limitations. It is not possible to have high sample sizes for all potential domains of interest. (2) We can apply Small Area Estimation techniques. SAE techniques are designed to handle the problem of small areas by the use of indirect estimators. Indirect estimators combine sample information from different domains and potentially additional auxiliary information. They link the different information by use of implicit models or explicit models. There are many different ways in which one can formulate the models for indirect estimators.

We want to introduce two of the most famous of these small area models: the Fay-Herriot (FH) model, which is an area-level model, and the Battese-Harter-Fuller (BHF) model, which is a unit-level model. The models are special kinds of linear mixed models (LMMs). We make it easy for ourselves at this point by simply assuming knowledge of the theory of mixed models. If you want to know more about mixed model theory and how so-called Empirical Best Linear Unbiased Predictors (EBLUPs) are derived under these models, we recommend taking a look at Mixed Models: Theory and Applications with R for a general introduction to mixed models and A Course on Small Area Estimation and Mixed Models for an overview with a focus on small area models.

 

The Fay-Herriot Model

Let’s start with some notation. Consider a population \(U\) consisting of \(N\) persons, for example, all persons in a country. We want to calculate estimates for all states in the population. We therefore consider the \(D\) states as sub-populations of \(U\) and denote them by \(U_d\), \(d=1,\ldots,D\), with \(N_d\) persons in state \(d\). Variable \(y_{i}\), \(i=1,\ldots,N\), is the height of person \(i\). Let’s assume we wanted to estimate the average height of people in each state. The average height is defined as the domain average of the \(y_i\), \(\mu_d=\frac{1}{N_d} \sum_{i \in U_d} y_{i}\).

The Fay-Herriot (FH) model is defined in two stages. The first stage is the sampling model:
\begin{equation}
\hat{\mu}^{Dir}_d = \mu_d + e_d,\quad d=1,\ldots,D.
\end{equation}

We go through the formula step by step. The \(\mu_d\) are our parameters of interest, the average height of the persons in state \(d\). By use of a sample, we can estimate these parameters via direct estimators; \(\hat{\mu}^{Dir}_d\) are the resulting estimates. The estimates are associated with sampling errors \(e_d\). Sampling errors \(e_d \sim N(0, \sigma^2_{ed})\) are normally distributed random variables with zero mean (as direct estimators are unbiased) and variances \(\sigma^2_{ed}\). In most small area models, we take the variances \(\sigma^2_{ed}\) as fixed, known quantities. Note, however, that the variances \(\sigma^2_{ed}\) of the direct estimators are in fact also estimated!

The second stage is the linking model:
\begin{equation}
\mu_d = \boldsymbol{x}_d^{\top} \boldsymbol{\beta} + u_d,\quad d=1,\ldots,D.
\end{equation}

Again, let’s go through the different quantities. In the model, we assume that the parameters of interest \(\mu_d\) are linearly related to a vector of \(p\) auxiliary variables with values \(\boldsymbol{x}_d\). For example, the average height of persons in a state could be related to the age and sex distribution in the state. We could then use state-specific information on age and sex distributions as auxiliary information. In the standard FH model, we assume a linear relationship between \(\boldsymbol{x}_d\) and \(\mu_d\) and that the model holds for all \(D\) domains. Therefore, we have the same \(p\)-vector of fixed effects \(\boldsymbol{\beta}\) for all domains. In addition to the fixed effects, we include random effects \(u_d\) in the model. Random effects \(u_d \sim N(0, \sigma^2_{u})\) are assumed to be independently and identically distributed normal random variables with mean zero and variance \(\sigma^2_u\). Sampling errors \(e_d\) and random effects \(u_d\) are assumed to be independent.

Putting both stages of the model together, we have the Fay-Herriot model
\begin{equation}
\hat{\mu}^{Dir}_d = \boldsymbol{x}_d^{\top} \boldsymbol{\beta} + u_d + e_d,\quad d=1,\ldots,D.
\end{equation}

All quantities which go into the FH model are defined at the domain or area level. It is therefore called an area-level model. Based on LMM theory, we can calculate an estimate \(\hat{\sigma}^2_u\) of the variance component \(\sigma^2_u\) and an estimate \(\hat{\boldsymbol{\beta}}\) of the fixed effects \(\boldsymbol{\beta}\). For that, we can for example use maximum likelihood (ML) or restricted maximum likelihood (REML) estimation. For these likelihood-based estimations, the distributional assumptions for the random effects and sampling errors are essential. By applying mixed model theory to the FH model, we can derive small area predictions of the domain parameters \(\mu_d\) and estimates of their MSE. The so-called Empirical Best Linear Unbiased Predictor (EBLUP) of \(\mu_d\) under the FH model is given by
\begin{equation}
\hat{\mu}_d^{FH} = \hat{\gamma}_d \hat{\mu}^{Dir}_d + (1-\hat{\gamma}_d) \boldsymbol{x}_d^{\top} \hat{\boldsymbol{\beta}},\quad d=1,\ldots,D
\end{equation}
with shrinkage factor
\begin{equation}
\hat{\gamma}_d = \frac{\hat{\sigma}^2_u}{\hat{\sigma}^2_u + \sigma^2_{ed}}.
\end{equation}

The FH model is quite useful as the information needed to calculate it is not very confidential: it consists of domain aggregates. To calculate the model, all we need are direct estimates \(\hat{\mu}^{Dir}_d\) with variances \(\sigma^2_{ed}\) and domain-specific auxiliary information \(\boldsymbol{x}_d\) for all \(D\) domains.
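As a rough sketch of how this could look in R, the following uses the eblupFH() function from the sae package. The data frame fh_data and its columns are hypothetical, and the argument names and return values should be double-checked against the sae package documentation (?eblupFH).

# Sketch: fitting a Fay-Herriot model with the sae package (hypothetical data)
# fh_data is assumed to contain one row per domain with
#   mu_dir  - direct estimate of the domain mean
#   var_dir - variance of the direct estimator, treated as known
#   x1      - domain-level auxiliary variable
# install.packages("sae")
library(sae)

fh_fit <- eblupFH(mu_dir ~ x1, vardir = var_dir, method = "REML", data = fh_data)

fh_fit$eblup         # EBLUPs of the domain means under the FH model
fh_fit$fit$estcoef   # estimated fixed effects (beta)
fh_fit$fit$refvar    # estimated random effects variance (sigma^2_u)

# Shrinkage weights gamma_d as in the formula above
gamma_d <- fh_fit$fit$refvar / (fh_fit$fit$refvar + fh_data$var_dir)

# mseFH() works analogously and additionally returns MSE estimates of the EBLUPs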

 

The Battese-Harter-Fuller Model

If we are lucky and do not only have aggregate information, but also information on the individual sampling units, we can calculate not only an FH model, but also a Battese-Harter-Fuller (BHF) model. The BHF model is a unit-level model. In the literature, you will also come across the terms nested error regression model and random intercept model for the BHF model.

Again, consider a population \(U\) consisting of \(N\) persons and \(D\) domains of interest, for example states. We can partition population \(U\) into sub-populations \(U_d\) of size \(N_d\), \(N=\sum_{d=1}^{D}N_d\), \(d=1,\ldots,D\). The sample size of the survey is \(n\); in the individual domains it is \(n_d\), \(n=\sum_{d=1}^{D}n_d\). The total sample \(s\) can also be partitioned into the domain-specific samples \(s_d\).

The BHF model is defined as
\begin{equation}
y_{di} = \boldsymbol{x}_{di}^{\top} \boldsymbol{\beta} + u_d + e_{di},\quad d=1,\ldots,D,\quad i=1,\ldots, N_d.
\end{equation}
In this model, \(y_{di}\) is the value of our variable of interest for the \(i\)th person in domain \(d\), for example the height of that person. The auxiliary information is now also at the unit level: \(\boldsymbol{x}_{di}\) is the \(p\)-vector of auxiliary information for the \(i\)th person in domain \(d\), for example the age class and sex of the person. As in the FH model, \(u_d \sim N(0, \sigma^2_{u})\) are domain-specific random effects with zero expectation and variance \(\sigma^2_u\). Random errors \(e_{di} \sim N(0, \sigma^2_{e})\) have zero expectation and variance \(\sigma^2_e\). We assume independence of the \(u_d\) and \(e_{di}\).

Again, our analysis is focused on estimating domain averages \(\mu_d = \frac{1}{N_d} \sum_{i \in U_d} y_{di}\). In this simple case, for calculating the BHF model we only need the unit-level auxiliary information \(\boldsymbol{x}_{di}\) to be known for the persons in the sample. In addition, we need the domain averages \(\bar{\boldsymbol{x}}_{U_d}=\frac{1}{N_d} \sum_{i \in U_d} \boldsymbol{x}_{di}\) to be known, for example from Census records or registers.

With mixed model theory, we can calculate a variance component estimate \(\hat{\sigma}^2_u\), an estimate of the model error variance \(\hat{\sigma}^2_e\), and fixed effect estimates \(\hat{\boldsymbol{\beta}}\), for example via ML or REML based on the normality assumptions. The so-called Empirical Best Linear Unbiased Predictor (EBLUP) of \(\mu_d\) under the BHF model is given by
\begin{equation}
\hat{\mu}_d^{BHF} = \hat{\gamma}_d (\bar{y}_d - \bar{\boldsymbol{x}}_d^{\top} \hat{\boldsymbol{\beta}} ) + \bar{\boldsymbol{x}}_{U_d}^{\top} \hat{\boldsymbol{\beta}},\quad d=1,\ldots,D
\end{equation}
with shrinkage factor
\begin{equation}
\hat{\gamma}_d = \frac{\hat{\sigma}^2_u}{\hat{\sigma}^2_u + \hat{\sigma}^2_{e}/n_d},
\end{equation}
and sample means
\begin{equation}
\bar{y}_d=\frac{1}{n_d} \sum_{i \in s_d} {y}_{di},\quad \bar{\boldsymbol{x}}_d=\frac{1}{n_d} \sum_{i \in s_d} \boldsymbol{x}_{di}.
\end{equation}

Note that for non-linear parameters such as most poverty indicators, we would in fact need the auxiliary information known for all units of the population in order to calculate the BHF model.
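As a rough sketch, the BHF model can be fitted with the eblupBHF() function of the sae package. All data objects below are hypothetical, and the exact input conventions (e.g., the column order of the population means and population sizes) should be checked against the sae documentation (?eblupBHF).

# Sketch: fitting a Battese-Harter-Fuller model with the sae package (hypothetical data)
# unit_data: one row per sampled person with
#   y   - variable of interest (e.g., height)
#   x1  - unit-level auxiliary variable
#   dom - domain (e.g., state) the person belongs to
# xmean_pop: one row per domain with the domain code and the population mean of x1
# pop_size:  one row per domain with the domain code and the population size N_d
library(sae)

bhf_fit <- eblupBHF(y ~ x1, dom = dom,
                    meanxpop = xmean_pop, popnsize = pop_size,
                    method = "REML", data = unit_data)

bhf_fit$eblup   # EBLUPs of the domain means under the BHF model
bhf_fit$fit     # fitted linear mixed model (variance components, fixed effects)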

 

Further Notes on Small Area Models

We have seen the formulas of two of the most prominent SAE models, the Fay-Herriot and Battese-Harter-Fuller model. With both models, we get model-based predictions of the parameters of interest, for example of the average height of the persons in a state. Depending on the availability of sample and auxiliary data, we can either use the unit-level BHF model or the area-level FH model.

For small domains, both the FH \(\hat{\mu}_d^{FH}\) and BHF \(\hat{\mu}_d^{BHF}\) predictions can be much more precise than direct estimates \(\hat{\mu}_d^{Dir}\) in terms of their MSE. However, the FH and BHF predictions are model dependent! Their validity strongly depends on the validity of the model. Therefore, one has to choose the auxiliary information in the model carefully and validate how well the chosen model reflects the data.

It can happen that, even when the auxiliary information is carefully chosen, the FH or BHF model just does not fit the data at hand. In this case, one would need to apply other variants of the models. At the Small Area Estimation conference, we can see many extensions of small area models tailored to the needs of specific applications. For example, the models have been extended to generalized linear mixed models (GLMMs); to include measurement errors in the auxiliary information; to consider non-normal distributions of random effects; or to consider non-linear link functions. Furthermore, a large area of research that we have left out of this introduction is the estimation of the MSE of small area predictions.

 

Video, Further Resources & Summary

Do you need more explanations and information on some recent applications of Small Area Estimation? Then you should have a look at the following YouTube video from the INEGIInforma YouTube channel. In the video, SAE expert J. N. K. Rao gives a detailed introduction to different SAE techniques. He is also co-author of the book Small Area Estimation, which we highly recommend if you are interested in knowing more about small area estimation.

 

 

You may also have a look at the other SAE articles on this website. They provide additional theoretical explanations as well as example code in the R programming language.

This post has given an introduction to Small Area Estimation. In case you have further questions, you may leave a comment below.

 
