GENERALIZED LEAST SQUARES (Social Science)

Generalized least squares (GLS) is a method for fitting coefficients of explanatory variables that help to predict the outcomes of a dependent random variable. As its name suggests, GLS includes ordinary least squares (OLS) as a special case. GLS is also called "Aitken's estimator," after A. C. Aitken (1935). The principal motivation for generalizing OLS is the presence of covariance among the observations of the dependent variable or of different variances across these observations, conditional on the explanatory variables. Both phenomena lead to problems with statistical inference procedures commonly used with OLS. Most critically, the standard methods for estimating sampling variances and testing hypotheses become biased. In addition, the OLS-fitted coefficients are less precise than the GLS-fitted coefficients.

In its simplest form, the linear model of statistics postulates the existence of a linear conditional expectation for a scalar dependent random variable $y_n$ given a set of nonrandom scalar explanatory variables $x_{n1}, \dots, x_{nK}$:

$$E[y_n] = \beta_1 x_{n1} + \cdots + \beta_K x_{nK}, \qquad n = 1, \dots, N,$$

or, stacking the observations, $E[\mathbf{y}] = \mathbf{X}\boldsymbol\beta$ for the $N \times K$ matrix $\mathbf{X}$ of explanatory variables and the $K \times 1$ vector of coefficients $\boldsymbol\beta$.


In addition, the linear model assumes that the variances of the $y_n$ are equal to a common, finite positive constant $\sigma^2$ and that the covariances among the $y_n$ are equal to zero. In matrix notation, these assumptions assign to $\mathbf{y}$ a scalar variance-covariance matrix:

$$\operatorname{Var}[\mathbf{y}] = \sigma^2 \mathbf{I},$$

where $\mathbf{I}$ denotes an $N \times N$ identity matrix. The fundamental difference between such a linear model and one leading to generalized least squares is that the latter permits an unrestricted variance-covariance matrix, often denoted by

$$\operatorname{Var}[\mathbf{y}] = \boldsymbol\Sigma,$$

where $\boldsymbol\Sigma$ is a symmetric, positive definite $N \times N$ matrix.

Many authors refer to the generalized model as the linear model with nonspherical errors. This term derives, in part, from viewing $\mathbf{y}$ as the sum of $\mathbf{X}\boldsymbol\beta$ and an additional, unobserved variable that is an error term. Rather than making assumptions about the observable $\mathbf{y}$ and $\mathbf{X}$ as above, these writers make equivalent assumptions about the unobserved error term. The term nonspherical refers to the type of variance-covariance matrix possessed by the error term. Multivariate distributions with scalar variance-covariance matrices are often called spherical. This term can be traced to interpreting the set

$$\{\, \mathbf{z} \in \mathbb{R}^N : \mathbf{z}'\mathbf{z} = \sigma^2 \,\}$$

as an $N$-dimensional sphere (or spheroid) with radius $\sigma$. In the nonscalar case, the set

$$\{\, \mathbf{z} \in \mathbb{R}^N : \mathbf{z}'\boldsymbol\Sigma^{-1}\mathbf{z} = 1 \,\}$$

is an $N$-dimensional ellipsoid, and distributions with nonscalar variance-covariance matrices are called nonspherical. Hence, a linear regression accompanied by a nonscalar variance-covariance matrix may be called the case with nonspherical errors.

EXAMPLES

Leading examples motivating nonscalar variance-covariance matrices include heteroskedasticity and first-order autoregressive serial correlation. Under heteroskedasticity, the variances $\sigma_{nn}$ differ across observations $n = 1, \dots, N$, but the covariances $\sigma_{mn}$, $m \neq n$, all equal zero. This occurs, for example, in the conditional distribution of individual income given years of schooling, where high levels of schooling correspond to relatively high levels of the conditional variance of income. This heteroskedasticity is explained in part by the narrower range of job opportunities faced by people with low levels of schooling compared to those with high levels.

Serial correlation arises in time-series data, where the observations are ordered sequentially by the time period of each observation; $y_n$ is observed in the $n$th time period. First-order autoregressive (AR(1)) serial correlation occurs when deviations from means (also called errors) satisfy the linear model

$$\varepsilon_n = \rho\,\varepsilon_{n-1} + u_n, \qquad \varepsilon_n \equiv y_n - E[y_n], \qquad |\rho| < 1,$$

where the $u_n$ are uncorrelated with zero means and equal variances,

while maintaining the assumption that the marginal variance of $y_n$ equals a constant $\sigma^2$. Nonzero covariances of the form

$$\operatorname{Cov}[y_n, y_m] = \sigma^2 \rho^{\,|n-m|}, \qquad m \neq n, \tag{2}$$

are implied by the recursion

$$\varepsilon_n = u_n + \rho\,u_{n-1} + \rho^2 u_{n-2} + \cdots = \sum_{j=0}^{\infty} \rho^j u_{n-j}.$$

A time series of monthly unemployment rates exhibits such autoregressive serial correlation, reflecting unobserved social, economic, and political influences that change relatively slowly as months pass.

A second leading example of serial correlation occurs in panel data models, designed for datasets with two sampling dimensions, typically one cross-sectional and the other time-series. Repetitive testing of a cross-section of subjects in a laboratory gives this structure, as do repeated surveys of a cross-section of households. Panel data models are usually expressed in an error components form:

$$y_{nt} = \mathbf{x}_{nt}'\boldsymbol\beta + \alpha_n + u_{nt}, \qquad \operatorname{Cov}[y_{nt}, y_{ms}] = \sigma_\alpha^2$$

for $m = n$ and $t \neq s$, where $\alpha_n$ is an unobserved individual effect with variance $\sigma_\alpha^2$ and the $u_{nt}$ are uncorrelated errors with variance $\sigma_u^2$. Unlike the AR(1) case, this covariance does not diminish as the time between observations increases. Instead, all of the observations for an individual are equally correlated.

Correlation also occurs in cross-sectional data. In the seemingly unrelated regressions (SUR) setting, there are several dependent variables and corresponding mean functions:

$$E[y_{nj}] = \mathbf{x}_{nj}'\boldsymbol\beta_j, \qquad j = 1, \dots, J, \quad n = 1, \dots, N.$$

Such dependent variables are typically related as different characteristics of a single experimental or observational unit. For example, the $y_{nj}$ might be test scores on substantively different tests written by the same individual. Even after accounting for observable differences among the tests and test takers with $\mathbf{x}_{nj}$, covariance among the test scores may reflect the influence of unobserved personal abilities that affect all of the tests taken by a particular person. Alternatively, the $y_{nj}$ could be total income in countries during the same time period, where neighboring states possess similar underlying characteristics or face similar environments that induce covariance among their incomes.

STATISTICAL ISSUES

The general linear model motivates two principal issues with statistical inferences about $\boldsymbol\beta$ in the simpler linear model. First, hypothesis tests and estimators of sampling variances and confidence intervals developed under the linear model are biased when $\boldsymbol\Sigma$ is not scalar. Second, the OLS estimator for $\boldsymbol\beta$ generally will not be the minimum-variance linear unbiased estimator. The OLS estimator

$$\hat{\boldsymbol\beta}_{\mathrm{OLS}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

is a linear (in $\mathbf{y}$) and unbiased estimator even when $\boldsymbol\Sigma$ is not scalar. However, its sampling variance is

$$\operatorname{Var}[\hat{\boldsymbol\beta}_{\mathrm{OLS}}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol\Sigma\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}, \tag{3}$$

whereas the GLS estimator

$$\hat{\boldsymbol\beta}_{\mathrm{GLS}} = (\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{y}$$

is the minimum-variance linear and unbiased estimator. Its variance-covariance matrix is

$$\operatorname{Var}[\hat{\boldsymbol\beta}_{\mathrm{GLS}}] = (\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{X})^{-1}.$$

GLS can be understood as OLS applied to a linear model transformed to satisfy the scalar variance-covariance assumption. Given a factorization $\boldsymbol\Sigma = \mathbf{A}\mathbf{A}'$, the transformed variable $\mathbf{y}^* = \mathbf{A}^{-1}\mathbf{y}$ has a scalar variance-covariance matrix,

$$\operatorname{Var}[\mathbf{y}^*] = \mathbf{A}^{-1}\boldsymbol\Sigma\,(\mathbf{A}^{-1})' = \mathbf{I},$$

so that the expectation of the transformed $\mathbf{y}^*$ has corresponding transformed explanatory variables $\mathbf{X}^* = \mathbf{A}^{-1}\mathbf{X}$. Applying OLS to estimate $\boldsymbol\beta$ with the transformed variables yields the GLS estimator:

$$\hat{\boldsymbol\beta}_{\mathrm{GLS}} = (\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{y}^* = (\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{y}.$$

In a similar fashion, one sees that the variance-covariance matrices correspond,

$$\operatorname{Var}[\hat{\boldsymbol\beta}_{\mathrm{GLS}}] = (\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1} = (\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{X})^{-1},$$

and that the OLS criterion function is transformed into the GLS criterion function:

$$(\mathbf{y}^* - \mathbf{X}^*\mathbf{b})'(\mathbf{y}^* - \mathbf{X}^*\mathbf{b}) = (\mathbf{y} - \mathbf{X}\mathbf{b})'\boldsymbol\Sigma^{-1}(\mathbf{y} - \mathbf{X}\mathbf{b}).$$
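A minimal numerical sketch of this equivalence, in Python with simulated data and an assumed AR(1)-patterned covariance matrix (all values are arbitrary illustrations, not part of the original exposition):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3

# Simulated explanatory variables and coefficients (arbitrary illustrative values).
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta = np.array([1.0, 2.0, -0.5])

# A nonscalar variance-covariance matrix Sigma with AR(1)-patterned covariances.
rho, sigma2 = 0.6, 1.0
Sigma = sigma2 * rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

# Draw y with E[y] = X beta and Var[y] = Sigma.
A = np.linalg.cholesky(Sigma)          # Sigma = A A'
y = X @ beta + A @ rng.normal(size=N)

# GLS computed directly: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

# GLS as OLS on the transformed variables y* = A^{-1} y, X* = A^{-1} X.
y_star = np.linalg.solve(A, y)
X_star = np.linalg.solve(A, X)
beta_star = np.linalg.lstsq(X_star, y_star, rcond=None)[0]

assert np.allclose(beta_gls, beta_star)  # the two computations coincide
```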

Heteroskedasticity produces a simple example. To produce observations with equal variances, each data point is divided by the standard deviation

$$\sigma_n = \sqrt{\sigma_{nn}}, \qquad \text{so that} \qquad y_n^* = \frac{y_n}{\sigma_n}, \quad x_{nk}^* = \frac{x_{nk}}{\sigma_n}.$$

This corresponds to choosing $\mathbf{A}^{-1}$ equal to a diagonal matrix with the reciprocals of these standard deviations arrayed along its diagonal. The estimation criterion function is

$$\sum_{n=1}^{N} \left(\frac{y_n - \mathbf{x}_n'\mathbf{b}}{\sigma_n}\right)^2 = \sum_{n=1}^{N} \frac{(y_n - \mathbf{x}_n'\mathbf{b})^2}{\sigma_{nn}},$$

which is a weighted sum of squared residuals. For this reason, in this special case GLS is often called weighted least squares (WLS). WLS puts the most weight on the observations with the smallest variances, showing how GLS improves upon OLS, which puts equal weight on all observations. Those observations for which $\sigma_{nn}$ is relatively small tend to be closest to the mean of $y_n$ and, hence, are more informative about $\boldsymbol\beta$.
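A short sketch of this special case, with made-up standard deviations treated as known:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.uniform(0.0, 4.0, size=N)])
sigma_n = 0.5 + 0.5 * X[:, 1]                 # assumed known standard deviations
y = X @ np.array([1.0, 0.25]) + sigma_n * rng.normal(size=N)

# Divide each observation by its standard deviation, then apply OLS;
# this minimizes the weighted sum of squared residuals above.
beta_wls = np.linalg.lstsq(X / sigma_n[:, None], y / sigma_n, rcond=None)[0]
```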

Faced with AR(1) serial correlation in a time series, the appropriate choice of $\mathbf{A}$ transforms each data point (except the first) into quasi-differences:

$$y_n^* = y_n - \rho\,y_{n-1}, \qquad x_{nk}^* = x_{nk} - \rho\,x_{n-1,k}, \qquad n = 2, \dots, N.$$

The transformed $y_n^*$ display zero covariances:

$$\operatorname{Cov}[y_n^*, y_m^*] = \operatorname{Cov}[y_n, y_m] - \rho\operatorname{Cov}[y_{n-1}, y_m] - \rho\operatorname{Cov}[y_n, y_{m-1}] + \rho^2\operatorname{Cov}[y_{n-1}, y_{m-1}] = 0, \qquad m \neq n,$$

using (2) for the first and third terms on the right-hand side. This transformation uncovers the new or additional information available in each observation, whereas OLS treats highly correlated observations the same way as uncorrelated observations, giving the former relatively too much weight in that estimator.
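In code, the quasi-differencing transform might look like the following sketch; the function name and the scaling of the first observation by $\sqrt{1-\rho^2}$ (so its variance matches the rest) are our choices:

```python
import numpy as np

def ar1_quasi_difference(v, rho):
    # Scale the first observation so its variance matches the others;
    # replace the rest with quasi-differences v_n - rho * v_{n-1}.
    out = np.empty(len(v))
    out[0] = np.sqrt(1.0 - rho ** 2) * v[0]
    out[1:] = v[1:] - rho * v[:-1]
    return out

# Usage: transform y and every column of X identically, then run OLS.
# y_star = ar1_quasi_difference(y, rho)
# X_star = np.column_stack([ar1_quasi_difference(c, rho) for c in X.T])
```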

The panel data model has a simple GLS transformation as well:

$$y_{nt}^* = y_{nt} - \theta\,\bar{y}_n, \qquad \bar{y}_n = \frac{1}{T}\sum_{s=1}^{T} y_{ns}, \qquad \theta = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T\,\sigma_\alpha^2}},$$

with the same transformation applied to the explanatory variables.

If there is no serial correlation, then $\sigma_\alpha^2 = 0$, $\theta = 0$, and $y_{nt}^* = y_{nt}$. Conversely, the greater $\sigma_\alpha^2$ is, the more important the individual average $\bar{y}_n$ becomes. Like the AR(1) case, a weighted difference removes the covariance among the original $y_{nt}$. In this case, however, a common time-series sample average appears in every difference, reflecting the equal covariance structure.
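A sketch of the corresponding computation, in our notation and with the variance components treated as known:

```python
import numpy as np

def random_effects_transform(v, sigma_u2, sigma_a2):
    # v: array of shape (N, T), one row per individual.
    # theta weights the individual average subtracted from each observation.
    T = v.shape[1]
    theta = 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_a2))
    return v - theta * v.mean(axis=1, keepdims=True)
```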

Note that the GLS estimator is an instrumental variables (IV) estimator,

$$\hat{\boldsymbol\beta}_{\mathrm{IV}} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y},$$

for an $N \times K$ matrix $\mathbf{Z}$ of instrumental variables such that $\mathbf{Z}'\mathbf{X}$ is invertible. For GLS, $\mathbf{Z} = \boldsymbol\Sigma^{-1}\mathbf{X}$. Researchers use instrumental variables estimators to overcome omission of explanatory variables in models of the form

$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon,$$

where $\boldsymbol\varepsilon$ is an unobserved error term. Even though $E[\boldsymbol\varepsilon] = \mathbf{0}$, correlation between the explanatory variables in $\mathbf{x}_n$ and $\varepsilon_n$ biases $\hat{\boldsymbol\beta}_{\mathrm{OLS}}$; the IV estimator is employed to overcome this bias by using instrumental variables, the variables in $\mathbf{Z}$, that are uncorrelated with $\boldsymbol\varepsilon$ yet correlated with the explanatory variables. In some cases of the linear model, the GLS estimator provides such instrumental variables. If, for example, $\mathbf{x}_n$ includes the lagged value of $y_n$ in a time-series application, then residual serial correlation usually invalidates the OLS estimator, while GLS still produces an estimator for $\boldsymbol\beta$.

In the panel data setting, particular concern about the behavior of the unobserved individual effect $\alpha_n$ has led researchers to compare the GLS estimator with another IV estimator. The concern is that the expected value of $\alpha_n$ may vary with some of the observed explanatory variables in $\mathbf{x}_{nt}$. Various observable characteristics of individuals or households are typically correlated, so one would expect the unobserved characteristics captured in $\alpha_n$ to be correlated with the observed characteristics in $\mathbf{x}_{nt}$ as well.

In this situation, the OLS- and GLS-fitted coefficients are not estimators for $\boldsymbol\beta$ because these fitted coefficients pick up the influence of the $\alpha_n$ omitted as explanatory variables. An IV estimator of $\boldsymbol\beta$ that is robust to such correlation is the so-called fixed effects estimator. This estimator replaces each observation with its deviation from the corresponding individual average, eliminating the $\alpha_n$:

$$\hat{\boldsymbol\beta}_{\mathrm{FE}} = \left[\sum_{n=1}^{N}\sum_{t=1}^{T}(\mathbf{x}_{nt}-\bar{\mathbf{x}}_n)(\mathbf{x}_{nt}-\bar{\mathbf{x}}_n)'\right]^{-1}\sum_{n=1}^{N}\sum_{t=1}^{T}(\mathbf{x}_{nt}-\bar{\mathbf{x}}_n)(y_{nt}-\bar{y}_n).$$

In the special case when $\sigma_u^2 = 0$ (so that $\theta = 1$), the fixed effects and GLS estimators are equal. The GLS estimator is often called the random effects estimator in this context, and the difference between the fixed-effects and random-effects estimators is often used as a diagnostic test for the reliability of GLS estimation (Hausman 1978).
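For comparison, a sketch of the fixed effects computation via the within transformation; the array shapes and function name are our assumptions:

```python
import numpy as np

def fixed_effects(y, X):
    # y: (N, T); X: (N, T, K). Demeaning by individual averages removes
    # the alpha_n; OLS on the demeaned data gives the fixed effects estimator.
    y_dm = (y - y.mean(axis=1, keepdims=True)).ravel()
    X_dm = (X - X.mean(axis=1, keepdims=True)).reshape(-1, X.shape[2])
    return np.linalg.lstsq(X_dm, y_dm, rcond=None)[0]
```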

The OLS and GLS estimators are equal for a general $\boldsymbol\Sigma$ if the GLS instrument matrix $\boldsymbol\Sigma^{-1}\mathbf{X}$ produces the same set of fitted values as the explanatory variable matrix $\mathbf{X}$; that is, if

$$\boldsymbol\Sigma^{-1}\mathbf{X} = \mathbf{X}\mathbf{B} \quad \text{for some nonsingular } K \times K \text{ matrix } \mathbf{B}.$$

A practical situation in which this occurs approximately is when AR(1) serial correlation is accompanied by explanatory variables that are powers of $n$ or trigonometric functions of $n$. Another example arises when all covariances are equal (and not necessarily zero) and the regression function includes an intercept (or constant term), as it usually does. A third example is the case of SUR where the explanatory variables are identical for all equations, so that

$$\mathbf{X} = \mathbf{I}_J \otimes \mathbf{X}_0, \qquad \boldsymbol\Sigma = \boldsymbol\Omega \otimes \mathbf{I}_N, \qquad \boldsymbol\Sigma^{-1}\mathbf{X} = \mathbf{X}\,(\boldsymbol\Omega^{-1} \otimes \mathbf{I}_K),$$

where $\boldsymbol\Omega$ is the $J \times J$ variance-covariance matrix of the dependent variables within a single observational unit.

FEASIBLE METHODS

Feasible inference for $\boldsymbol\beta$ in the general linear model typically must overcome the fact that $\boldsymbol\Sigma$ is unknown. There are two popular strategies: (1) specify $\boldsymbol\Sigma$ as a function of a few parameters that can be replaced with estimators, and (2) use heteroskedasticity-consistent variance estimators.

The AR(1) serial correlation model illustrates the first approach. A natural estimator for the autocorrelation parameter $\rho$ is the fitted OLS coefficient $\hat\rho$ for predicting the OLS-fitted residual $\hat\varepsilon_n = y_n - \mathbf{x}_n'\hat{\boldsymbol\beta}_{\mathrm{OLS}}$ with the explanatory variable $\hat\varepsilon_{n-1} = y_{n-1} - \mathbf{x}_{n-1}'\hat{\boldsymbol\beta}_{\mathrm{OLS}}$, the lagged OLS-fitted residual:

$$\hat\rho = \frac{\sum_{n=2}^{N} \hat\varepsilon_n\,\hat\varepsilon_{n-1}}{\sum_{n=2}^{N} \hat\varepsilon_{n-1}^2}.$$

Replacing $\rho$ with $\hat\rho$ in $\boldsymbol\Sigma$ yields the feasible GLS (FGLS) estimator

$$\hat{\boldsymbol\beta}_{\mathrm{FGLS}} = (\mathbf{X}'\hat{\boldsymbol\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol\Sigma}^{-1}\mathbf{y}, \qquad \hat{\boldsymbol\Sigma} = \boldsymbol\Sigma(\hat\rho).$$

In large samples, the

differences between the feasible and infeasible versions are negligible. In small samples, many researchers use an estimator that requires iterative calculations to find a $\hat\rho$ and $\hat{\boldsymbol\beta}$ that are mutually consistent: the fitted residuals produced by $\hat{\boldsymbol\beta}$ yield $\hat\rho$, and the variance-covariance matrix produced by $\hat\rho$ yields $\hat{\boldsymbol\beta}$ as the fitted FGLS coefficients. Maximum likelihood estimators, based on an additional assumption that the $y_n$ possess a joint multivariate normal distribution, are leading examples of such estimators.
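A sketch of such an iteration for the AR(1) case (a Cochrane-Orcutt-style loop; the tolerance, iteration cap, and handling of the first observation are our choices):

```python
import numpy as np

def iterated_fgls_ar1(y, X, tol=1e-8, max_iter=100):
    # Alternate until rho-hat and beta-hat are mutually consistent:
    # residuals from beta-hat yield rho-hat; GLS given rho-hat yields beta-hat.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from OLS
    rho = 0.0
    for _ in range(max_iter):
        e = y - X @ beta
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
        s = np.sqrt(1.0 - rho_new ** 2)           # rescales the first observation
        y_s = np.concatenate([[s * y[0]], y[1:] - rho_new * y[:-1]])
        X_s = np.vstack([s * X[:1], X[1:] - rho_new * X[:-1]])
        beta_new = np.linalg.lstsq(X_s, y_s, rcond=None)[0]
        if abs(rho_new - rho) < tol and np.allclose(beta_new, beta, atol=tol):
            break
        beta, rho = beta_new, rho_new
    return beta_new, rho_new
```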

We will use the pure heteroskedasticity case to illustrate heteroskedasticity-consistent variance estimators. The unknown term in $\operatorname{Var}[\hat{\boldsymbol\beta}_{\mathrm{OLS}}]$ (shown in (3)) can be written as a sample average:

$$\frac{1}{N}\,\mathbf{X}'\boldsymbol\Sigma\mathbf{X} = \frac{1}{N}\sum_{n=1}^{N} \sigma_n^2\,\mathbf{x}_n\mathbf{x}_n',$$

where $\sigma_n^2 = \sigma_{nn}$, the $n$th diagonal element of $\boldsymbol\Sigma$. In a heteroskedasticity-consistent variance estimator this average is replaced by

$$\frac{1}{N}\sum_{n=1}^{N} \hat\varepsilon_n^2\,\mathbf{x}_n\mathbf{x}_n', \qquad \hat\varepsilon_n = y_n - \mathbf{x}_n'\hat{\boldsymbol\beta}_{\mathrm{OLS}},$$

so that the unknown variances $\sigma_n^2$ are replaced by the squared OLS-fitted residuals. Such estimators do not require a parametric model for $\boldsymbol\Sigma$ and, hence, are more widely applicable. Their justification rests, in part, on

$$E[\varepsilon_n^2\,\mathbf{x}_n\mathbf{x}_n'] = \sigma_n^2\,\mathbf{x}_n\mathbf{x}_n',$$

so that one can show that

$$\frac{1}{N}\sum_{n=1}^{N}\hat\varepsilon_n^2\,\mathbf{x}_n\mathbf{x}_n' - \frac{1}{N}\sum_{n=1}^{N}\sigma_n^2\,\mathbf{x}_n\mathbf{x}_n' \;\longrightarrow\; \mathbf{0}$$

in large samples; the heteroskedasticity-consistent variance estimator replaces the unknown $\boldsymbol\beta$ with its estimator $\hat{\boldsymbol\beta}_{\mathrm{OLS}}$. This variance-covariance estimator is often called the "Eicker-White estimator," for Friedhelm Eicker and Halbert White.
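The computation is compact; a sketch of the sandwich calculation (the function name is ours):

```python
import numpy as np

def eicker_white_vcov(y, X):
    # Sandwich estimator: (X'X)^{-1} (sum_n e_n^2 x_n x_n') (X'X)^{-1},
    # with squared OLS residuals standing in for the unknown sigma_n^2.
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ (X.T @ y))     # OLS fitted residuals
    meat = (X * (e ** 2)[:, None]).T @ X
    return XtX_inv @ meat @ XtX_inv
```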

The heteroskedasticity-consistent variance estimator does not yield a direct counterpart to $\hat{\boldsymbol\beta}_{\mathrm{FGLS}}$. Nevertheless, estimators that dominate OLS are available. The transformed linear model

$$\frac{y_n}{w_n} = \left(\frac{\mathbf{x}_n}{w_n}\right)'\boldsymbol\beta + \frac{\varepsilon_n}{w_n}, \qquad n = 1, \dots, N,$$

for known, nonrandom weights $w_n$, has a corresponding variance-covariance matrix

$$(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\left(\sum_{n=1}^{N}\frac{\sigma_n^2}{w_n^4}\,\mathbf{x}_n\mathbf{x}_n'\right)(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}, \qquad \mathbf{W} = \operatorname{diag}(w_1^{-2}, \dots, w_N^{-2}),$$

which has a heteroskedasticity-consistent counterpart

$$(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\left(\sum_{n=1}^{N}\frac{\hat\varepsilon_n^2}{w_n^4}\,\mathbf{x}_n\mathbf{x}_n'\right)(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1},$$

and the FGLS analogue

$$\hat{\boldsymbol\beta}_{W} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}.$$

The heteroskedasticity-consistent variance estimator has been extended to cover time-series cases with nonzero covariances as well. For example, if only first-order covariances are nonzero, then

$$\frac{1}{N}\,\mathbf{X}'\boldsymbol\Sigma\mathbf{X} = \frac{1}{N}\sum_{n=1}^{N}\sigma_{nn}\,\mathbf{x}_n\mathbf{x}_n' + \frac{1}{N}\sum_{n=2}^{N}\sigma_{n,n-1}\left(\mathbf{x}_n\mathbf{x}_{n-1}' + \mathbf{x}_{n-1}\mathbf{x}_n'\right)$$

because $\sigma_{n,n-j} = 0$ for $j > 1$. This term in the OLS variance-covariance matrix can be estimated by

$$\frac{1}{N}\sum_{n=1}^{N}\hat\varepsilon_n^2\,\mathbf{x}_n\mathbf{x}_n' + \frac{1}{N}\sum_{n=2}^{N}\hat\varepsilon_n\hat\varepsilon_{n-1}\left(\mathbf{x}_n\mathbf{x}_{n-1}' + \mathbf{x}_{n-1}\mathbf{x}_n'\right),$$

a heteroskedasticity and autocorrelation consistent (HAC) variance-covariance matrix estimator. This works because the second average behaves much like the first in that

$$E[\varepsilon_n\varepsilon_{n-1}\,\mathbf{x}_n\mathbf{x}_{n-1}'] = \sigma_{n,n-1}\,\mathbf{x}_n\mathbf{x}_{n-1}',$$

so that one can show that

$$\frac{1}{N}\sum_{n=2}^{N}\hat\varepsilon_n\hat\varepsilon_{n-1}\left(\mathbf{x}_n\mathbf{x}_{n-1}' + \mathbf{x}_{n-1}\mathbf{x}_n'\right)$$

is an estimator for the second term.

One can extend the HAC approach to cover $m$-dependence, in which only covariances up to order $m$ are nonzero for a finite $m$. However, in practice $m$ should be small relative to the number of observations $N$. To illustrate the difficulties with large $m$, consider setting $m = N - 1$ so that all of the covariances in $\boldsymbol\Sigma$ are replaced by a product of OLS-fitted residuals. Then this approach yields the estimator

$$\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{N}\hat\varepsilon_n\hat\varepsilon_m\,\mathbf{x}_n\mathbf{x}_m' = \frac{1}{N}\left(\mathbf{X}'\hat{\boldsymbol\varepsilon}\right)\left(\mathbf{X}'\hat{\boldsymbol\varepsilon}\right)' = \mathbf{0},$$

which fails because the OLS normal equations make $\mathbf{X}'\hat{\boldsymbol\varepsilon} = \mathbf{0}$, so the estimator cannot have the full rank $K$. Nevertheless, the heteroskedasticity-consistent variance-covariance estimator has been generalized to cover situations where all of the covariances may be nonzero. The Newey-West estimator is a popular choice:

$$\frac{1}{N}\sum_{n=1}^{N}\hat\varepsilon_n^2\,\mathbf{x}_n\mathbf{x}_n' + \frac{1}{N}\sum_{j=1}^{m} w_{jm}\sum_{n=j+1}^{N}\hat\varepsilon_n\hat\varepsilon_{n-j}\left(\mathbf{x}_n\mathbf{x}_{n-j}' + \mathbf{x}_{n-j}\mathbf{x}_n'\right),$$

where the Bartlett weights

$$w_{jm} = 1 - \frac{j}{m+1}$$

decline linearly with the lag $j$, and

$$\hat\varepsilon_n = y_n - \mathbf{x}_n'\hat{\boldsymbol\beta}_{\mathrm{OLS}}.$$

The supporting approximate distribution theory requires $m$ to depend on the sample size $N$, and methods for choosing $m$ are available.
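A sketch of the Newey-West computation with Bartlett weights (the function and its arguments are our illustrative choices; `e` denotes the OLS fitted residuals):

```python
import numpy as np

def newey_west_meat(e, X, m):
    # Estimate (1/N) X' Sigma X with Bartlett weights w_jm = 1 - j/(m+1).
    N = len(e)
    S = (X * (e ** 2)[:, None]).T @ X / N              # j = 0 term
    for j in range(1, m + 1):
        w = 1.0 - j / (m + 1.0)
        G = (X[j:] * (e[j:] * e[:-j])[:, None]).T @ X[:-j]
        S += w * (G + G.T) / N                          # add lag-j cross terms
    return S
```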

Often statistical inference for $\boldsymbol\beta$ based upon estimation of $\boldsymbol\Sigma$ or $\mathbf{X}'\boldsymbol\Sigma\mathbf{X}$ can treat these terms as equal to the objects that they estimate. For example, the statistical distribution theory typically shows that

$$(\mathbf{R}\hat{\boldsymbol\beta}_{\mathrm{GLS}} - \mathbf{r})'\left[\mathbf{R}\,(\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{X})^{-1}\mathbf{R}'\right]^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_{\mathrm{GLS}} - \mathbf{r})$$

is approximately (or exactly) distributed as a chi-squared random variable under the null hypothesis $\mathbf{R}\boldsymbol\beta = \mathbf{r}$. This pivotal statistic yields a hypothesis test or confidence interval for $\mathbf{R}\boldsymbol\beta$. In large samples, the feasible counterpart

$$(\mathbf{R}\hat{\boldsymbol\beta}_{\mathrm{FGLS}} - \mathbf{r})'\left[\mathbf{R}\,(\mathbf{X}'\hat{\boldsymbol\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{R}'\right]^{-1}(\mathbf{R}\hat{\boldsymbol\beta}_{\mathrm{FGLS}} - \mathbf{r})$$

may be treated as an equivalent statistic. Researchers have shown that bootstrap methods, appropriately applied, can provide better probability approximations in situations with small sample sizes.
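A sketch of the pivotal computation (names are ours; $\mathbf{R}$ is assumed to have full row rank):

```python
import numpy as np
from scipy import stats

def wald_test(beta_hat, vcov, R, r):
    # Quadratic form (R b - r)' [R V R']^{-1} (R b - r), referred to a
    # chi-squared distribution with rank(R) degrees of freedom.
    d = R @ beta_hat - r
    W = d @ np.linalg.solve(R @ vcov @ R.T, d)
    return W, stats.chi2.sf(W, df=R.shape[0])
```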
