the influence matrix, since its elements indicate the influence of the data on the
regression fit. The matrix elements have also been referred to as the leverage of
the data points: in the case of high leverage, a unit change in the y-value will strongly
perturb the fit (Hoaglin and Welsch 1978). Concepts related to the influence matrix
also provide diagnostics on the change that would occur by leaving one data point
out, and on the effective information content (degrees of freedom for signal) in the data.
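The leave-one-out diagnostic mentioned above can be checked numerically. The sketch below (synthetic data, illustrative sizes) verifies the classical identity that the residual obtained by refitting without observation i equals e_i/(1 - S_ii), where S is the hat matrix defined formally in Sect. 4.2 — so a high-leverage point (S_ii near 1) strongly disturbs the fit:

```python
import numpy as np

# Synthetic regression problem: m observations, q predictors (illustrative values).
rng = np.random.default_rng(2)
m, q = 15, 2
X = rng.standard_normal((m, q))
y = X @ np.array([1.0, -2.0]) + 0.2 * rng.standard_normal(m)

# Hat matrix S = X (X^T X)^{-1} X^T and ordinary residuals e = y - S y.
S = X @ np.linalg.inv(X.T @ X) @ X.T
resid = y - S @ y

# Leave observation i out, refit, and predict the held-out point.
i = 3
mask = np.arange(m) != i
beta_loo, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
loo_resid = y[i] - X[i] @ beta_loo

# Classical identity: leave-one-out residual = e_i / (1 - S_ii).
assert np.isclose(loo_resid, resid[i] / (1 - S[i, i]))
```

This identity is what makes the diagonal of the influence matrix a cheap leave-one-out diagnostic: no refitting is actually required.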
These influence matrix diagnostics are explained in Sect. 4.2 for ordinary least-
squares regression. In Sect. 4.3 the corresponding concepts for linear statistical DA
schemes are derived. It will be shown that observational influence and background
influence complement each other. Thus, for any observation y_i, either very large or
very small influence could be a sign of inadequacy in the assimilation scheme,
and may require further investigation. A practical approximate method that enables
calculation of the diagonal elements of the influence matrix for large-dimension
variational schemes (such as ECMWF's operational 4D-Var system) is described
in Cardinali et al. (2004) and therefore not shown here. In Sect. 4.4 results and
selected examples related to data influence diagnostics are presented, including an
investigation into the effective information content in several of the main types of
observational data. Conclusions are drawn in Sect. 4.5.
4.2 Classical Statistical Definition of Influence Matrix and Self-Sensitivity
The ordinary linear regression model can be written:

y = Xβ + ε    (4.1)

where y is an m × 1 vector for the response variable (predictand); X is an m × q
matrix of q predictors; β is a q × 1 vector of parameters to be estimated (the
regression coefficients); and ε is an m × 1 vector of errors (or fluctuations) with
expectation E(ε) = 0 and covariance var(ε) = σ²I_m (that is, uncorrelated
observation errors). In fitting the model (4.1) by LS, the number of observations m
has to be greater than the number of parameters q in order to have a well-posed
problem, and X is assumed to have full rank q.
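As a minimal numerical sketch of model (4.1) and its LS fit (synthetic data; the sizes and coefficient values are illustrative assumptions):

```python
import numpy as np

# Hypothetical data: m = 50 observations, q = 3 predictors.
rng = np.random.default_rng(0)
m, q = 50, 3
X = rng.standard_normal((m, q))          # full-rank predictor matrix
beta_true = np.array([2.0, -1.0, 0.5])   # illustrative true coefficients
eps = 0.1 * rng.standard_normal(m)       # uncorrelated errors, var(eps) = sigma^2 I_m
y = X @ beta_true + eps                  # the model y = X beta + eps   (4.1)

assert m > q  # well-posed: more observations than parameters
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true when the noise is small
```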
The LS method provides the solution of the regression equation as
β̂ = (XᵀX)⁻¹Xᵀy. The fitted (or estimated) response vector ŷ is thus:

ŷ = Sy    (4.2)

where

S = X(XᵀX)⁻¹Xᵀ    (4.3)

is the m × m influence matrix (or hat matrix). It is easily seen that

S = ∂ŷ/∂y    (4.4)
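A short numpy sketch (synthetic data, illustrative sizes) can verify the standard properties of the hat matrix in (4.2)–(4.4): it is a symmetric idempotent projection, its trace equals q (the degrees of freedom for signal), its diagonal gives the self-sensitivities, and its columns are exactly the derivative ∂ŷ/∂y:

```python
import numpy as np

rng = np.random.default_rng(1)
m, q = 20, 4
X = rng.standard_normal((m, q))
y = X @ rng.standard_normal(q) + 0.1 * rng.standard_normal(m)

# Influence (hat) matrix S = X (X^T X)^{-1} X^T   (4.3)
S = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = S @ y                              # fitted response, y_hat = S y   (4.2)

leverage = np.diag(S)                      # self-sensitivity of each data point
assert np.allclose(S, S.T)                 # symmetric
assert np.allclose(S @ S, S)               # idempotent: S is a projection
assert np.isclose(np.trace(S), q)          # trace(S) = q, degrees of freedom for signal
assert np.all(leverage > -1e-10) and np.all(leverage < 1 + 1e-10)  # 0 <= S_ii <= 1

# S = d(y_hat)/dy (4.4): a unit perturbation of y_0 changes y_hat by column S[:, 0].
delta = np.zeros(m)
delta[0] = 1.0
assert np.allclose(S @ (y + delta) - y_hat, S[:, 0])
```

Because ŷ is linear in y, the derivative in (4.4) is exact, not a finite-difference approximation; the last assertion makes that concrete.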