Fig. 3.2 Heat map of the scaled UD data (white == NaN; variables: OpenSpaceMeshSize ... HemerobyIndex)
3.3.2 Exploring the Distributions of the Individual Variables
The next step of data inspection is to determine the distribution of each individual
variable. Important tools for this inspection are the quantile-quantile plot (QQ-plot)
and kernel estimators of the probability density function (pdf). Here we use the
PDE method for pdf estimation (Ultsch 2003), as it is specifically designed to uncover
subsets in the variables. Consider, for example, the variable SealedSurface. The
left panel of Fig. 3.3 shows the empirically estimated pdf as a blue curve.
The degree of sealed surface in the UD data appears to fall into several
subsets: a small proportion of sealed surface versus medium and higher percentages of
sealed surface. The black lines show a Gaussian mixture model (GMM) of these subset
distributions (Bilmes 1998; Dempster et al. 1977). In this way, a close inspection of
each variable separately offers some initial insights into the data set. The right panel
of Fig. 3.3 gives an indication of the quality of the GMM. Here the straight
line indicates that the model fits the data well.
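The mixture fitting described above can be sketched with the EM algorithm (Dempster et al. 1977). The following is a minimal illustration only, not the authors' implementation: it does not reproduce the PDE estimator, the function names (normal_pdf, fit_gmm_1d) are hypothetical, and the data are synthetic stand-ins for SealedSurface with two assumed subsets.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(data, means, sds, weights, n_iter=200):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    means, sds, weights = list(means), list(sds), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], sds[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sds[j] = max(math.sqrt(var), 1e-6)  # guard against a collapsing component
    return means, sds, weights


# Synthetic stand-in data (NOT the UD data): a low subset and a higher subset.
rng = random.Random(0)
sample = [rng.gauss(10.0, 3.0) for _ in range(300)]
sample += [rng.gauss(50.0, 10.0) for _ in range(200)]

# Assumed starting values: one component at each end of the data range.
gmm_means, gmm_sds, gmm_weights = fit_gmm_1d(
    sample, means=[min(sample), max(sample)], sds=[10.0, 10.0], weights=[0.5, 0.5]
)
```

The number of components and the starting values are assumptions of this sketch. In practice they would be chosen by inspecting the estimated pdf, as the text does for SealedSurface.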
The main goal of inspecting the individual variables is to find out how
they are distributed in comparison to standard distributions. A QQ-plot of the
variable HemerobyIndex versus a standard Gaussian, N(0,1), shows that this variable has
an approximately normal (i.e., Gaussian) distribution (cf. Fig. 3.4).
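A QQ-plot against N(0,1) pairs the sorted sample values with the corresponding quantiles of the standard normal distribution. The sketch below computes these pairs and, as a rough numeric summary, their correlation. The plotting positions (i + 0.5)/n, the function names, and the synthetic sample standing in for HemerobyIndex are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean


def qq_points(data):
    """Return (theoretical N(0,1) quantile, sorted sample value) pairs."""
    n = len(data)
    std_normal = NormalDist(0.0, 1.0)
    # Plotting positions (i + 0.5) / n avoid the undefined quantiles at 0 and 1.
    theo = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theo, sorted(data)))


def qq_correlation(data):
    """Pearson correlation of the QQ points; near 1 means a nearly straight line."""
    pts = qq_points(data)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Synthetic stand-in sample (NOT the UD data), drawn from a Gaussian.
rng = random.Random(1)
hemeroby_like = [rng.gauss(0.0, 1.0) for _ in range(500)]

points = qq_points(hemeroby_like)
r_normal = qq_correlation(hemeroby_like)
```

Plotting the pairs in points gives the QQ-plot itself. The correlation is only a convenient summary of straightness and does not replace looking at the plot.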
The QQ-plots of other variables reveal different types of distribution. While the
variable OpenSpaceMeshSize is clearly not normally distributed, the logarithm
of OpenSpaceMeshSize is nearly normally distributed (cf. Fig. 3.5).
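The effect of the logarithm on a right-skewed variable can also be checked numerically. The sketch below compares the sample skewness before and after the transform. It uses synthetic log-normal data as a stand-in for OpenSpaceMeshSize (an assumption, not a statement about the UD data), and the function name is hypothetical. A skewness near zero is consistent with a normal distribution but does not prove it.

```python
import math
import random
from statistics import mean


def sample_skewness(data):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Synthetic stand-in data (NOT the UD data): strictly positive and right-skewed.
rng = random.Random(2)
mesh_like = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
log_mesh_like = [math.log(x) for x in mesh_like]

raw_skew = sample_skewness(mesh_like)      # clearly positive for the raw values
log_skew = sample_skewness(log_mesh_like)  # close to zero after the transform
```

The logarithm is only defined for strictly positive values, so a variable containing zeros or negative values would need a different transform or an offset.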