Geoscience Reference
In-Depth Information
- Definition of a distance measure for the high-dimensional data
- Projection of the data from the high-dimensional data space onto a visualizable space
3. Classes: Finding intrinsic groups, called clusters, in a data set
- Data clustering/unsupervised classification
4. Classifier: Symbolic classifiers to assist human skills of comprehension
- Machine-generated explanations in the form of rules or decision trees
5. Interpretations: Human understanding of clusters
- Analysis of the spatial distribution of clusters
- Mindful translation of machine-generated explanations
- Cluster labeling/finding spatial abstractions
- Knowledge generation/domain experts gain new insights
6. Testing new insights
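The step "definition of a distance measure" in the outline above can be illustrated with a small sketch. The text does not say which measure is used, so the Euclidean distance on z-standardized variables is assumed here purely as one common choice. The function names and the toy values are hypothetical and are not taken from the UD data set.

```python
import math

def standardize(rows):
    """Z-standardize each column so that no single variable dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = []
    for c, m in zip(cols, means):
        var = sum((v - m) ** 2 for v in c) / len(c)
        # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def distance_matrix(rows):
    """Pairwise Euclidean distances between the standardized data points."""
    z = standardize(rows)
    return [[math.dist(a, b) for b in z] for a in z]

# Hypothetical toy data: 4 districts described by 2 variables.
toy = [[10.0, 200.0], [12.0, 210.0], [50.0, 900.0], [52.0, 880.0]]
D = distance_matrix(toy)
```

In this toy example the first two districts lie close together and far from the last two, which is the kind of structure that the later projection and clustering steps are meant to reveal.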
We will demonstrate this approach on the UD data set described above. Although
this is intended simply as an illustrative example, some results have nevertheless
been obtained. For example, potential subclasses of the variable SealedSurface have
been identified. The approach also identified two types of German coastal urban
districts. Furthermore, it was possible to rediscover a predicted cluster of urban
districts characterized by a dense building structure, fragmented open space, and a
high degree of sealed surface. Most urban districts of this cluster belong to the
official type "highly central" of the spatial monitoring system of the Federal
Institute for Research on Building, Urban Affairs and Spatial Development (2013).
The following semantic is suggested for urban
districts in this cluster: urban districts, regarding density, ecological impacts of soil
sealing, and fragmentation of the urban area.
3.3.1 Initial Data Inspection
The first and most important step in a knowledge discovery process is to gain an
initial overview by inspecting the data set as a whole and closely reviewing each
variable individually. To gain an overview of the data, a heat map of the entire data
set can be made (Wilkinson and Friendly 2009). A heat map displays each data point
as an area of colored pixels in a matrix, with the colors reflecting the data values.
In particular, missing values can be given a unique color (white) so that the
number and distribution of data gaps can be clearly seen. Figure 3.2 gives such an
overview of the UD data scaled to percent and ordered by the official district key
(01001 ... 16056). This data set contains no missing values, which would normally be
coded as “NaN” (IEEE 754-1985). No obvious structures can be identified in this heat
map, for example, in the ordering of the data.
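The inspection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to produce Figure 3.2: "scaled to percent" is read here as min-max scaling of each variable to the range 0 to 100, characters stand in for colors, a blank plays the role of the white "missing" color, and the function names and toy values are hypothetical. The rows are assumed to be already sorted by district key.

```python
import math

def scale_to_percent(column):
    """Min-max scale one variable to 0..100, leaving NaN entries untouched."""
    present = [v for v in column if not math.isnan(v)]
    if not present:          # a column with only missing values stays as it is
        return list(column)
    lo, hi = min(present), max(present)
    span = (hi - lo) or 1.0  # constant column: avoid division by zero
    return [v if math.isnan(v) else 100.0 * (v - lo) / span for v in column]

def text_heat_map(rows, shades=".:-=+*#%@"):
    """Render rows (data points) x columns (variables) as one character per cell."""
    cols = [scale_to_percent(list(c)) for c in zip(*rows)]
    lines = []
    for row in zip(*cols):
        chars = []
        for v in row:
            if math.isnan(v):
                chars.append(" ")  # blank marks a missing value
            else:
                idx = min(int(v / 100.0 * len(shades)), len(shades) - 1)
                chars.append(shades[idx])
        lines.append("".join(chars))
    return lines

def count_missing(rows):
    """Number of NaN-coded cells in the whole data set."""
    return sum(math.isnan(v) for row in rows for v in row)

# Hypothetical toy data: 3 districts x 3 variables, one missing value.
toy = [
    [0.0, 5.0, 30.0],
    [50.0, float("nan"), 10.0],
    [100.0, 15.0, 20.0],
]
heat = text_heat_map(toy)
n_missing = count_missing(toy)
print("\n".join(heat))
print("missing values:", n_missing)
```

For a real data set, a plotting library with a proper color map would replace the character shades, but the two checks stay the same: how many values are missing, and whether any block structure shows up in the chosen row ordering.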