2.5 Summary
Data sets are made up of data objects. A data object represents an entity. Data objects are described by attributes. Attributes can be nominal, binary, ordinal, or numeric.
The values of a nominal (or categorical) attribute are symbols or names of things, where each value represents some kind of category, code, or state.
Binary attributes are nominal attributes with only two possible states (such as 1 and 0 or true and false). If the two states are equally important, the attribute is symmetric; otherwise it is asymmetric.
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
A numeric attribute is quantitative (i.e., it is a measurable quantity) represented in integer or real values. Numeric attribute types can be interval-scaled or ratio-scaled. The values of an interval-scaled attribute are measured in fixed and equal units. Ratio-scaled attributes are numeric attributes with an inherent zero-point. Measurements are ratio-scaled in that we can speak of one value as being a multiple (or ratio) of another value.
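The difference between interval-scaled and ratio-scaled attributes can be illustrated with temperature. The following is a minimal sketch; the two readings are hypothetical values chosen only for illustration. Celsius has no inherent zero-point, so differences are meaningful but ratios are not, whereas Kelvin has a true zero and supports ratios.

```python
# Hypothetical temperature readings, chosen only for illustration.
celsius_a, celsius_b = 10.0, 20.0

# Interval scale: differences are meaningful (equal-sized units).
diff_c = celsius_b - celsius_a

# The naive Celsius ratio is 2.0, but 20 degrees C is not "twice as warm"
# as 10 degrees C, because 0 degrees C is an arbitrary reference point.
naive_ratio = celsius_b / celsius_a

# Kelvin has an inherent zero-point, so it is ratio-scaled.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
true_ratio = kelvin_b / kelvin_a  # only slightly above 1

print(f"difference: {diff_c} degrees")
print(f"naive Celsius ratio: {naive_ratio:.2f}")
print(f"Kelvin ratio: {true_ratio:.4f}")
```

The difference is the same in both units, which is what "fixed and equal units" means; only the ratio changes when the zero-point moves.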
Basic statistical descriptions provide the analytical foundation for data preprocessing. The basic statistical measures for data summarization include mean, weighted mean, median, and mode for measuring the central tendency of data; and range, quantiles, quartiles, interquartile range, variance, and standard deviation for measuring the dispersion of data. Graphical representations (e.g., boxplots, quantile plots, quantile-quantile plots, histograms, and scatter plots) facilitate visual inspection of the data and are thus useful for data preprocessing and mining.
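The measures of central tendency and dispersion listed above can be sketched with Python's standard `statistics` module. The sample values and the helper `weighted_mean` are assumptions made for illustration. Note also that quartile values depend on the interpolation convention; this sketch uses the module's default, which may differ from other conventions.

```python
import statistics

# Hypothetical sample (e.g., salaries in thousands), for illustration only.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)  # a data set may have several modes

# Dispersion
value_range = max(values) - min(values)
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles (default method)
iqr = q3 - q1
variance = statistics.pvariance(values)  # population variance (divide by N)
std_dev = statistics.pstdev(values)      # population standard deviation

print("mean:", mean, "median:", median, "modes:", modes)
print("range:", value_range, "IQR:", iqr)
print("variance:", round(variance, 2), "std dev:", round(std_dev, 2))
```

`pvariance` and `pstdev` divide by N; if the data are a sample from a larger population, `statistics.variance` and `statistics.stdev` (which divide by N - 1) may be the more appropriate choice.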
Data visualization techniques may be pixel-oriented, geometric-based, icon-based, or hierarchical. These methods apply to multidimensional relational data. Additional techniques have been proposed for the visualization of complex data, such as text and social networks.
Measures of object similarity and dissimilarity are used in data mining applications such as clustering, outlier analysis, and nearest-neighbor classification. Such measures of proximity can be computed for each attribute type studied in this chapter, or for combinations of such attributes. Examples include the Jaccard coefficient for asymmetric binary attributes and Euclidean, Manhattan, Minkowski, and supremum distances for numeric attributes. For applications involving sparse numeric data vectors, such as term-frequency vectors, the cosine measure and the Tanimoto coefficient are often used in the assessment of similarity.
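The proximity measures named above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function names and sample vectors are assumptions, vectors are assumed to have equal length, and the cosine and Tanimoto functions assume nonzero vectors.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def manhattan(x, y):
    """Manhattan (city block) distance: Minkowski with h = 1."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance: Minkowski with h = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def supremum(x, y):
    """Supremum distance: the largest difference over any attribute."""
    return max(abs(a - b) for a, b in zip(x, y))


def jaccard(x, y):
    """Jaccard coefficient for asymmetric binary vectors (0/1 values).

    Negative matches (both 0) are ignored. Returning 0.0 when neither
    vector has a 1 is a convention assumed here.
    """
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0


def cosine(x, y):
    """Cosine similarity: x.y / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def tanimoto(x, y):
    """Tanimoto coefficient: x.y / (x.x + y.y - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)


# Hypothetical numeric objects and term-frequency vectors.
p, q = (1, 2), (3, 5)
doc1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print("Euclidean:", round(euclidean(p, q), 2))
print("Manhattan:", manhattan(p, q))
print("supremum:", supremum(p, q))
print("cosine:", round(cosine(doc1, doc2), 2))
print("Tanimoto:", round(tanimoto(doc1, doc2), 3))
```

Manhattan and Euclidean distance are the special cases h = 1 and h = 2 of the Minkowski distance, and the supremum distance is its limit as h grows without bound.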
2.6 Exercises
2.1 Give three additional commonly used statistical measures that are not already illustrated in this chapter for the characterization of data dispersion. Discuss how they can be computed efficiently in large databases.