coordinates (e.g., when clustering houses), and monetary quantities (e.g., you are 100
times richer with $100 than with $1).
2.1.6 Discrete versus Continuous Attributes
In our presentation, we have organized attributes into nominal, binary, ordinal, and
numeric types. There are many ways to organize attribute types. The types are not
mutually exclusive.
Classification algorithms developed from the field of machine learning often talk of
attributes as being either discrete or continuous. Each type may be processed differently.
A discrete attribute has a finite or countably infinite set of values, which may or may not
be represented as integers. The attributes hair color, smoker, medical test, and drink size
each have a finite number of values, and so are discrete. Note that discrete attributes
may have numeric values, such as 0 and 1 for binary attributes or the values 0 to 110 for
the attribute age. An attribute is countably infinite if the set of possible values is infinite
but the values can be put in a one-to-one correspondence with the natural numbers. For
example, the attribute customer ID is countably infinite: the number of customers can
grow without bound, but the values can still be enumerated one by one (that is, put in
one-to-one correspondence with the natural numbers). Zip codes are another example.
If an attribute is not discrete, it is continuous. The terms numeric attribute and
continuous attribute are often used interchangeably in the literature. (This can be confusing
because, in the classic sense, continuous values are real numbers, whereas numeric values
can be either integers or real numbers.) In practice, real values are represented
using a finite number of digits. Continuous attributes are typically represented as
floating-point variables.
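The distinction above can be illustrated with a small Python sketch. The function name
attribute_kind and the rule it applies (a column counts as continuous only if it holds at
least one non-integral floating-point value) are assumptions made for illustration. This is
a rough heuristic and not a standard definition, since a real-valued attribute may happen
to contain only whole numbers.

```python
def attribute_kind(values):
    """Heuristic guess at the attribute type (illustrative assumption only).

    Returns "continuous" if any value is a float with a fractional part,
    otherwise "discrete" (strings, booleans, and integers all count as discrete).
    """
    for v in values:
        if isinstance(v, float) and not v.is_integer():
            return "continuous"
    return "discrete"


# Hypothetical sample columns, named after attributes mentioned in the text.
hair_color = ["brown", "black", "blond"]
smoker = [0, 1, 1, 0]
age = [23, 41, 67]
body_temperature = [36.6, 37.2, 38.1]  # hypothetical real-valued attribute

for name, column in [("hair_color", hair_color), ("smoker", smoker),
                     ("age", age), ("body_temperature", body_temperature)]:
    print(name, attribute_kind(column))
```

In practice, whether an attribute is treated as discrete or continuous is usually decided
from knowledge of the domain rather than inferred from the stored values.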
2.2 Basic Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have an overall picture of your
data. Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
This section discusses three areas of basic statistical descriptions. We start with
measures of central tendency (Section 2.2.1), which measure the location of the middle or
center of a data distribution. Intuitively speaking, given an attribute, where do most of
its values fall? In particular, we discuss the mean, median, mode, and midrange.
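As a minimal sketch of the four measures just named, the following Python snippet computes
them with the standard-library statistics module. The sample values and the helper name
midrange are assumptions chosen for illustration; the midrange is taken here to be the
average of the smallest and largest values.

```python
from statistics import mean, median, multimode

# Illustrative attribute values (assumed for this sketch), e.g., salaries in thousands.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def midrange(data):
    """Average of the smallest and largest values in the data."""
    return (min(data) + max(data)) / 2


summary = {
    "mean": mean(values),        # arithmetic average
    "median": median(values),    # middle value (average of the two middle values here)
    "modes": multimode(values),  # most frequent value(s); there may be more than one
    "midrange": midrange(values),
}
print(summary)
```

Note that multimode returns a list because a data set can have several modes; this sample
has two values that occur equally often.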
In addition to assessing the central tendency of our data set, we also would like to
have an idea of the dispersion of the data. That is, how are the data spread out? The most
common data dispersion measures are the range, quartiles, and interquartile range; the
five-number summary and boxplots; and the variance and standard deviation of the data.
These measures are useful for identifying outliers and are described in Section 2.2.2.
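The dispersion measures listed above can be sketched in Python as follows. Several choices
here are assumptions rather than anything fixed by the text: the sample values are
illustrative, quartiles are computed with the "inclusive" interpolation method of
statistics.quantiles (other conventions give slightly different quartiles), the variance is
the population variance (dividing by N), and outliers are flagged with the common rule of
thumb of lying more than 1.5 times the interquartile range beyond the first or third quartile.

```python
from statistics import quantiles, pvariance, pstdev

# Illustrative attribute values (assumed for this sketch).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using inclusive quartiles."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


minimum, q1, med, q3, maximum = five_number_summary(values)
value_range = maximum - minimum  # range: largest minus smallest value
iqr = q3 - q1                    # interquartile range

# Assumed rule of thumb: flag values beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

variance = pvariance(values)  # population variance
std_dev = pstdev(values)      # population standard deviation

print("five-number summary:", (minimum, q1, med, q3, maximum))
print("range:", value_range, "IQR:", iqr, "outliers:", outliers)
print("variance:", variance, "standard deviation:", std_dev)
```

For a sample drawn from a larger population, statistics.variance and statistics.stdev
(which divide by N - 1) could be used instead.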
Finally, we can use many graphic displays of basic statistical descriptions to visually
inspect our data (Section 2.2.3). Most statistical or graphical data presentation software