Although the mean is the single most useful quantity for describing a data set, it is not
always the best way of measuring the center of the data. A major problem with the mean
is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values
can corrupt the mean. For example, the mean salary at a company may be substantially
pushed up by that of a few highly paid managers. Similarly, the mean score of a class in
an exam could be pulled down quite a bit by a few very low scores. To offset the effect
caused by a small number of extreme values, we can instead use the trimmed mean,
which is the mean obtained after chopping off values at the high and low extremes. For
example, we can sort the values observed for salary and remove the top and bottom 2%
before computing the mean. We should avoid trimming too large a portion (such as
20%) at both ends, as this can result in the loss of valuable information.
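As a minimal sketch of the trimming idea, the Python function below sorts the values, drops a fixed proportion from each end, and averages what remains. The function name `trimmed_mean`, the `proportion` parameter, and the rule of rounding the number of dropped values down are assumptions made for illustration, not a prescribed procedure.

```python
def trimmed_mean(values, proportion=0.02):
    """Mean of the values left after dropping `proportion` of them at each end.

    `proportion` is a hypothetical parameter name; 0.02 mirrors the 2%
    mentioned in the text. The count dropped per end is rounded down.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    if len(values) == 0:
        raise ValueError("trimmed mean of an empty data set is undefined")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Illustrative data: one extreme value dominates the plain mean.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
plain = sum(data) / len(data)        # 104.5
trimmed = trimmed_mean(data, 0.1)    # 5.5 (drops 1 and 1000)
```

With these made-up values, trimming a single observation from each end moves the result from 104.5 to 5.5, which is much closer to where most of the data lie.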
For skewed (asymmetric) data, a better measure of the center of data is the median,
which is the middle value in a set of ordered data values. It is the value that separates the
higher half of a data set from the lower half.
In probability and statistics, the median generally applies to numeric data; however,
we may extend the concept to ordinal data. Suppose that a given data set of N values
for an attribute X is sorted in increasing order. If N is odd, then the median is the
middle value of the ordered set. If N is even, then the median is not unique; it is the two
middlemost values and any value in between. If X is a numeric attribute in this case, by
convention, the median is taken as the average of the two middlemost values.
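The odd and even cases can be sketched in Python as follows. The helper names `middle_values` and `numeric_median` are hypothetical; the first returns the one or two middlemost values (which also works for ordinal data that are already sorted), and the second applies the averaging convention for numeric attributes.

```python
def middle_values(sorted_values):
    """Return the middlemost value(s) of an already-sorted sequence as a pair.

    For odd N both elements of the pair are the same value; for even N
    they are the two middlemost values.
    """
    n = len(sorted_values)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return (sorted_values[mid], sorted_values[mid])
    return (sorted_values[mid - 1], sorted_values[mid])


def numeric_median(values):
    """Median of numeric values: the average of the middlemost value(s)."""
    low, high = middle_values(sorted(values))
    return (low + high) / 2


odd_case = numeric_median([3, 1, 2])         # middle value of 1, 2, 3
even_case = numeric_median([4, 1, 3, 2])     # average of 2 and 3
# Ordinal data, assumed to be listed in rank order already.
ordinal_case = middle_values(["low", "low", "medium", "high"])
```

For ordinal data the two middlemost values cannot be averaged, so the sketch simply reports both of them.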
Example 2.7 Median. Let's find the median of the data from Example 2.6. The data are already sorted
in increasing order. There is an even number of observations (i.e., 12); therefore, the
median is not unique. It can be any value within the two middlemost values of 52 and
56 (that is, within the sixth and seventh values in the list). By convention, we assign the
average of the two middlemost values as the median; that is,
\( \frac{52 + 56}{2} = \frac{108}{2} = 54 \). Thus,
the median is $54,000.
Suppose that we had only the first 11 values in the list. Given an odd number of
values, the median is the middlemost value. This is the sixth value in this list, which has
a value of $52,000.
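The arithmetic in this example can be checked with Python's standard `statistics` module. The list below is an assumption: Example 2.6 is not reproduced in this section, so the values (in thousands of dollars) are only chosen to be consistent with what is stated here, namely 12 sorted observations whose sixth and seventh values are 52 and 56.

```python
import statistics

# Assumed salary values in thousands of dollars; only the count (12) and the
# sixth and seventh values (52 and 56) are taken from the text.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Even number of observations: average of the two middlemost values.
median_all = statistics.median(salaries)

# Odd number of observations (first 11 values): the middlemost value itself.
median_first_11 = statistics.median(salaries[:11])
```

Under these assumed values, `median_all` is 54 and `median_first_11` is 52, matching the $54,000 and $52,000 obtained above.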
The median is expensive to compute when we have a large number of observations.
For numeric attributes, however, we can easily approximate the value. Assume that data
are grouped in intervals according to their \(x_i\) data values and that the frequency (i.e.,
number of data values) of each interval is known. For example, employees may be
grouped according to their annual salary in intervals such as $10,000-20,000, $20,000-30,000,
and so on. Let the interval that contains the median frequency be the median
interval. We can approximate the median of the entire data set (e.g., the median salary) by
interpolation using the formula
\[
\text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \text{width}, \tag{2.3}
\]
where \(L_1\) is the lower boundary of the median interval, \(N\) is the number of values in
the entire data set, \(\left(\sum \text{freq}\right)_l\) is the sum of the frequencies of all of the intervals that are
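
The interpolation can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `approximate_median` and the `(lower, upper, freq)` tuple layout are hypothetical, the intervals are assumed to be contiguous and sorted by lower boundary, the frequency of the median interval is used as the denominator, and width is taken to be the upper boundary minus the lower boundary of that interval.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    `intervals` is a list of (lower, upper, freq) tuples, assumed sorted by
    lower boundary and contiguous. The median interval is taken to be the
    first interval at which the cumulative frequency reaches N / 2.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative = 0  # sum of frequencies of the intervals already passed
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            width = upper - lower
            return lower + (half - cumulative) / freq * width
        cumulative += freq


# Made-up frequencies for salary intervals, for illustration only.
grouped = [
    (10_000, 20_000, 200),
    (20_000, 30_000, 450),
    (30_000, 40_000, 300),
    (40_000, 50_000, 50),
]
estimate = approximate_median(grouped)
```

With these made-up frequencies, N is 1000, the median interval is $20,000-30,000, and the estimate is 20,000 + (500 - 200)/450 × 10,000, roughly $26,667. Only the interval boundaries and counts are needed, so the individual values never have to be sorted.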
 