Getting to Know Your Data - Data Mining: Concepts and Techniques


Although the mean is the single most useful quantity for describing a data set, it is not

always the best way of measuring the center of the data. A major problem with the mean

is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values

can corrupt the mean. For example, the mean salary at a company may be substantially

pushed up by that of a few highly paid managers. Similarly, the mean score of a class in

an exam could be pulled down quite a bit by a few very low scores. To offset the effect

caused by a small number of extreme values, we can instead use the trimmed mean,

which is the mean obtained after chopping off values at the high and low extremes. For

example, we can sort the values observed for salary and remove the top and bottom 2%

before computing the mean. We should avoid trimming too large a portion (such as

20%) at both ends, as this can result in the loss of valuable information.
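
The trimming step described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `trimmed_mean`, its `proportion` parameter, and the salary figures are all hypothetical, and the cut count is simply rounded down.

```python
def trimmed_mean(values, proportion=0.02):
    """Mean after dropping `proportion` of the values from each end.

    `proportion` is a fraction per end (0.02 drops the top and bottom 2%).
    The number of values dropped per end is rounded down (an assumption).
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    if not values:
        raise ValueError("cannot take the mean of an empty data set")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # values cut from each end
    kept = ordered[k:len(ordered) - k]      # middle portion that survives
    return sum(kept) / len(kept)


# Hypothetical salaries (in thousands of dollars) with one extreme value.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]
plain = sum(salaries) / len(salaries)             # pulled up by the 400
trimmed = trimmed_mean(salaries, proportion=0.1)  # drops 30 and 400
print(plain, trimmed)
```

With only ten values, a 2% trim removes nothing, which is why the usage example trims 10% from each end instead.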

For skewed (asymmetric) data, a better measure of the center of data is the median,

which is the middle value in a set of ordered data values. It is the value that separates the

higher half of a data set from the lower half.

In probability and statistics, the median generally applies to numeric data; however,

we may extend the concept to ordinal data. Suppose that a given data set of N values

for an attribute X is sorted in increasing order. If N is odd, then the median is the

middle value of the ordered set. If N is even, then the median is not unique; it is the two

middlemost values and any value in between. If X is a numeric attribute in this case, by

convention, the median is taken as the average of the two middlemost values.
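
As a sketch of the rule just stated, the two helpers below handle the numeric case (average the two middlemost values when N is even) and the ordinal case (report the middlemost value or values without averaging). The names `median` and `middle_values`, the `key` parameter, and the size labels are illustrative assumptions.

```python
def median(values):
    """Median of numeric data; averages the two middlemost values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # N odd: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # N even: average by convention


def middle_values(values, key=None):
    """Middlemost value(s) of ordinal data, returned as a tuple, not averaged."""
    ordered = sorted(values, key=key)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return (ordered[mid],)
    return (ordered[mid - 1], ordered[mid])


# Hypothetical ordinal attribute: the ranking dict supplies the ordering.
rank = {"small": 0, "medium": 1, "large": 2}
sizes = ["small", "large", "medium", "medium"]
print(median([4, 1, 3, 2]), middle_values(sizes, key=rank.get))
```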

Example 2.7 Median. Let's find the median of the data from Example 2.6. The data are already sorted

in increasing order. There is an even number of observations (i.e., 12); therefore, the

median is not unique. It can be any value within the two middlemost values of 52 and

56 (that is, within the sixth and seventh values in the list). By convention, we assign the

average of the two middlemost values as the median; that is, (52 + 56)/2 = 108/2 = 54. Thus,

the median is $54,000.

Suppose that we had only the first 11 values in the list. Given an odd number of

values, the median is the middlemost value. This is the sixth value in this list, which has

a value of $52,000.
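
The arithmetic in this example can be checked with Python's standard `statistics` module. Example 2.6 is not reproduced in this section, so the salary list below (in thousands of dollars) is an assumption, chosen to be consistent with the facts stated here: 12 sorted values whose sixth and seventh entries are 52 and 56.

```python
import statistics

# Assumed values for Example 2.6, in thousands of dollars (not shown in this section).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Even count: statistics.median averages the two middlemost values.
median_all = statistics.median(salaries)

# Odd count: only the first 11 values, so the sixth value is the median.
median_first_11 = statistics.median(salaries[:11])

print(median_all, median_first_11)
```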

The median is expensive to compute when we have a large number of observations.

For numeric attributes, however, we can easily approximate the value. Assume that data

are grouped in intervals according to their x_i data values and that the frequency (i.e.,

number of data values) of each interval is known. For example, employees may be

grouped according to their annual salary in intervals such as $10-20,000, $20-30,000,

and so on. Let the interval that contains the median frequency be the median interval.

We can approximate the median of the entire data set (e.g., the median salary) by

interpolation using the formula

\[
\text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \text{width}, \tag{2.3}
\]

where L_1 is the lower boundary of the median interval, N is the number of values in

the entire data set, (Σ freq)_l is the sum of the frequencies of all of the intervals that are
