Environmental Engineering Reference
2.2 ESSENTIAL ENVIRONMENTAL STATISTICS
2.2.1 Measurements of Central Tendency and Dispersion
Data from populations or samples can be characterized by two important descriptive
statistics: the center (central tendency) and the variation (dispersion) of the data.
Here the population refers to the entire body to be investigated (e.g., all fish in a
lake or all contaminated soil on a property), and a sample is the portion of that body
taken in order to represent the true value of the population.
The central tendency is measured by three general methods: the mean, the
median, and the mode. The population mean ($\mu$) is the true value, but in most cases it is
unknown. The sample mean ($\bar{x}$), or the commonly used arithmetic mean, is calculated by
adding all values and dividing by the total number of observations:
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (2.12) $$
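As a minimal sketch of the arithmetic mean in Eq. 2.12, the Python snippet below sums the observations and divides by their count. The function name `arithmetic_mean` and the concentration values are hypothetical, chosen only for illustration.

```python
def arithmetic_mean(values):
    """Sample mean: sum of the observations divided by their number (Eq. 2.12)."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


# Hypothetical concentrations (mg/L), invented for illustration only.
concentrations = [2.0, 4.0, 6.0, 8.0]
print(arithmetic_mean(concentrations))  # 5.0
```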
The geometric mean (GM) is used when the data are not symmetrical (skewed),
which is common in environmental data. To use the geometric mean, all values must
be positive (greater than zero). The geometric mean is defined as:
$$ \mathrm{GM} = \sqrt[n]{x_1 x_2 x_3 \cdots x_n} \quad \text{or} \quad \mathrm{GM} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln(x_i) \right) \qquad (2.13) $$
Put another way, the geometric mean is calculated by taking the logarithms of the
original data, summing these, dividing by the number of observations (n), and then
taking the antilogarithm (exponential) of the result.
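The two forms of Eq. 2.13 can be compared with a short Python sketch. The function names and the sample values are assumptions made for illustration. The log form is usually the safer choice numerically, because the product of many values can overflow or underflow.

```python
import math


def geometric_mean_root(values):
    """n-th root of the product of the values (first form of Eq. 2.13)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return math.prod(values) ** (1.0 / len(values))


def geometric_mean_log(values):
    """Exponential of the mean of the natural logarithms (second form of Eq. 2.13)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical right-skewed data, invented for illustration only.
data = [1.0, 10.0, 100.0]
print(geometric_mean_root(data))  # approximately 10
print(geometric_mean_log(data))   # approximately 10
print(sum(data) / len(data))      # arithmetic mean is 37.0, pulled up by the largest value
```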
The median ($\tilde{x}$) is defined as the middle value of the data set when the set has
been arranged in numerical order. If the number of measurements (n) is odd, the
median is the number located in the exact middle of the list (Eq. 2.14). If the number
(n) is even, the median is found by computing the mean of the two middle numbers
(Eq. 2.15).
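$$ \tilde{x} = x_{(n+1)/2} \quad (\text{if } n \text{ is odd}) \qquad (2.14) $$
$$ \tilde{x} = \frac{1}{2} \left( x_{n/2} + x_{n/2+1} \right) \quad (\text{if } n \text{ is even}) \qquad (2.15) $$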
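The odd and even cases of Eqs. 2.14 and 2.15 can be sketched in Python as follows. The function name `median` and the sample values are illustrative assumptions. The equations use 1-based positions in the sorted list, while Python lists are 0-based, so the indices are shifted by one.

```python
def median(values):
    """Middle value of the sorted data (Eq. 2.14) or mean of the two middle values (Eq. 2.15)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: 1-based positions n / 2 and n / 2 + 1 are 0-based indices mid - 1 and mid.
    return 0.5 * (ordered[mid - 1] + ordered[mid])


print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```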
The median is sometimes the preferred choice when the data contain extreme values. Compared
with the mean, it is less affected by extreme data points (i.e., it is more "robust"), and it is
also largely unaffected by order-preserving data transformations (for example, taking
logarithms). The median is, therefore, recommended for data distributions that are skewed as
a result of the presence of outliers. The median also has an advantage over the mean in
environmental data when non-numerical results exist, such as "not detected" or "below
detection limit".
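Both points can be illustrated with a small Python sketch using the standard `statistics` module. The data, the `"ND"` label, and the helper `median_with_nondetects` are hypothetical. Ranking every non-detect below all detected values is one simple convention assumed here, not a method prescribed by the text.

```python
import math
import statistics

# Hypothetical data, invented for illustration only.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = [1.0, 2.0, 3.0, 4.0, 500.0]

print(statistics.mean(clean), statistics.median(clean))                # 3.0 3.0
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 102.0 3.0


def median_with_nondetects(results, label="ND"):
    """Median of results in which non-detects are marked with `label`.

    Assumption: each non-detect ranks below every detected value. Returns None
    when a middle position is itself a non-detect, because the median is then
    only known to lie below the detection limit.
    """
    if not results:
        raise ValueError("at least one result is required")
    ranked = sorted(-math.inf if r == label else float(r) for r in results)
    n = len(ranked)
    mid = n // 2
    middle = [ranked[mid]] if n % 2 == 1 else [ranked[mid - 1], ranked[mid]]
    if any(math.isinf(v) for v in middle):
        return None
    return sum(middle) / len(middle)


print(median_with_nondetects(["ND", "ND", 0.5, 0.8, 1.2]))  # 0.5
```

A mean cannot be computed from the last data set without first substituting a number for each non-detect, whereas the median is fully determined as long as the middle position holds a detected value.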
The mode (M) is the value in the data set that occurs most frequently. When two
values occur with the same greatest frequency, each one is a mode and the data set is
said to be bimodal. For the data set 1, 1, 2, 2, 2, 9, 10, 11, 11, the mode is 2. The mode does
not always exist; when no value is repeated, the data set is described as