month of traffic data) and may not represent actual population relationships.
Remember that the goal is to make general inferences and draw, with a
certain degree of confidence, conclusions about the population. The good
news is that statistics can help us with this.
By using statistics we can calculate how likely a sample correlation
coefficient at least as large as the one observed would be, if the null
hypothesis were to hold true. In other words, we can calculate the probability
of such a large observed sample correlation if there is indeed no linear
association in the population.
This probability is called the p-value, or the observed significance level,
and it is compared against a predetermined threshold called the significance
level of the statistical test. If the p-value is small enough, typically less than
0.05 (5%), or in the case of large samples less than 0.01 (1%), the null
hypothesis is rejected in favor of the alternative. The significance level of the
test is symbolized by the letter α and denotes the false positive probability
(probability of falsely rejecting a true null hypothesis) that we are willing
to tolerate. Although not displayed here, in our example the probability of
obtaining such a large correlation coefficient by chance alone is small and
less than 1%. Thus, we reject the null hypothesis of no linear association and
consider the correlation between these two fields as statistically significant at
the 0.01 level.
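As a rough illustration of this test, the sketch below estimates the p-value of a sample correlation by permutation: one field is shuffled many times to mimic a population with no linear association, and we count how often the shuffled correlation is at least as large (in absolute value, so the test is two-sided) as the observed one. The data, the function names, and the choice of a permutation approach are assumptions made for the example, not the procedure used to produce the results in the text.

```python
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value by shuffling ys to simulate the null."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical sample: number of voice calls and minutes for 30 customers.
data_rng = random.Random(42)
calls = [data_rng.randint(10, 200) for _ in range(30)]
minutes = [2.5 * c + data_rng.gauss(0, 40) for c in calls]

alpha = 0.01  # significance level chosen before looking at the data
r = pearson_r(calls, minutes)
p_value = correlation_p_value(calls, minutes)
reject_null = p_value < alpha
print(f"r = {r:.2f}, p = {p_value:.4f}, reject null at {alpha}: {reject_null}")
```

With 2000 permutations the smallest p-value this sketch can report is 1/2001, so more permutations would be needed to test against a much stricter α.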
This logic is applied to various types of data (frequencies, means, other
statistical measures) and types of problems (associations, mean comparisons):
we formulate a null hypothesis of no effect and calculate the probability of
obtaining such a large effect in the sample if indeed there was no effect in the
population. If the probability (p-value) is small enough (typically less than
0.05) we reject the null hypothesis.
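The same logic can be sketched for a mean comparison. In the example below, the null hypothesis is that two customer groups share the same mean, so group labels are interchangeable; shuffling the pooled values shows how often a difference at least as large as the observed one arises by chance. The group values and the function name are hypothetical, chosen only to illustrate the reasoning.

```python
import random


def mean_difference_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for a difference in means by permutation."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        # Under the null, any split of the pooled values is equally likely.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical monthly SMS counts for two customer groups.
group_a = [12, 15, 11, 14, 13, 16, 12, 15]
group_b = [22, 25, 21, 27, 24, 23, 26, 22]

p_value = mean_difference_p_value(group_a, group_b)
print(f"p = {p_value:.4f}, reject null at 0.05: {p_value < 0.05}")
```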
The number of outgoing voice calls for instance is positively correlated with
the minutes of calls. The respective correlation coefficient is 0.84, denoting that
customers who make a large number of voice calls also tend to talk a lot. Some
other fields are negatively correlated, such as the percentage of voice and SMS calls
(-0.98). This signifies a contrast between voice and SMS usage, not necessarily
in terms of usage volume but in terms of the total usage ratio that each service
accounts for. Conversely, other attributes do not seem to be related, like Internet
and roaming calls for instance. Studying correlation tables in order to arrive at
conclusions is a cumbersome job. That is where PCA comes in. It analyzes such
tables and identifies groups of related fields.
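To give a feel for how a correlation table can be summarized into groups, the sketch below computes the eigenvalues of a small correlation matrix with a plain Jacobi rotation routine and counts those greater than 1. Only the 0.84 entry comes from the text; the other entries, the field names, and the "eigenvalue greater than 1" cutoff (a commonly used rule of thumb, assumed here) are illustrative and may differ from the criterion the text goes on to present.

```python
import math


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


# Hypothetical correlation matrix for four usage fields.
fields = ["voice_calls", "voice_minutes", "sms_count", "mms_count"]
corr = [
    [1.00, 0.84, 0.05, 0.02],
    [0.84, 1.00, 0.03, 0.04],
    [0.05, 0.03, 1.00, 0.60],
    [0.02, 0.04, 0.60, 1.00],
]

eigenvalues = jacobi_eigenvalues(corr)
# Assumed rule of thumb: keep components whose eigenvalue exceeds 1.
n_components = sum(1 for ev in eigenvalues if ev > 1.0)
print([round(ev, 3) for ev in eigenvalues], n_components)
```

In this toy matrix the two retained components correspond to the voice fields and the messaging fields, which is the kind of grouping that would otherwise have to be read off the correlation table by eye.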
PCA applied to the above data revealed five components by using the
eigenvalue criterion, which we will present shortly. Table 3.3, referred to as the