Information Technology Reference
In-Depth Information
the distribution of the variables clicks and length, which indeed behave similarly. On
the other hand, we have decided not to remove observations possibly outlying with
respect to the start variable. This is because of the nature of the variable itself as
well as its observed distribution.
The resulting visitors dataset contains 22152 observations, down from the initial
22527.
Given the heterogeneous nature of the visitors to the site, confirmed by the
exploratory stage, we have decided to perform a cluster analysis, in order to
find homogeneous clusters of behaviours. We remark that our primary goal here is not
cluster analysis per se; cluster analysis should thus be seen as exploratory
or, better, preliminary to the local models that we are seeking.
As clustering variables we have considered the three quantitative variables start,
length and clicks, as well as the binary variable purchase, all of which are
instrumental to our objective of understanding navigation patterns. For the cluster
analysis, we have decided to first run a hierarchical method, to find the number of
groups, and then a non-hierarchical method, to allocate observations to the determined
number of groups. As distance function we have considered the Euclidean distance; as
hierarchical method, the method of Ward (chosen after some comparative experiments).
Finally, to allocate observations we have chosen the K-means non-hierarchical
method.
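The two-stage procedure described above can be sketched as follows. This is only an illustrative sketch on made-up two-dimensional points, not the analysis actually run on the visitors dataset. Several choices are assumptions not stated in the text: the number of groups is picked by the largest jump in Ward's merge cost (helper choose_k, with a hypothetical upper bound max_k), the K-means step is seeded with the centroids of the Ward partition, and the variables are used without rescaling.

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def ward_partitions(points):
    """Naive agglomerative clustering with Ward's criterion.

    Returns {k: (partition into k clusters, cost of the merge that produced it)}.
    """
    clusters = [[p] for p in points]
    history = {len(clusters): ([c[:] for c in clusters], 0.0)}
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            # Ward: increase in within-cluster sum of squares caused by merging a and b.
            cost = len(a) * len(b) / (len(a) + len(b)) * sq_dist(centroid(a), centroid(b))
            if best is None or cost < best[0]:
                best = (cost, i, j)
        cost, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        history[len(clusters)] = ([c[:] for c in clusters], cost)
    return history

def choose_k(history, max_k=6):
    """Assumed heuristic: the k for which merging down to k-1 clusters costs most extra."""
    upper = min(max_k, max(history) - 1)
    return max(range(2, upper + 1), key=lambda k: history[k - 1][1] - history[k][1])

def kmeans(points, seeds, max_iter=100):
    """Lloyd's K-means started from the given seeds; returns (labels, centroids)."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # allocation is stable
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return labels, centroids

# Toy data (made up): three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
history = ward_partitions(points)                 # stage 1: hierarchical (Ward)
k = choose_k(history)                             # number of groups
seeds = [centroid(c) for c in history[k][0]]      # assumed seeding for stage 2
labels, centers = kmeans(points, seeds)           # stage 2: K-means allocation
```

The naive Ward loop recomputes every pairwise cost at each merge, so it is only suitable for small illustrations; on a dataset of some 22000 sessions a library implementation would be the practical choice.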
Table 3 summarizes the results of the analysis, showing, for each cluster, its
size (number of observations) and the mean values of the four variables used in the
classification. Notice that, for the purchase variable, the mean represents the
proportion of actual purchases.
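A summary of this kind (cluster size plus per-cluster means) could be computed along the following lines. The rows and cluster labels below are invented purely for illustration and do not reproduce Table 3; the function name summarize is likewise hypothetical.

```python
from collections import defaultdict

VARIABLES = ("start", "length", "clicks", "purchase")

def summarize(rows, labels):
    """Return {cluster: {"size": n, variable: mean, ...}} for each cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    summary = {}
    for label, members in sorted(groups.items()):
        n = len(members)
        means = {v: sum(m[v] for m in members) / n for v in VARIABLES}
        summary[label] = {"size": n, **means}
    return summary

# Invented sessions: purchase is coded 0/1, so its mean is the purchase proportion.
rows = [
    {"start": 9.0, "length": 120.0, "clicks": 5, "purchase": 0},
    {"start": 10.0, "length": 180.0, "clicks": 7, "purchase": 0},
    {"start": 20.0, "length": 900.0, "clicks": 30, "purchase": 1},
    {"start": 22.0, "length": 1100.0, "clicks": 40, "purchase": 0},
]
labels = [0, 0, 1, 1]
summary = summarize(rows, labels)
```

Because purchase is binary, averaging it within a cluster directly gives the share of sessions with a purchase, which is the reading of the mean noted above.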
From Table 3 notice that there are two bigger groups, 1 and 4, with the other two
smaller. We remark that, for this final cluster allocation, R² = 59%, which indicates
a good performance, given the complexity of the data at hand.
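For readers who want to reproduce such a figure, the sketch below computes R² under the usual definition for a partition, namely one minus the ratio of within-cluster to total sum of squares. That the 59% quoted above was obtained with exactly this definition is an assumption; the points are toy values.

```python
def sum_of_squares(points):
    """Sum of squared deviations of the points from their own centroid."""
    center = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, center))

def r_squared(points, labels):
    """R² = 1 - W/T, with W the within-cluster and T the total sum of squares."""
    total = sum_of_squares(points)
    within = sum(
        sum_of_squares([p for p, lab in zip(points, labels) if lab == g])
        for g in set(labels)
    )
    return 1 - within / total

# Toy example: two tight groups far apart, so R² is close to 1.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
r2 = r_squared(points, labels)
```

A value near 1 means that most of the variability lies between clusters rather than inside them; a value of 0 means the partition explains nothing.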
The results obtained from the cluster analysis confirm the heterogeneity of
behaviours. For the purpose of finding navigation patterns, we have decided to
concentrate the analysis on only one cluster, namely the third. This choice is
obviously subjective but is nevertheless supported by two important features. First,
the visitors in this cluster stay connected for a long time and visit many pages; both
facts allow us to better explore the navigation sequences between the web pages.
Second, this cluster has a high probability of purchase; it seems important to consider
the typical navigation pattern of a group of high purchasers.
Therefore, in the following sections, we shall consider a reduced dataset,
corresponding to the third cluster, with 1240 sessions and 21889 clicks. Figure 1
gives, for this cluster, the percentage of visits to each single page, and Figure 2
gives the corresponding pie diagram.
Before presenting the modelling results we remark that we have added, in the
cluster dataset, a start_session and an end_session page, respectively before the
first page and after the last page visited in each session. Obviously, such pages are
fictitious, and they occur with certainty rather than at random. They will nevertheless
serve to highlight the most common entrance and exit pages.
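A minimal sketch of this padding step, and of how it exposes entrance and exit pages, is given below. The sessions and page names (home, catalog, product, cart) are hypothetical; only start_session and end_session come from the text.

```python
from collections import Counter

def pad_session(pages):
    """Wrap a session's page sequence with the fictitious boundary pages."""
    return ["start_session", *pages, "end_session"]

# Hypothetical click sequences, one list per session.
sessions = [
    ["home", "catalog", "cart"],
    ["home", "product"],
    ["catalog", "cart"],
]
padded = [pad_session(s) for s in sessions]

# Count every observed page-to-page transition, boundary pages included.
transitions = Counter((a, b) for s in padded for a, b in zip(s, s[1:]))

# Pages reached from start_session are entrance pages;
# pages followed by end_session are exit pages.
entrances = Counter({b: n for (a, b), n in transitions.items() if a == "start_session"})
exits = Counter({a: n for (a, b), n in transitions.items() if b == "end_session"})
```

With the boundary pages in place, entrances and exits become ordinary transitions, so any model of page-to-page sequences treats them in the same way as the other links.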