2. Extracting the most prominent features of the data and ignoring the rest.
We shall explore these two approaches in the following sections.
1.1.4 Summarization
One of the most interesting forms of summarization is the PageRank idea, which
made Google successful and which we shall cover in Chapter 5. In this form
of Web mining, the entire complex structure of the Web is summarized by a
single number for each page. This number, the “PageRank” of the page, is
(oversimplifying somewhat) the probability that a random walker on the graph
would be at that page at any given time. The remarkable property this ranking
has is that it reflects very well the “importance” of the page - the degree to
which typical searchers would like that page returned as an answer to their
search query.
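To make the random-walker description concrete, here is a minimal sketch that repeatedly moves probability along links until the distribution settles. It is an illustration under stated assumptions, not the full PageRank computation of Chapter 5: the four-page graph `toy_web` and the function name `random_walk_distribution` are hypothetical, every page is assumed to have at least one out-link, and the graph is assumed to be connected well enough that the walk converges.

```python
def random_walk_distribution(links, steps=100):
    """Approximate the long-run probability that a random walker is at each page.

    `links` maps each page to the list of pages it links to.
    Assumption: every page has at least one out-link (no dead ends).
    """
    pages = sorted(links)
    # Start with the walker equally likely to be at any page.
    prob = {page: 1.0 / len(pages) for page in pages}
    for _ in range(steps):
        nxt = {page: 0.0 for page in pages}
        for page, outs in links.items():
            # The walker at `page` follows one of its out-links uniformly at random.
            share = prob[page] / len(outs)
            for target in outs:
                nxt[target] += share
        prob = nxt
    return prob


# Hypothetical four-page Web used only for illustration.
toy_web = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

ranks = random_walk_distribution(toy_web)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

On this toy graph the probabilities approach 1/3 for page A and 2/9 for each of the other pages, so A would be summarized as the most "important" page.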
Another important form of summary - clustering - will be covered in Chapter 7.
Here, data is viewed as points in a multidimensional space. Points
that are “close” in this space are assigned to the same cluster. The clusters
themselves are summarized, perhaps by giving the centroid of the cluster and
the average distance from the centroid of points in the cluster. These cluster
summaries become the summary of the entire data set.
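The cluster summary just described can be sketched in a few lines. This is only an illustration, assuming Euclidean distance and a cluster that has already been formed; the function name `summarize_cluster` and the sample points are hypothetical.

```python
import math


def summarize_cluster(points):
    """Return (centroid, average Euclidean distance of the points from the centroid)."""
    n = len(points)
    dims = len(points[0])
    # The centroid is the coordinate-wise mean of the points.
    centroid = tuple(sum(p[i] for p in points) / n for i in range(dims))
    # Average distance from the centroid measures how spread out the cluster is.
    avg_dist = sum(math.dist(p, centroid) for p in points) / n
    return centroid, avg_dist


# Hypothetical cluster: the four corners of a square with side 2.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
centroid, avg_dist = summarize_cluster(cluster)
print(centroid, avg_dist)
```

Here the four points are replaced by two numbers per cluster: the centroid (1, 1) and an average distance of the square root of 2.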
Example 1.2: A famous instance of clustering to solve a problem took place
long ago in London, and it was done entirely without computers.² The physician
John Snow, dealing with a cholera outbreak, plotted the cases on a map of the
city. A small illustration suggesting the process is shown in Fig. 1.1.
Figure 1.1: Plotting cholera cases on a map of London
2 See http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak.