missing from the collection, etc. Therefore, exhaustive data accumulation and
preparation steps are critical to the effectiveness of knowledge discovery processes. In
the case of web archives, two types of uncertainty can be distinguished for a given
series of past page versions. The first type, called content uncertainty, is caused by the
lack of information about transient content that appeared on the page within the
time periods bounded by the timestamps of consecutive past page versions.
Consider two versions of a page, v_left and v_right, captured at time points t_left and t_right (t_left
< t_right). The probability, P(v_i), that there is some version v_i satisfying t_left < t_i < t_right and
containing content different from that in v_left and v_right depends on many factors, such
as the length of the period [t_left, t_right], the type of the page, and the content difference
between page versions v_left and v_right. Basically, the longer the gaps between the
page versions, the greater the probability of transient content occurring in the page.
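
The dependence on gap length can be illustrated with a minimal sketch. It assumes, purely for illustration, that page changes arrive as a Poisson process with a constant rate; this model, the function name `prob_two_or_more_changes`, and the rate used in the loop are assumptions, not something the text specifies. A version distinct from both captures can only exist if the page changed at least twice within the gap, so the probability of two or more change events serves here as a rough upper-bound proxy for P(v_i). The other factors mentioned above (page type, content difference) are not modeled.

```python
import math


def prob_two_or_more_changes(gap_days: float, changes_per_day: float) -> float:
    """P(N >= 2) for a Poisson count N with mean changes_per_day * gap_days.

    Assumption (not from the text): changes follow a Poisson process.
    """
    if gap_days < 0 or changes_per_day < 0:
        raise ValueError("gap_days and changes_per_day must be non-negative")
    mean = changes_per_day * gap_days
    # P(N >= 2) = 1 - P(N = 0) - P(N = 1)
    return 1.0 - math.exp(-mean) * (1.0 + mean)


# Hypothetical rate: one change every ten days on average.
for gap in (1, 7, 30, 180):
    print(gap, round(prob_two_or_more_changes(gap, 0.1), 3))
```

Under this assumed model the value grows monotonically with the gap, which matches the qualitative statement that longer gaps make unseen transient content more likely.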
The second type, called time uncertainty, relates to estimating the dates of detected
content changes. In the above example, the exact timing of the content changes
estimated from the comparison of v_left and v_right is unknown and can only be crudely
approximated. Time uncertainty, like content uncertainty, also depends on the number
of acquired past page snapshots and their distribution over the page history.
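
One crude approximation can be sketched as follows: whenever two consecutive snapshots differ, the change is known only to lie between their timestamps, so the midpoint can serve as an estimate with a worst-case error of half the gap. The `Snapshot` and `ChangeEstimate` types, the function name, and the choice of the midpoint are illustrative assumptions rather than a method taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    captured_at: datetime
    content: str


@dataclass
class ChangeEstimate:
    earliest: datetime    # t_left: last capture with the old content
    latest: datetime      # t_right: first capture with the new content
    midpoint: datetime    # crude point estimate of the change time
    max_error: timedelta  # half the gap between the two captures


def estimate_change_times(snapshots):
    """Return one estimate per pair of consecutive snapshots that differ."""
    ordered = sorted(snapshots, key=lambda s: s.captured_at)
    estimates = []
    for left, right in zip(ordered, ordered[1:]):
        if left.content != right.content:
            half_gap = (right.captured_at - left.captured_at) / 2
            estimates.append(
                ChangeEstimate(
                    earliest=left.captured_at,
                    latest=right.captured_at,
                    midpoint=left.captured_at + half_gap,
                    max_error=half_gap,
                )
            )
    return estimates
```

With more snapshots, or snapshots spread more evenly over the page history, the gaps shrink and so does `max_error`.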
[Figure 3 (diagram not reproduced). Recoverable labels: "web (without web/news archives)"; "web/news archives containing documents created in distant past"; "primary sources"; "secondary sources"; "primary and secondary sources"; "Distant past"; "Near past and present".]
Fig. 3. Concept of primary and secondary sources in web and web/news archives
Detecting the differences between primary and secondary sources can also provide
interesting insight. Given a popular object (e.g., a company, person, or place), one
could compare the amount of attention it receives in primary and secondary sources,
as well as the way in which the authors refer to it. For example, a
scientist could publish a paper that went unrecognized by peers at the time of
publication, yet some time later the paper could come to be considered
highly influential. Table 1 portrays this concept. Also, the sentiment toward given
events may change over time. Such differences in sentiment could be
detected by comparing the sentiment expressions used in primary and secondary
sources (Table 2).
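
The attention comparison can be sketched in a few lines. The sketch assumes, for illustration only, that each document carries a creation date and that documents created on or before a chosen cutoff date count as primary sources while later ones count as secondary; this cutoff rule, the function name, and the use of simple case-insensitive substring counts as a measure of attention are all assumptions, not definitions taken from the text.

```python
from datetime import date


def attention_by_source_type(documents, term, cutoff):
    """Count mentions of `term` in primary vs. secondary documents.

    documents: iterable of (creation_date, text) pairs.
    cutoff: documents created on or before this date are treated as
            primary sources (an illustrative assumption).
    """
    counts = {"primary": 0, "secondary": 0}
    needle = term.lower()
    for created, text in documents:
        kind = "primary" if created <= cutoff else "secondary"
        counts[kind] += text.lower().count(needle)
    return counts
```

A paper that was ignored at first and later became influential would show a low primary count and a high secondary count. Comparing sentiment would follow the same split, with a sentiment measure replacing the mention count.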