documents, in the same way, users of online historical documents should be warned
against non-credible or incomplete sources. We distinguish here three aspects of
credibility from the viewpoint of automatic knowledge acquisition in
historical document archives:
• credibility of document metadata
• credibility of document content
• credibility of collection archive
Credibility of metadata concerns the trustworthiness of the document description. One
common metadata problem with historical artifacts is the inability to determine the
correct authorship. Often the creator of an artifact is not known, or only a
pseudonym is revealed. Another common metadata issue relates to dating artifacts.
Here the question to be asked is: “Was the source actually created or published at the
given time?” Documents that cannot be accurately positioned in time, or whose
timestamp is inaccurate, have little value or can even harm the knowledge acquisition
process. The time uncertainty of past page versions discussed above is related to the
credibility of the metadata of web page components.
The credibility of document content relates to the question of whether a given document is
original and has not been altered in any form, or whether all previous alterations
are explicitly known. In the previous section we briefly explained the related concept
of content uncertainty in the history of a web page.
A simple approach to automatically evaluating the credibility of documents
and their metadata is to employ machine learning methods for outlier detection. For
example, if a given document contains terms very different from those appearing
in other documents created at the same time, then its credibility, or at least the
credibility of its creation time, is questionable, and the document should be manually examined.
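The outlier check described above can be sketched as follows. This is only a minimal illustration, not a method prescribed by the text: it assumes a bag-of-words representation, cosine similarity against the pooled terms of the other documents from the same period, and a fixed threshold. The function names, the tiny stopword list, the threshold value, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "and", "to", "for", "of", "in"}


def term_vector(text):
    """Bag-of-words term frequencies, lower-cased, stopwords removed."""
    return Counter(t for t in text.lower().split() if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_outliers(docs, threshold=0.2):
    """Return indices of documents whose vocabulary differs from their peers.

    docs: texts whose metadata claims the same creation period.
    threshold: assumed similarity cut-off; it would need tuning on real data.
    """
    vectors = [term_vector(d) for d in docs]
    flagged = []
    for i, vec in enumerate(vectors):
        # Pool the terms of all *other* documents (leave-one-out),
        # so a document is never compared against itself.
        peers = Counter()
        for j, other in enumerate(vectors):
            if j != i:
                peers.update(other)
        if cosine(vec, peers) < threshold:
            flagged.append(i)
    return flagged


# Hypothetical documents that all claim to come from the same period.
docs_same_period = [
    "the steam railway opened a new line to the harbour",
    "the railway company ordered steam engines for the new line",
    "a telegraph office opened near the railway harbour",
    "download the smartphone app and stream video online",
]
print(flag_outliers(docs_same_period))  # indices to hand over for manual review
```

A flagged index only marks a candidate for manual examination; an unusual vocabulary may equally indicate an unusual topic rather than a wrong creation date.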
The last credibility type relates to possible bias in the construction of the collection. In
the process of longitudinal knowledge extraction from historical archives, one often
implicitly assumes that the collection reflects the popularity of information or the
frequency of published documents as it actually was in the past. However, if a given
archive has been constructed in such a way that certain information or sources are
over- or under-represented, then using such an archive may result in biased or inaccurate
knowledge. The archive can be useful for the knowledge creation process only if
one knows the scope of the bias introduced during the collection's creation.
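To illustrate why knowing the scope of the bias matters, the sketch below corrects archived document counts under the strong assumption that the fraction of each source's output that entered the archive is known. This inverse weighting is an illustrative choice, not a procedure given in the text; the source names, counts, and inclusion rates are hypothetical.

```python
def reweight_counts(observed, inclusion_rate):
    """Estimate original publication counts from archived counts.

    observed: number of archived documents per source.
    inclusion_rate: assumed-known fraction of each source's documents
        that was actually collected into the archive (0 < rate <= 1).
    """
    estimated = {}
    for source, count in observed.items():
        rate = inclusion_rate[source]
        if rate <= 0:
            raise ValueError(f"no usable inclusion rate for {source!r}")
        estimated[source] = count / rate
    return estimated


def shares(counts):
    """Normalise counts to relative frequencies."""
    total = sum(counts.values())
    return {source: value / total for source, value in counts.items()}


# Hypothetical archive in which one source is heavily under-represented.
observed = {"national_daily": 800, "local_weekly": 100}
inclusion_rate = {"national_daily": 0.5, "local_weekly": 0.125}

corrected = reweight_counts(observed, inclusion_rate)
print(shares(observed))   # relative frequencies as seen in the archive
print(shares(corrected))  # relative frequencies after correcting for the bias
```

Without the inclusion rates no such correction is possible, which is the situation in which the archive yields biased knowledge.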
4 Related Work
The web archiving community has recently been actively involved in issues of content
selection, preservation and management. An overview of the current state of the art,
as well as future directions in this area, can be found in [11].
Particular cases in which web archives should be useful to users, such as
legal trials or topic-focused report writing, were listed by the International Internet
Preservation Consortium [6]. Visual Knowledge Builder [14] was an early proposal of
an application for history navigation in private hypertexts. The authors' objective was
to enable users to play back the history of hypertexts, much as on a VCR. Users