Automatic Knowledge Acquisition from Historical Document Archives: Historiographical Perspective - Culture and Computing


can be set according to the query. This querying approach is designed to reflect the actual distribution of relevant documents over time within the required time frame T. By dividing the query into many sub-queries, we reduce the effect of the temporal aspect used in the ranking algorithm of the particular collection, and thus rely more on the actual document relevance.
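The division of a query over the time frame T into sub-queries can be sketched as follows. This is only a minimal illustration: it assumes that T is given as a start and an end date and that it is split into contiguous sub-periods of roughly equal length. The function name `split_time_frame` and the equal-width split are our assumptions and are not prescribed by the approach described above.

```python
from datetime import date, timedelta


def split_time_frame(start, end, n_parts):
    """Split the inclusive date range [start, end] into contiguous sub-periods.

    Hypothetical helper: each returned (sub_start, sub_end) pair would be used
    as the date restriction of one sub-query issued to the archive.
    """
    if n_parts < 1 or end < start:
        raise ValueError("need n_parts >= 1 and start <= end")
    total_days = (end - start).days + 1
    n_parts = min(n_parts, total_days)  # never create empty sub-periods
    periods = []
    cursor = start
    for i in range(n_parts):
        # Spread the remainder days over the first sub-periods.
        length = total_days // n_parts + (1 if i < total_days % n_parts else 0)
        sub_end = cursor + timedelta(days=length - 1)
        periods.append((cursor, sub_end))
        cursor = sub_end + timedelta(days=1)
    return periods
```

Issuing one sub-query per returned period and merging the top results of each would keep a single period from dominating the result list.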

As a crude distinction, the knowledge obtained from historical archives can be divided into two broad classes:

• knowledge about a particular source or group of sources and their changes and evolution

• knowledge about the past outlook of the world and society, as well as about the evolution of a particular topic or piece of information over time

Below, we describe both classes in more detail.

2.2.1 Knowledge on Sources

The first kind of knowledge relates to a given source or information container such as a web page or newspaper. Basic information of this kind includes the frequency and changes in appearance of certain words or topics, the age of document components, and the document change frequency or change degree.
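As a rough illustration of the last two measures, the sketch below computes a change degree between two versions of a document and a change frequency over a sequence of versions. The definitions used here (one minus the word-level similarity ratio from Python's `difflib`, and the share of consecutive version pairs that differ) are our own simplifying assumptions, not definitions taken from the cited work.

```python
from difflib import SequenceMatcher


def change_degree(old_text, new_text):
    """Return a value in [0, 1]: 0 for identical word sequences, 1 for disjoint ones.

    Assumption: change degree is approximated as 1 - word-level similarity.
    """
    matcher = SequenceMatcher(None, old_text.split(), new_text.split())
    return 1.0 - matcher.ratio()


def change_frequency(versions):
    """Return the fraction of consecutive version pairs whose text differs.

    `versions` is a list of document texts ordered from oldest to newest.
    """
    if len(versions) < 2:
        return 0.0
    changed = sum(1 for old, new in zip(versions, versions[1:]) if old != new)
    return changed / (len(versions) - 1)
```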

For news archives this means characterizing a particular newspaper, magazine, etc. by analyzing its past editions and contributions. Such information may be useful for measuring the characteristics of news sources, identifying the relevant and high-quality ones, and so on. In the case of web archives this kind of knowledge could supply missing information to users browsing the current version of a page. For example, users could learn about common topics that were discussed on the page recently or a long time ago. They could then contrast such topics with those published in the present page version. This could provide context for better understanding the current page version as well as the consistency, periodicity and other temporal characteristics of the page [8,9].

As another kind of knowledge, users could receive information on the age of certain page components in order to support the evaluation of their freshness and validity. This information would be obtained by comparing past page versions with the current one. For example, a page component annotated as “new” may turn out to be quite old once the current page version is compared with older versions [9].
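The comparison of the current page version with past versions can be sketched as follows. We assume, purely for illustration, that each archived snapshot has already been segmented into a set of component identifiers (e.g. text blocks), and that the age of a component is measured from the oldest snapshot in the unbroken run of snapshots that contain it. The name `component_first_seen` is hypothetical.

```python
from datetime import date


def component_first_seen(versions):
    """Map each component of the newest snapshot to the date it first appeared.

    `versions` is a list of (snapshot_date, components) pairs ordered from
    oldest to newest, where `components` is a set of component identifiers.
    A component that disappeared and later reappeared is dated from its
    latest reappearance.
    """
    if not versions:
        return {}
    current_date, current = versions[-1]
    first_seen = {component: current_date for component in current}
    still_present = set(current)
    # Walk backwards in time while components remain continuously present.
    for snapshot_date, components in reversed(versions[:-1]):
        still_present &= set(components)
        if not still_present:
            break
        for component in still_present:
            first_seen[component] = snapshot_date
    return first_seen
```

A component labelled “new” whose first-seen date lies years before the current snapshot would be a candidate for the situation described above.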

2.2.2 Knowledge on World and Society

The second kind of knowledge can be helpful for understanding the past as well as for learning about the present, e.g. trends, events, their origins and causes. There are myriad potential kinds of such knowledge and ways in which it could be utilized. In the simplest form, it can be extracted by applying summarization, filtering, association and other text mining technologies to the time series of features.
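A time series of a simple feature can be built as in the sketch below. It assumes that the feature is the relative frequency of a single term and that each archived document is available as a (period, text) pair with whitespace-separated tokens. Both the feature choice and the function name `term_time_series` are our assumptions for illustration only.

```python
from collections import Counter


def term_time_series(docs, term):
    """Return {period: relative frequency of `term`} with periods in sorted order.

    `docs` is an iterable of (period, text) pairs; several documents may
    share a period, in which case their token counts are pooled.
    """
    term = term.lower()
    term_counts = Counter()
    token_totals = Counter()
    for period, text in docs:
        tokens = text.lower().split()
        token_totals[period] += len(tokens)
        term_counts[period] += tokens.count(term)
    return {
        period: (term_counts[period] / token_totals[period]) if token_totals[period] else 0.0
        for period in sorted(token_totals)
    }
```

Filtering or association techniques could then be applied to such series, for instance by keeping only periods whose frequency exceeds a chosen threshold.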

The NY Times API 10 is an example of a programmable interface that offers an effective tool for mining news collections for this kind of knowledge. For a given time period one can find the names of objects mentioned in the news articles such as place or

10 http://developer.nytimes.com/
