Automatic Knowledge Acquisition from Historical
Document Archives: Historiographical Perspective
Katsumi Tanaka and Adam Jatowt
Graduate School of Informatics, Kyoto University
Yoshida-Honmachi, Sakyo-ku, 606-8501
Kyoto, Japan
{tanaka,adam}@dl.kuis.kyoto-u.ac.jp
Abstract. Recently many archives containing historical documents have been
created and made open for public use. The availability of such large collections
of past data provides opportunities for new kinds of knowledge extraction. In
this paper we discuss the potential of web and news archives for automatic
acquisition of historical knowledge. We also describe some aspects of the data
and we draw a parallel to historiography, the study of how history is written.
Keywords: web archive, news archive, historical information, archive usage,
text mining.
1 Introduction
Recently many historical document archives have been created. News and web archives
are probably the best known. Although web and news archives are
sometimes regarded simply as loose collections of web pages or news articles, we limit
our focus to their most common form, that is, a collection of historical documents. We
define a document archive as a repository of past documents or their copies containing the
evidence of the state of the past frozen at particular moments in time. According to this
view, a document archive is treated in this paper as a collection of documents which
were collected in the past, have remained unchanged from their original form, and
contain metadata such as a document timestamp. Page versions in web archives (or news
articles in news archives) are grouped either thematically or according to other
characteristics such as language, location, or domain name. Often, however, the major
underlying order is the chronological one.
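The definition above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not an implementation taken from any existing archive: the names ArchivedDocument, DocumentArchive, and group_by, as well as the chosen fields, are hypothetical. Each record is immutable (the document remains unchanged from its original form), carries a timestamp as metadata, can be grouped by a characteristic such as language, and is returned in chronological order.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: a stored snapshot cannot be modified afterwards
class ArchivedDocument:
    """Hypothetical record for one archived page version or news article."""
    timestamp: datetime   # moment at which the document was captured
    source: str           # e.g. a domain name or newspaper title (assumed field)
    text: str             # document content as collected
    language: str = "unknown"


class DocumentArchive:
    """Hypothetical in-memory archive of timestamped documents."""

    def __init__(self):
        self._docs = []

    def add(self, doc):
        self._docs.append(doc)

    def chronological(self):
        # The major underlying order: oldest document first.
        return sorted(self._docs, key=lambda d: d.timestamp)

    def group_by(self, attribute):
        # Group by a characteristic such as "language" or "source",
        # keeping chronological order inside each group.
        groups = defaultdict(list)
        for doc in self.chronological():
            groups[getattr(doc, attribute)].append(doc)
        return dict(groups)
```

For example, after adding two documents captured in 2001 and 2005, chronological() returns the 2001 document first regardless of insertion order, and group_by("language") returns one list per language.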
Many technical and sociological issues are related to selecting archival material
and enabling its efficient access and preservation. These topics have been frequently
discussed by the web archiving and digital libraries communities [11]. In the case of news
archives, the selection and preservation problems seem to be of lesser importance due
to the much smaller amount of data as well as the relatively long tradition of collecting
and storing news articles by libraries or other institutions.
Nowadays, people leave more and more traces of their activity in digital form.
Digitization is also widely available, making it possible to convert
traditional print documents to digital form as well. In addition, the web encourages people
to publish and interact with others, thus leaving numerous historical traces that could
 