Automatic Knowledge Acquisition from Historical Document Archives: Historiographical Perspective - Culture and Computing

added time dimension. Browsing paths can be specified by the archive administrators or can simply reflect the previous link structure that has been recorded at the time of document archiving.

The Wayback Machine is the best-known application for accessing data in web archives through a web-based interface [11]. It serves as the gateway to the Internet Archive's 3 web collection, which is the largest online repository of past page versions. The past content of a page can be accessed by directly entering a specially modified URL that contains the requested date. Another way is through a directory page that lists the available past versions. Users can view the content of these versions and even follow their links, as the links are rewritten to point to the corresponding versions within the archive's collection. The directory page also indicates which page versions contain changes compared with their adjacent versions by marking them with asterisks.
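As an illustration of the "specially modified URL" form of access, the following minimal sketch builds such a URL. It assumes the pattern commonly used by web.archive.org, in which a 14-digit timestamp (YYYYMMDDhhmmss) is placed in front of the original address; the text above does not spell out this format, and the helper name `wayback_url` is hypothetical.

```python
from datetime import datetime

# Assumed URL prefix of the Wayback Machine; not stated in the chapter itself.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, when: datetime) -> str:
    """Build a date-qualified archive URL for page_url.

    The requested date is encoded as a 14-digit timestamp and
    prepended to the original page address.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


# Request the version of a page as of 15 January 2009.
print(wayback_url("http://www.time.com/time", datetime(2009, 1, 15)))
```

When no capture exists for the exact timestamp, the service typically serves a nearby one, so the date acts as a target rather than as an exact key.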

Search is usually regarded as the process of retrieving a particular piece of required information or a document, or of finding a starting point for browsing. In the case of temporal archives, it often involves specifying time constraints on the publication dates of the documents to be retrieved. Most online news archives support temporal search over their collections, returning news articles published within a requested time frame. Magazines such as Time 4, Newsweek 5 and plenty of less popular ones provide online search facilities for their proprietary collections of news articles, which have often been accumulated over long time spans (e.g., 86 years for Time and 29 years for Newsweek). In addition, libraries throughout the world have recently begun digitizing their content and organizing it into searchable and browsable digital collections. On top of that, large web companies such as Google, Yahoo! 6 and Microsoft 7 have started aggregating data from multiple distributed news providers.
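Temporal search as described above, a keyword query combined with a constraint on publication dates, can be sketched in a few lines. This is only a toy model under stated assumptions: the `Article` record, the `temporal_search` function, and the sample titles are all hypothetical and do not reflect how any particular news archive is implemented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical archive record: a title and a publication date."""
    title: str
    published: date


def temporal_search(articles, keyword, start, end):
    """Return articles whose title contains keyword and whose
    publication date lies in the closed interval [start, end],
    ordered from oldest to newest."""
    keyword = keyword.lower()
    hits = [
        a for a in articles
        if start <= a.published <= end and keyword in a.title.lower()
    ]
    return sorted(hits, key=lambda a: a.published)


# Made-up sample collection, for illustration only.
archive = [
    Article("Election results announced", date(1996, 11, 6)),
    Article("Election campaign begins", date(1996, 2, 12)),
    Article("Election reform debated", date(2001, 5, 3)),
    Article("Flood damages reported", date(1996, 7, 20)),
]

for article in temporal_search(archive, "election", date(1996, 1, 1), date(1996, 12, 31)):
    print(article.published, article.title)
```

A real archive would use an index over dates and terms rather than a linear scan, but the filtering logic is the same.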

In the case of web archives, although many collections such as the Internet Archive provide only URL-based access, some of them, such as the Portuguese Web Archive 8 and Pandora 9, already enable textual search.

2.2 Automatic Knowledge Extraction

Browsing and searching offer direct access to stored data. In many cases they are the standard, and the only, way to utilize archived content. However, manually locating and viewing particular documents in archives can be tiresome for users. In addition, retrieved “micro-information” such as single web page versions or past news articles may not always be interesting or useful to users. On the other hand, the results of a more comprehensive analysis of larger portions of archive content could prove attractive. For example, users may not be interested in the details of a particular past event but may want to know the frequency of similar events, their cause-effect relations, associated trends, their differences from similar present-day events, and so on.
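To make the idea of aggregate analysis concrete, the sketch below computes one of the simplest statistics mentioned above: the frequency of similar events per year, together with the change between successive years. It assumes the dates of the matching events have already been extracted from the archive; the function names `yearly_frequency` and `frequency_change` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def yearly_frequency(event_dates):
    """Count how many events fall in each calendar year,
    returned as a dict ordered by year."""
    counts = Counter(d.year for d in event_dates)
    return dict(sorted(counts.items()))


def frequency_change(freq):
    """Difference in event counts between successive years that
    appear in freq (years with no events are simply skipped)."""
    years = sorted(freq)
    return {later: freq[later] - freq[earlier]
            for earlier, later in zip(years, years[1:])}


# Made-up dates of similar events, for illustration only.
dates = [date(2001, 1, 1), date(2001, 5, 1), date(2002, 3, 3), date(2004, 1, 1)]

freq = yearly_frequency(dates)
print(freq)
print(frequency_change(freq))
```

Cause-effect relations and comparisons with present-day events require far richer models; this sketch covers only the counting step.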

3 http://www.archive.org

4 http://www.time.com/time

5 http://www.newsweek.com

6 http://www.yahoo.com

7 http://www.msnbc.msn.com/

8 http://arquivo-web.fccn.pt/portuguese-web-archive-2?set_language=en

9 http://pandora.nla.gov.au/
