added time dimension. Browsing paths can be specified by the archive administrators
or can simply reflect the link structure that was recorded at the time of
document archiving.
The Wayback Machine is the best-known application for accessing data in web archives
through a web-based interface [11]. It is used as the gateway to the Internet Archive's 3
web collection, which is the largest online repository of past page versions. The past
content of a page can be accessed by directly entering a specially modified URL
containing the requested date. Another way is through the directory page, which lists
the available past versions. Users can view the content of these versions and even
follow their links, as the links are rewritten to point to the corresponding versions
within the archive's collection. The directory page also indicates page versions that
contain changes when compared to their consecutive versions by marking them with asterisks.
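The idea of a "specially modified URL containing the requested date" can be sketched
as follows. This is a minimal illustration, not a description of the archive's
interface as documented in the text: the prefix https://web.archive.org/web and the
14-digit YYYYMMDDhhmmss timestamp are assumptions based on the commonly seen form of
such URLs, and the function name wayback_url is hypothetical.

```python
from datetime import datetime

# Assumed URL pattern; the text only says the URL is "specially modified".
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, when: datetime) -> str:
    """Build an archive URL requesting the version of page_url near `when`."""
    # Encode the requested date as a 14-digit timestamp (YYYYMMDDhhmmss).
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Request the version of a page as it looked around 1 March 2005.
    print(wayback_url("http://www.time.com/time", datetime(2005, 3, 1)))
```

Under these assumptions, an archive would typically answer such a request with the
stored version closest to the requested date rather than an exact match.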
Search is usually regarded as the process of retrieving a particular required piece of
information or document, or of finding a starting point for browsing. In the case of
temporal archives it often involves specifying time constraints on the
publication dates of the documents to be retrieved. Most online news archives enable
temporal search over their collections, returning news articles published within the
requested time frames. Magazines such as Time 4, Newsweek 5 and plenty of less
popular ones provide online search facilities for their proprietary news article
collections, which have often been accumulated over long time spans (e.g., 86 years
for Time and 29 years for Newsweek). In addition, libraries throughout the world
have recently begun digitizing their content and organizing it into searchable and
browsable digital collections. On top of that, large web companies such as Google,
Yahoo! 6 and Microsoft 7 have started assembling data from multiple distributed news
providers.
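Temporal search of the kind described above can be illustrated with a small sketch:
a keyword query combined with a constraint on publication dates. The Article record,
the temporal_search function and the sample data are all hypothetical and serve only
to show the filtering step; real news archives would rely on an index rather than a
linear scan.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    title: str
    published: date
    text: str


def temporal_search(articles, query, start, end):
    """Return articles containing `query` and published within [start, end]."""
    needle = query.lower()
    hits = [
        a for a in articles
        if start <= a.published <= end and needle in a.text.lower()
    ]
    # Present the results in chronological order, oldest first.
    return sorted(hits, key=lambda a: a.published)


if __name__ == "__main__":
    # Made-up sample collection, for illustration only.
    collection = [
        Article("Flood report", date(1998, 5, 2), "Heavy flood in the valley."),
        Article("Election night", date(2001, 11, 7), "Votes were counted."),
        Article("Flood again", date(2004, 8, 19), "Another flood hit the coast."),
    ]
    for a in temporal_search(collection, "flood", date(2000, 1, 1), date(2005, 12, 31)):
        print(a.published, a.title)
```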
In the case of web archives, although many collections such as the Internet Archive
provide only URL-based access, some of them, such as the Portuguese Web Archive 8 and
Pandora 9, already enable textual search.
2.2 Automatic Knowledge Extraction
Browsing and searching offer direct access to stored data. In many cases these are
the standard and only ways to utilize the archived content. However, manually locating
and viewing particular documents in the archives can be tiresome for users. In
addition, retrieved “micro-information” such as single web page versions or past news
articles may not always be interesting or useful for users. On the other hand, the
results of a more comprehensive analysis of larger portions of the archive content could
prove attractive. For example, users may not be interested in the details of a particular
event in the past but may want to know the frequency of similar events, their cause-effect
relations, associated trends, differences from similar present-day events and so on.
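As a rough illustration of such aggregate analysis, the sketch below counts how often
events of one kind occur per year, given the dates of archived documents that report
them. The function name yearly_frequency and the sample dates are hypothetical; how
the relevant documents are identified in the first place is outside this sketch.

```python
from collections import Counter
from datetime import date


def yearly_frequency(event_dates):
    """Count events per calendar year, returned in chronological order."""
    counts = Counter(d.year for d in event_dates)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up dates of reports about similar events, for illustration only.
    reports = [
        date(1998, 5, 2),
        date(2004, 8, 19),
        date(2004, 9, 3),
        date(2002, 1, 14),
    ]
    for year, n in yearly_frequency(reports).items():
        print(year, n)
```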
3 http://www.archive.org
4 http://www.time.com/time
5 http://www.newsweek.com
6 http://www.yahoo.com
7 http://www.msnbc.msn.com/
8 http://arquivo-web.fccn.pt/portuguese-web-archive-2?set_language=en
9 http://pandora.nla.gov.au/