4.2.1 Enterprise Search
In recent years, various studies have examined the
characteristics and challenges of enterprise search (e.g., [ 12 , 14 , 25 ]). A key development
in the study of enterprise search was the organization of the Enterprise Track as
part of TREC 2005-2008 (see [ 7 ] for an overview of the first instance of this track).
The provision of common data corpora within TREC enabled the study of important
research challenges such as the development of better ranking methods, a better
understanding of the users, and research on the creation of relevance assessments.
A side effect of this focus on existing datasets was the limited attention to other
important issues of desktop and enterprise systems, e.g., the crawling and indexing
of data from distributed sources. Mukherjee and Mao [ 25 ] refer to this constant data
accumulation process as a key task of an enterprise search system. They define an
enterprise as an environment with the following characteristics:
Heterogeneous document types: Data can be held in many types of documents,
such as web pages, wikis, PDFs, emails, Word documents, etc.
Multiple document repositories: Documents are normally not held in a single file
server or system. Depending on their importance and criticality, documents
may be stored on dedicated servers, such as file servers or web
servers. Users may have to mount multiple file servers to reach all the
documents they need.
Access restriction: Enterprise environments consist of hierarchies and roles. Therefore,
each document has its own access list. An enterprise search system must be
able to retain these rules and apply them to its search results.
Data generation process: The pace and frequency of document creation and update
also pose challenges for managing index updates, since new documents should
be searchable within a reasonable amount of time.
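The index-update challenge in the last characteristic can be illustrated with a minimal sketch. It assumes that each crawl reports a modification time for every document and that only new or changed documents are re-indexed. The class IncrementalIndex and its methods are hypothetical names introduced for illustration; they are not taken from [ 25 ] or from any particular search system.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalIndex:
    """Toy inverted index that re-indexes only new or changed documents.

    Hypothetical illustration: names and structure are assumptions.
    """
    postings: dict = field(default_factory=dict)    # term -> set of doc ids
    indexed_at: dict = field(default_factory=dict)  # doc id -> mtime at last indexing
    doc_terms: dict = field(default_factory=dict)   # doc id -> terms currently indexed

    def update(self, doc_id, text, mtime):
        """Index the document if it is new or modified; return True if work was done."""
        if self.indexed_at.get(doc_id, float("-inf")) >= mtime:
            return False  # unchanged since the last crawl, so skip it
        # Remove stale postings left over from the previous version.
        for term in self.doc_terms.get(doc_id, set()):
            self.postings[term].discard(doc_id)
        terms = set(text.lower().split())
        for term in terms:
            self.postings.setdefault(term, set()).add(doc_id)
        self.doc_terms[doc_id] = terms
        self.indexed_at[doc_id] = mtime
        return True

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))
```

Skipping unchanged documents keeps each crawl cheap, which is one way to shorten the delay before new content becomes searchable. A production system would also have to handle deletions and concurrent updates, which this sketch omits.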
Mukherjee and Mao argue that, apart from the need to handle diverse data types (e.g., HTML pages,
emails, database entries, and other documents), detailed information is required about
the location of these datasets in the intranet of the enterprise and the access rights to
these repositories.
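To make the access-restriction requirement concrete, the following minimal sketch trims a ranked result list so that a user sees only documents that at least one of their groups may read. It assumes that the access list of each document was copied into the index at crawl time. The names Hit, allowed_groups, and trim_results are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search result (hypothetical structure for illustration)."""
    doc_id: str
    score: float
    allowed_groups: frozenset  # groups permitted to read the document


def trim_results(hits, user_groups):
    """Keep only hits the user may read, preserving the ranking order."""
    user_groups = set(user_groups)
    # A hit survives if the user shares at least one group with its access list.
    return [h for h in hits if h.allowed_groups & user_groups]
```

Filtering after retrieval is only one possible design. It is simple, but it relies on the indexed access lists being kept in sync with the source repositories.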
According to Hawking [ 15 ], existing enterprise search systems can be classified
into two categories: (1) systems that create one centralized index and (2) systems that
depend on distributed independent indices. A centralized index can be used when
it is possible to crawl all of the relevant data sources into a single index structure.
However, information is in most cases stored at different locations, and
physical constraints such as geographical location, low-bandwidth connections, and
administrative restrictions mean that gathering all data in one search index is not always feasible
[ 15 ]. Because of the advantages distributed indices can offer, we chose
this option for the implementation of our enterprise search system.
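The distributed option can be sketched as a broker that sends the query to every independent index and merges the returned lists. The sketch below is an illustration under stated assumptions, not the system described in this chapter. It assumes that each index is reachable through a callable returning (document, score) pairs. Because scores from independent indices are generally not comparable, it rescales each list with min-max normalization; this is a simple choice made here for illustration and is not prescribed by [ 15 ]. The names normalize and federated_search are hypothetical.

```python
def normalize(results):
    """Min-max scale one index's scores to [0, 1] so lists become comparable."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If all scores are equal, treat every hit as a top hit.
    return [(doc, 1.0 if span == 0 else (score - lo) / span)
            for doc, score in results]


def federated_search(indices, query, k=10):
    """Query every index, normalize each result list, and merge the top k.

    `indices` maps an index name to a callable query -> [(doc, score), ...].
    """
    merged = []
    for name, search in indices.items():
        try:
            results = search(query)
        except ConnectionError:
            continue  # an unreachable site should not fail the whole query
        merged.extend((score, name, doc) for doc, score in normalize(results))
    # Highest score first; ties are broken by index name, then document id.
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(name, doc, score) for score, name, doc in merged[:k]]
```

Skipping an unreachable index reflects the constraints named above: a slow or disconnected site reduces coverage but does not block answers from the remaining sites.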
Another important requirement of an enterprise search system is that it has to
be able to handle security and rights management issues [ 14 , 25 ]. Addressing this
requirement, Bailey et al. [ 4 ] introduce different architectures for the application