The last type of repository is the employees' own workstations. Only the
employee working on a given workstation has direct access to the local files
stored on that machine. These files can hold valuable and important information,
and they cannot be replaced with data from other repositories, as they may
contain work-in-progress files and work-related notes.
Without an enterprise search engine, users have to rely on system-native search
interfaces to access the individual repositories. This means, however, that users have
to repeat their search query multiple times, i.e., once for each repository, until
they find the information they are looking for. It also means that, without
single sign-on, users need to re-authenticate with each repository. The
more repositories an enterprise operates, the longer this information gath-
ering task can take. By applying distributed search techniques to these repositories,
a single user interface can be provided through which all of them are queried at
once. The user then only needs to process a single result list that aggregates the
relevant entries from all relevant repositories.
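To make this concrete, the following minimal sketch illustrates such a federated query: the query is fanned out to every repository in parallel and the partial results are merged into one score-ordered list. The Repository class, its naive term-frequency scoring, and the merging policy are illustrative assumptions for the sketch, not part of any specific system discussed here.

```python
import concurrent.futures
from dataclasses import dataclass

@dataclass
class Hit:
    repository: str  # repository the entry came from
    title: str
    score: float     # relevance score assigned by that repository

class Repository:
    """Stand-in for one searchable repository (file server, mail server,
    desktop index, ...). A real connector would translate the query into
    the system-native search interface instead."""
    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # title -> full text

    def search(self, query):
        # Naive scoring for the sketch: term frequency of the query string.
        q = query.lower()
        for title, text in self.documents.items():
            score = text.lower().count(q)
            if score:
                yield Hit(self.name, title, float(score))

def federated_search(repositories, query, limit=20):
    """Fan the query out to every repository in parallel and merge the
    partial result lists into a single score-ordered list."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        partials = pool.map(lambda r: list(r.search(query)), repositories)
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda h: h.score, reverse=True)[:limit]

repos = [
    Repository("file-server", {"budget.xlsx": "quarterly budget plan"}),
    Repository("workstation", {"notes.txt": "draft budget notes, budget ideas"}),
]
for hit in federated_search(repos, "budget"):
    print(hit.repository, hit.title, hit.score)
```

Note that merging raw scores from heterogeneous repositories is only meaningful after score normalization; a production system would have to address this result-merging problem explicitly.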
4.3.2 Crawling and Indexing
In order to access documents using an information retrieval system, they need to be
crawled and indexed first. As outlined above, the specific nature of the repositories in
an enterprise calls for a distributed search infrastructure with multiple disjoint indices
that need to be created separately. In this section, we discuss important aspects that
need to be considered to prepare these indices.
It is important that the crawling task is properly adapted to the system's resources.
Not all systems have the same amount of memory and processing power. The differ-
ences are particularly evident between dedicated file servers and desktop computers.
Crawling processes on desktop computers must run unobtrusively in the background
and consume memory and CPU time only when little or no user activity is taking
place. File servers, on the other hand, are built to run many background tasks and
face few constraints on memory size.
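As an illustration, a crawler can achieve this unobtrusive behavior by monitoring system load and backing off whenever the machine is busy. The sketch below assumes the third-party psutil library and a caller-supplied index_file callback; the threshold and back-off values are placeholders to be tuned per machine.

```python
import time
import psutil  # third-party: pip install psutil

# Illustrative thresholds; a real deployment would tune these per machine.
DESKTOP_CPU_THRESHOLD = 25.0  # percent CPU use above which we back off
BACKOFF_SECONDS = 30

def crawl_politely(paths, index_file, is_desktop=True):
    """Crawl 'paths', staying unobtrusive on desktop machines by pausing
    whenever the CPU is busy with other work. 'index_file' is a
    caller-supplied callback that indexes a single file."""
    for path in paths:
        if is_desktop:
            # Measure CPU utilization over one second and back off while
            # the user (or other processes) need the machine.
            while psutil.cpu_percent(interval=1.0) > DESKTOP_CPU_THRESHOLD:
                time.sleep(BACKOFF_SECONDS)
        index_file(path)
```

On a file server, is_desktop would simply be set to False, letting the crawler run at full speed.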
According to [12, 15], one of the most important properties of data in an enterprise
environment is the varying degree of structure in the documents. Documents on
file servers are mainly unstructured data with mostly no explicit references to other
documents. This poses the challenge of how to create a good index structure during
the crawling process. Deriving a reliable structure from such unstructured data can
benefit users when they want to categorize their search results.
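One hedged illustration of deriving such structure: even when documents carry no explicit links, the crawler can extract simple facets from file-system metadata and store them in the index for later categorization. The facet names and bucket boundary below are arbitrary choices for this sketch.

```python
import time
from pathlib import Path

def facet_metadata(path):
    """Derive simple structure (facets) from an otherwise unstructured
    file, using only file-system metadata."""
    p = Path(path)
    stat = p.stat()
    return {
        "file_type": p.suffix.lstrip(".").lower() or "unknown",
        "size_bucket": "large" if stat.st_size > 10 * 1024 * 1024 else "small",
        "modified_month": time.strftime("%Y-%m", time.localtime(stat.st_mtime)),
    }
```

Facets like these allow the result list to be grouped by file type or recency even when the document contents themselves provide no usable structure.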
Regarding document-level security, we decided to use the second architecture type
proposed by Bailey et al. [4]. In this architecture, the search engine itself controls
which documents may be included in the search results for which users. During
indexing, we therefore also gather the access control lists (ACLs) of the crawled
files and include them as part of the files' metadata. We need to emphasize that, in
order to keep the access lists of all indexed files up to date, a suitable re-crawling
interval must be configured for the crawling process.
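The sketch below illustrates this second architecture type under stated assumptions: ACLs captured at crawl time are stored as document metadata, and at query time the engine returns a hit only if the querying user, or one of their groups, appears on the document's ACL. The data layout and matching rule are simplified illustrations, not the cited system's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class IndexedDocument:
    path: str
    content: str
    acl: set = field(default_factory=set)  # principals captured at crawl time

def search(index, query, user, user_groups=()):
    """The engine itself decides visibility: a document is returned only
    if the querying user or one of their groups is on its ACL."""
    principals = {user} | set(user_groups)
    q = query.lower()
    for doc in index:
        if q in doc.content.lower() and principals & doc.acl:
            yield doc.path

index = [
    IndexedDocument("/srv/hr/salaries.txt", "salary overview", {"hr-group"}),
    IndexedDocument("/srv/pub/handbook.txt", "salary policy handbook", {"all-staff"}),
]
print(list(search(index, "salary", "alice", ["all-staff"])))  # handbook only
```

Because the stored ACLs are only as fresh as the last crawl, the configured re-crawling interval directly bounds how long a revoked permission can still make a document visible in the results.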