The last type of repository is the employees' own workstations. Only the
employee working on a given workstation has direct access to the local files
stored on that machine. These files can hold valuable and important information,
and they cannot be replaced with data from other repositories, as they may
contain work-in-progress files and work-related notes.
Without an enterprise search engine, users have to rely on system-native search
interfaces to access the individual repositories. This means, however, that users have
to repeat their search query multiple times, i.e., once for each repository, until
they find the information they are looking for. It also means that, without
single sign-on, users need to re-authenticate with each repository. The
more repositories an enterprise operates, the longer this information gath-
ering task can take. By applying distributed search techniques to these repositories,
a single user interface can be provided through which all of them are queried at
once. The user then only needs to process a single result list that aggregates the
relevant entries from all relevant repositories.
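To make this concrete, the following minimal sketch illustrates such a federated query: the query is fanned out to every repository in parallel and the partial results are merged into one score-ordered list. The Repository class, its naive term-frequency scoring, and the merging policy are illustrative assumptions for the sketch, not part of any specific system discussed here.

```python
import concurrent.futures
from dataclasses import dataclass

@dataclass
class Hit:
    repository: str  # repository the entry came from
    title: str
    score: float     # relevance score assigned by that repository

class Repository:
    """Stand-in for one searchable repository (file server, mail server,
    desktop index, ...). A real connector would translate the query into
    the system-native search interface instead."""
    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # title -> full text

    def search(self, query):
        # Naive scoring for the sketch: term frequency of the query string.
        q = query.lower()
        for title, text in self.documents.items():
            score = text.lower().count(q)
            if score:
                yield Hit(self.name, title, float(score))

def federated_search(repositories, query, limit=20):
    """Fan the query out to every repository in parallel and merge the
    partial result lists into a single score-ordered list."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        partials = pool.map(lambda r: list(r.search(query)), repositories)
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda h: h.score, reverse=True)[:limit]

repos = [
    Repository("file-server", {"budget.xlsx": "quarterly budget plan"}),
    Repository("workstation", {"notes.txt": "draft budget notes, budget ideas"}),
]
for hit in federated_search(repos, "budget"):
    print(hit.repository, hit.title, hit.score)
```

Note that merging raw scores from heterogeneous repositories is only meaningful after score normalization; a production system would have to address this result-merging problem explicitly.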
4.3.2 Crawling and Indexing
In order to access documents using an information retrieval system, they need to be
crawled and indexed first. As outlined above, the specific nature of the repositories in
an enterprise calls for a distributed search infrastructure with multiple disjoint indices
that need to be created separately. In this section, we discuss important aspects that
need to be considered to prepare these indices.
It is important that the crawling task is properly adapted to the system's resources.
Not all systems have the same amount of memory and processing power. The differ-
ences are particularly evident between dedicated file servers and desktop computers.
Crawling processes on desktop computers must run unobtrusively in the background
and consume memory and CPU time only when little or no user activity is taking
place. File servers, on the other hand, are built to run many background tasks and
face few constraints on memory size.
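As an illustration, a crawler can achieve this unobtrusive behavior by monitoring system load and backing off whenever the machine is busy. The sketch below assumes the third-party psutil library and a caller-supplied index_file callback; the threshold and back-off values are placeholders to be tuned per machine.

```python
import time
import psutil  # third-party: pip install psutil

# Illustrative thresholds; a real deployment would tune these per machine.
DESKTOP_CPU_THRESHOLD = 25.0  # percent CPU use above which we back off
BACKOFF_SECONDS = 30

def crawl_politely(paths, index_file, is_desktop=True):
    """Crawl 'paths', staying unobtrusive on desktop machines by pausing
    whenever the CPU is busy with other work. 'index_file' is a
    caller-supplied callback that indexes a single file."""
    for path in paths:
        if is_desktop:
            # Measure CPU utilization over one second and back off while
            # the user (or other processes) need the machine.
            while psutil.cpu_percent(interval=1.0) > DESKTOP_CPU_THRESHOLD:
                time.sleep(BACKOFF_SECONDS)
        index_file(path)
```

On a file server, is_desktop would simply be set to False, letting the crawler run at full speed.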
According to [12, 15], one of the most important properties of data in an enterprise
environment is the varying degree of structure in the documents. Documents on
file servers are mainly unstructured data with mostly no explicit references to other
documents. This poses the challenge of how to create a good index structure during
the crawling process. Deriving a reliable structure from such unstructured data can
benefit users when they want to categorize their search results.
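One hedged illustration of deriving such structure: even when documents carry no explicit links, the crawler can extract simple facets from file-system metadata and store them in the index for later categorization. The facet names and bucket boundary below are arbitrary choices for this sketch.

```python
import time
from pathlib import Path

def facet_metadata(path):
    """Derive simple structure (facets) from an otherwise unstructured
    file, using only file-system metadata."""
    p = Path(path)
    stat = p.stat()
    return {
        "file_type": p.suffix.lstrip(".").lower() or "unknown",
        "size_bucket": "large" if stat.st_size > 10 * 1024 * 1024 else "small",
        "modified_month": time.strftime("%Y-%m", time.localtime(stat.st_mtime)),
    }
```

Facets like these allow the result list to be grouped by file type or recency even when the document contents themselves provide no usable structure.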
Regarding document-level security, we decided to use the second architecture type
proposed by Bailey et al. [4]. In this architecture, the search engine itself controls
which documents may be included in the search results for which users. During
indexing, we therefore also gather the access control lists (ACLs) of the crawled
files and include them as part of the files' metadata. We need to emphasize that, in
order to keep the access lists of all indexed files up to date, a suitable re-crawling
interval must be configured for the crawling process.
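The sketch below illustrates this second architecture type under stated assumptions: ACLs captured at crawl time are stored as document metadata, and at query time the engine returns a hit only if the querying user, or one of their groups, appears on the document's ACL. The data layout and matching rule are simplified illustrations, not the cited system's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class IndexedDocument:
    path: str
    content: str
    acl: set = field(default_factory=set)  # principals captured at crawl time

def search(index, query, user, user_groups=()):
    """The engine itself decides visibility: a document is returned only
    if the querying user or one of their groups is on its ACL."""
    principals = {user} | set(user_groups)
    q = query.lower()
    for doc in index:
        if q in doc.content.lower() and principals & doc.acl:
            yield doc.path

index = [
    IndexedDocument("/srv/hr/salaries.txt", "salary overview", {"hr-group"}),
    IndexedDocument("/srv/pub/handbook.txt", "salary policy handbook", {"all-staff"}),
]
print(list(search(index, "salary", "alice", ["all-staff"])))  # handbook only
```

Because the stored ACLs are only as fresh as the last crawl, the configured re-crawling interval directly bounds how long a revoked permission can still make a document visible in the results.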