Information Aggregation in an Enterprise - Smart Information Systems: Computational Intelligence for Real-Life Applications


4.3 Technical Challenges

As mentioned in Sect. 4.2.1, there are two approaches to realizing search systems for enterprise users. One possibility is to create a single index that contains documents from various repositories. However, restrictions such as physical locations, different administration policies, and bandwidth limitations make the data crawling process difficult to perform efficiently [15]. Creating several distributed indices is therefore more feasible, as it eliminates the need to transfer large amounts of data to build a centralized index. In this section, we outline the conditions and requirements for creating a distributed search engine in an enterprise environment. The main focus is on the technical issues that occur when such a search system is set up. Section 4.3.1 first describes the types of data collections that often occur in enterprise environments. Section 4.3.2 then outlines the steps required for building multiple indices. The querying process of a distributed search engine is illustrated in Sect. 4.3.3.

4.3.1 Typical Data Repositories

An enterprise is an organizational entity with a defined structure and boundaries, involving many parties with a common interest. Because of this defined structure and these boundaries, the information available within an enterprise environment can typically be categorized by its content and its access rights.

The first type of information is publicly available and can hence be accessed by employees as well as by other parties who show interest in the company. A typical example is the company's webpage, which can be accessed from anywhere in the world. Repositories of this type can be searched freely, regardless of the user's permissions.

The second type of repository contains information that can only be accessed internally, within the company's physical network. We can further divide this type into two categories: (1) repositories that do not need authentication and (2) repositories that require authentication. Intranet webpages, wiki pages, and similar data repositories found in the company's intranet fall under the first category. As long as users connect from within the company's IP ranges, they can freely open and access the information. The second category represents repositories that contain protected data, i.e., some sort of authentication is required before they can be accessed. This means that connecting from a company network address alone is not enough; users must also validate their credentials by logging in. Typical examples of such repositories are file servers. Each file on these servers inherits explicit read and write rights for individuals as well as for defined groups. Logging in authenticates the user, and through this authentication the user's rights, including group memberships, can be obtained. This credential information determines which data or files a user can access. A search engine that accesses these repositories must therefore respect these permissions to avoid leaking protected information.
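The three access categories described above can be illustrated with a minimal Python sketch of a permission-aware result filter. It is not taken from the chapter: the names RepoType, User, Document, can_read, and filter_results, as well as the 10.0.0.0/8 address range, are hypothetical and chosen only for illustration, and only read rights are modeled.

```python
from dataclasses import dataclass
from enum import Enum
from ipaddress import ip_address, ip_network


class RepoType(Enum):
    PUBLIC = "public"                # e.g., the company's public webpage
    INTERNAL_OPEN = "internal_open"  # intranet and wiki pages, no login needed
    INTERNAL_AUTH = "internal_auth"  # e.g., file servers with per-file rights


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = frozenset()  # group memberships obtained at login
    authenticated: bool = False


@dataclass(frozen=True)
class Document:
    doc_id: str
    repo_type: RepoType
    allowed_users: frozenset = frozenset()   # individuals with read rights
    allowed_groups: frozenset = frozenset()  # groups with read rights


# Assumed company IP ranges (hypothetical, for illustration only).
COMPANY_NETWORKS = [ip_network("10.0.0.0/8")]


def in_company_network(client_ip: str) -> bool:
    """Return True if the client address lies in one of the company ranges."""
    addr = ip_address(client_ip)
    return any(addr in net for net in COMPANY_NETWORKS)


def can_read(user: User, client_ip: str, doc: Document) -> bool:
    """Decide whether a search hit may be shown to this user."""
    if doc.repo_type is RepoType.PUBLIC:
        return True  # searchable regardless of permissions
    if not in_company_network(client_ip):
        return False  # internal repositories require the company network
    if doc.repo_type is RepoType.INTERNAL_OPEN:
        return True  # network membership alone is sufficient
    # Protected repository: the user must be logged in and hold read rights,
    # either individually or through a group membership.
    if not user.authenticated:
        return False
    return user.name in doc.allowed_users or bool(user.groups & doc.allowed_groups)


def filter_results(user: User, client_ip: str, hits: list) -> list:
    """Drop every hit the user is not allowed to read."""
    return [doc for doc in hits if can_read(user, client_ip, doc)]
```

In this sketch the check is applied to search hits at query time. Whether permissions are evaluated at query time or stored with the index is a design choice that the text above does not settle.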
