Text Mining in the Context of Business Intelligence

INTRODUCTION

Information about the external environment and organizational processes is among the most valuable inputs for business intelligence (BI). Nowadays, companies hold plenty of information in structured or textual form, obtained either from external monitoring or from corporate systems. In recent years, the structured part of this information stock has been massively explored by means of data-mining (DM) techniques (Wang, 2003), generating models that enable analysts to gain insight into solutions for organizational problems. On the text-mining (TM) side, new applications have been developed at a slower pace. In an informal poll carried out in 2002 (Kdnuggets), just 4% of knowledge-discovery-from-databases (KDD) practitioners were applying TM techniques. This fact is as intriguing as it is surprising if one considers that 80% of all information available in an organization comes in textual form (Tan, 1999).
In their popular model explaining the phases of technology adoption (Figure 1), Moore and McKenna (1999) discuss the existence of a chasm between the “early adopters, visionaries,” and the “early majority pragmatists” phases that a technology has to cross in order to become extensively adopted. From our point of view, TM is still crossing this chasm. Although mature tools exist in the market and an increasing number of successful case studies have been presented (Ferneda, Prado, & Silva, 2003; Fliedl & Weber, 2002; Dini & Mazzini, 2002; Prado, Oliveira, Ferneda, Wives, Silva, & Loh, 2004), it seems that the community is still leaving the second phase. However, the results presented in the case studies indicate that the broad adoption of TM will happen in the near future.

BACKGROUND

When studying the relations between TM and BI, it is necessary to take into account an important intermediate layer between them: the knowledge-management (KM) process. KM refers to the set of activities responsible for carrying information throughout the organization and making knowledge available where it is necessary.
To clarify the relations between TM and BI from the point of view of a KM model, we adopted the generic KM model (Figure 2) proposed by Stollenwerk (2001). The model is made up of seven processes: (a) identification and development of the critical abilities, (b) capture of knowledge, skills, and experiences to create and maintain skills, (c) selection and validation, which filter, evaluate, and summarize the acquired knowledge for future use, (d) organization and storage, which assure the quick and correct recovery of the stored knowledge, (e) sharing, which eases access to information and knowledge, (f) application, in which the knowledge is applied in real situations, and (g) creation, which comprises the activities of sharing tacit knowledge, creating concepts, building archetypes, and cross-leveling knowledge. Surrounding these processes are the aspects of leadership, organizational culture, measuring and compensation, and technology. The main relation between TM and KM is located in the creation process. By applying the TM techniques discussed in the next section, it is possible to find patterns that, adequately interpreted, can leverage the concept-creation activity.

Figure 1. Moore’s and McKenna’s (1999) life cycle of technology adoption

Figure 2. Generic KM model of Stollenwerk

METHODS AND TECHNIQUES FOR TEXT MINING

Text mining can be defined as the application of computational methods and techniques over textual data in order to find relevant and intrinsic information and previously unknown knowledge.
Text-mining techniques can be organized into four categories: classification, association analysis, information extraction, and clustering. Classification techniques consist of the allocation of objects into predefined classes or categories. They are used to identify the class or category of texts in tasks such as topic spotting (the identification of a known topic or subject in a document) and document routing or filtering (the selection of documents relevant to a process or to a person).
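To make the idea of topic spotting concrete, the following minimal sketch assigns a text to one of a few predefined classes by counting keyword matches. The keyword lists, the class names, and the function `spot_topic` are hypothetical and chosen only for illustration; practical classifiers are usually learned from labeled examples rather than written by hand.

```python
# Hypothetical predefined classes, each described by a few keywords.
TOPIC_KEYWORDS = {
    "finance": {"revenue", "profit", "budget"},
    "technology": {"software", "server", "network"},
}


def spot_topic(text, topics=TOPIC_KEYWORDS):
    """Return the predefined topic sharing the most keywords with the text.

    Returns None when no keyword of any topic occurs in the text.
    """
    words = set(text.lower().split())
    scores = {topic: len(words & keywords) for topic, keywords in topics.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


topic = spot_topic("The server and network software were upgraded")
```

Here `topic` is `"technology"`, because three of that class's keywords occur in the text and none of the finance keywords do.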
Association analysis is used to identify correlations or dependencies among elements or attributes (words or concepts present in documents). It helps to identify words or concepts that co-occur and, consequently, to understand the contents of a document or set of documents and their relationships.
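One simple way to look for such associations is to count how many documents contain each pair of words. The sketch below does only that; whitespace tokenization and the function name `cooccurrence_counts` are assumptions made for the example, and a real analysis would normally add statistical measures on top of the raw counts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for text in documents:
        # Sorting makes each pair appear in one canonical order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


docs = [
    "text mining supports business intelligence",
    "data mining supports business analysts",
    "text clustering groups documents",
]
pairs = cooccurrence_counts(docs)
```

In this toy collection, `pairs[("business", "supports")]` is 2 because both words appear together in the first two documents, whereas `pairs[("mining", "text")]` is 1.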
Information-extraction techniques are able to find relevant data or expressions inside documents. Typical uses of these techniques involve the creation of databases from texts or the identification of specific information (like names, dates, and e-mails) in a large set of documents.
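As a small illustration of extracting specific information, the sketch below pulls e-mail addresses and dates out of free text with regular expressions. The patterns are deliberately simplified assumptions (dates only in the YYYY-MM-DD form, a loose e-mail pattern), and the name `extract_fields` is hypothetical.

```python
import re

# Simplified patterns, assumed for this example only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_fields(text):
    """Return the e-mail addresses and ISO-style dates found in the text."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }


sample = "Contact ana@example.com before 2004-03-15 or write to sales@example.org."
record = extract_fields(sample)
```

Each `record` could then be stored as one row of a database built from the document collection.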
Clustering is the process of finding relationships among texts or words and grouping the related ones together. Clustering techniques are used to understand how the information or knowledge of an entire collection of documents is organized.
In this chapter we focus on clustering techniques because they are more appropriate to the (concept) creation phase: they help the user analyze and understand previously unknown data by exploiting the relationships among the documents and words present in a large collection of documents. The next section describes the clustering process in more detail.

CLUSTERING

Clustering is a knowledge-discovery process that identifies relationships among objects and builds clusters of objects based on these relationships (Jain, Murty, & Flynn, 1999; Willet, 1988). It is based on the cluster hypothesis (Rijsbergen, 1979), which states that similar objects tend to remain together in the same cluster as a consequence of a specific concept distance metric.
Clustering is a widely employed tool to analyze data in many fields (Everitt, Landau, & Leese, 2001; Jain et al., 1999). The idea behind cluster analysis is to find knowledge about previously unfamiliar data. Clustering methods are able to give suggestions about how specific sets of data are organized or correlated. It is possible to identify the similarity and the dissimilarity among many objects or data patterns and, based on that, construct classes or categories. Categories or classes are very important as they are the basic elements to build new concepts, and concepts are the basis of human knowledge (Aldenderfer & Blashfield, 1984).
However, since clustering indicates relationships among objects, it can be used for many other objectives. Aldenderfer and Blashfield (1984), for example, classify the goals of cluster analysis into four categories: (a) to develop a typology or classification, (b) to investigate useful conceptual schemes for grouping entities, (c) to aid in the generation of hypotheses, and (d) to test hypotheses, verifying if types defined through other procedures are really present in the data set.

Clustering Types

There are many clustering methods and this fact generates many types or schemes of clusters. According to Aldenderfer and Blashfield (1984), these methods can be classified into seven families: hierarchical agglomerative, hierarchical divisive, iterative partitioning, density search, factor analytic, clumping, and graph theoretic. Each of them creates a type or scheme of clusters that is very peculiar.
For the sake of generality, we choose to classify and detail the clustering methods according to the categories proposed by Everitt et al. (2001) and Schutze and Silverstein (1997). These categories are (a) hierarchical and (b) nonhierarchical (or partitioning clustering).

Hierarchical Clustering

In hierarchical clustering, the resulting scheme of clusters is very similar to a tree (see Figure 3). Each node represents a cluster. The intermediate clusters are clusters of clusters and the leaves are objects. The relationship among clusters is of paramount importance as it shows the specificities and abstractions among groups of objects. If the user goes up in the tree of clusters, it is possible to identify more abstract or generic groups. On the other hand, if the user goes down, more specific groups will be identified until the objects themselves are reached.
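As an illustration of such a tree, the sketch below builds one with a single-link agglomerative procedure and represents it as nested tuples. The choice of single-link merging, the Jaccard coefficient over word sets, and the names `jaccard`, `leaves`, and `agglomerate` are assumptions made for the example and are not taken from the cited works.

```python
def jaccard(a, b):
    """Similarity of two word sets: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leaves(node):
    """List the object names under a tree node (a name or a pair of nodes)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def agglomerate(objects):
    """Merge the two most similar nodes repeatedly; return the final tree."""
    nodes = list(objects)  # every object starts as its own leaf

    def link(x, y):
        # Single link: similarity of the closest pair of leaves.
        return max(jaccard(objects[a], objects[b])
                   for a in leaves(x) for b in leaves(y))

    while len(nodes) > 1:
        candidates = ((i, j) for i in range(len(nodes))
                      for j in range(i + 1, len(nodes)))
        i, j = max(candidates, key=lambda p: link(nodes[p[0]], nodes[p[1]]))
        merged = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]


docs = {
    "d1": {"text", "mining"},
    "d2": {"text", "mining", "tools"},
    "d3": {"sales", "report"},
    "d4": {"sales", "report", "q1"},
}
tree = agglomerate(docs)
```

For this input the tree is `(("d1", "d2"), ("d3", "d4"))`: the root is the most generic group, its two children are more specific groups, and the leaves are the documents themselves.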

Nonhierarchical or Partitioning Clustering

When working with nonhierarchical clustering, the objects are allocated in isolated clusters and no relationship among clusters can be found. This type of clustering is also known as partitional clustering, and it is said that it generates flat (without structure) partitions of clusters (see Figure 4).

Figure 3. Hierarchic scheme of clusters

Figure 4. Flat partition of isolated clusters

Clustering Algorithms

As already stated, there are many clustering methods. More detail on these and other clustering methods can be obtained in Aldenderfer and Blashfield (1984), Jain et al. (1999), Kowalski (1997), and Willet (1988). In this section we will describe only the methods implemented in the tool used in our experiments: the Eurekha tool (Wives, 1999). The algorithms implemented come from the graph theoretic family of algorithms and are described next.

Stars

The stars algorithm analyzes the objects and tries to find groups of similar elements whose resulting shape resembles a star of correlated objects. In this case, the center of the star is the element that has a relation with all the other objects in the cluster, linking them together. This means that the other elements should be near or similar to this central element, but not necessarily to one another. To limit the dissimilarity between an element on one side of the star and an element on another side, a similarity threshold is defined. Requiring a larger similarity between every element and the center makes the group more coherent: the more similar the elements are to the center (or the nearer they are to it), the more similar they will be to each other.
The algorithm starts by selecting any element in the set of elements. This selection can be performed randomly or by any other method. However, the selection order influences the resulting clustering scheme.
The selected element is then elected as the center of the star (the center of the cluster). Then, this element is compared to all other elements not yet clustered (i.e., not yet allocated to a cluster). If a relation is found, meaning that the similarity is greater than a threshold previously defined by the user, the element being compared to the center is allocated to the cluster. Once all elements have been compared to the star center, another unclustered element is selected, and the process continues until all elements have been analyzed. Elements whose similarity to every other element is not greater than the established threshold are said to be unclustered and are either ignored or allocated to isolated clusters, one for each element.
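The procedure above can be sketched as follows. This is a minimal illustration and not the Eurekha implementation: the Jaccard coefficient over word sets is an assumed similarity measure, centers are picked in input order, unclustered elements are kept as singleton clusters, and the names `jaccard` and `star_clustering` are hypothetical.

```python
def jaccard(a, b):
    """Assumed similarity measure: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def star_clustering(docs, threshold):
    """Group documents into stars; returns a list of (center, members)."""
    unclustered = list(docs)  # selection order = input order (an assumption)
    clusters = []
    while unclustered:
        center = unclustered.pop(0)  # elect the next free element as center
        members = [center]
        for name in list(unclustered):
            # Only the similarity to the center is checked.
            if jaccard(docs[center], docs[name]) > threshold:
                members.append(name)
                unclustered.remove(name)
        clusters.append((center, members))
    return clusters


docs = {
    "d1": {"text", "mining", "tools"},
    "d2": {"text", "mining", "cases"},
    "d3": {"sales", "report"},
    "d4": {"text", "mining", "tools", "survey"},
    "d5": {"sales", "report", "q1"},
}
stars = star_clustering(docs, threshold=0.45)
```

With this input, `d1` becomes the center of a star containing `d2` and `d4`, and `d3` becomes the center of a star containing `d5`. Passing the documents in a different order can produce different stars, which is the order dependence mentioned above.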

Best Star

The main problem of the star algorithm is that the order in which the elements are selected as centers influences the clustering result. Another problem is that the user has to select a threshold of minimum similarity between objects and the center, and there is no optimal threshold that can serve as a default value. Each data set may require a different threshold. These are the greatest problems of cluster analysis that uses this kind (or family) of algorithms.
The best-star algorithm intends to solve these problems by allocating an element, even if it is already clustered, to the star to which it is most similar (the nearest star). A side effect is that the user does not need to establish a threshold. In this case, the elements are reassigned to the cluster to which they are most similar (i.e., the one whose center is nearest).
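The reassignment step can be sketched as follows. The text does not state how the star centers are obtained, so the sketch assumes they are already given (for example, by a previous star pass); the Jaccard similarity and the names `jaccard` and `best_star` are likewise assumptions made for the example.

```python
def jaccard(a, b):
    """Assumed similarity measure: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_star(docs, centers):
    """Assign every non-center element to its most similar center."""
    clusters = {center: [center] for center in centers}
    for name, words in docs.items():
        if name in clusters:
            continue  # centers stay in their own star
        # No threshold is needed: the nearest center always wins.
        # Ties go to the center listed first.
        nearest = max(centers, key=lambda c: jaccard(docs[c], words))
        clusters[nearest].append(name)
    return clusters


docs = {
    "d1": {"text", "mining", "tools"},
    "d2": {"text", "mining", "cases", "sales"},
    "d3": {"sales", "report"},
    "d4": {"sales", "report", "text"},
}
assignment = best_star(docs, centers=["d1", "d3"])
```

Here `d4` shares one word with `d1` and two with `d3`, so it ends up in the star of `d3` regardless of the order in which the centers were visited.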

Cliques

This algorithm is similar to the star algorithm. However, an element is added only if it satisfies the similarity threshold with respect to all elements already in the cluster, and not only with respect to the central element. In this case, the elements are more tightly coupled and the quality of the resulting clusters is better.
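The difference from the star algorithm is a single test, as the following sketch shows. As before, the Jaccard similarity, the input-order selection, and the names `jaccard` and `clique_clustering` are assumptions made for illustration.

```python
def jaccard(a, b):
    """Assumed similarity measure: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def clique_clustering(docs, threshold):
    """Group documents so that every pair inside a cluster passes the threshold."""
    unclustered = list(docs)
    clusters = []
    while unclustered:
        members = [unclustered.pop(0)]
        for name in list(unclustered):
            # The candidate must be similar to ALL current members.
            if all(jaccard(docs[name], docs[m]) > threshold for m in members):
                members.append(name)
                unclustered.remove(name)
        clusters.append(members)
    return clusters


docs = {
    "d1": {"text", "mining", "tools"},
    "d2": {"text", "mining", "cases"},
    "d3": {"sales", "report"},
    "d4": {"text", "mining", "tools", "survey"},
    "d5": {"sales", "report", "q1"},
}
cliques = clique_clustering(docs, threshold=0.45)
```

In this example `d4` is similar enough to `d1` (0.75) but not to `d2` (0.4), so it is left out of their cluster and forms a cluster of its own, whereas a star centered on `d1` would have accepted it.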

Full Stars

Sometimes the user needs to know all the clusters to which an element could be allocated. The other algorithms discussed in this chapter allocate each element to a single cluster, the best one according to each algorithm’s restrictions. The full-stars algorithm meets this need by allocating an element to every cluster with which its relationship is greater than the threshold established by the user.
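A minimal sketch of this overlapping allocation is given below. The text does not detail how centers are chosen, so the sketch assumes they are elected as in the star algorithm (the next element not yet placed in any star), while each center is compared with all elements rather than only the unclustered ones. The Jaccard similarity and the names `jaccard` and `full_stars` are also assumptions.

```python
def jaccard(a, b):
    """Assumed similarity measure: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def full_stars(docs, threshold):
    """Build stars in which an element may belong to several clusters."""
    clustered = set()
    clusters = []
    for center in docs:
        if center in clustered:
            continue  # already covered by an earlier star
        # Compare the center with every other element, clustered or not.
        members = [center] + [
            name for name in docs
            if name != center and jaccard(docs[center], docs[name]) > threshold
        ]
        clustered.update(members)
        clusters.append((center, members))
    return clusters


docs = {
    "d1": {"text", "mining", "tools"},
    "d2": {"text", "mining", "sales", "report"},
    "d3": {"sales", "report", "q1"},
}
overlapping = full_stars(docs, threshold=0.3)
```

Document `d2` is related to both centers above the threshold, so it appears in the star of `d1` and in the star of `d3`.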

FUTURE TRENDS

With regard to the trends in the use of TM in BI, we see the idea of concepts replacing the usual word-based approach to TM. Words can lead to semantic mistakes, known as the vocabulary problem, for example, when people use synonyms or word variations.
In the conceptual approach to clustering, concepts represent the content of a textual document at a higher level, minimizing the vocabulary problem. Concepts refer to real-world events and objects, and are used by people to express ideas, ideologies, thoughts, opinions, and intentions through language (in talks, texts, documents, topics, messages, etc.).
In previous works (Loh, Oliveira, & Gastal, 2001; Loh, Wives, & Oliveira, 2000), concepts were used with success in mining processes of textual documents. Thus, using concepts as document attributes in the clustering process tends to generate better results than using words, since the resulting clusters have more cohesive elements and are more understandable.
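The effect of the vocabulary problem, and how a concept-level representation reduces it, can be seen in the small sketch below. The concept dictionary and the function `to_concepts` are hypothetical; the cited works identify concepts with their own methods, which are not reproduced here.

```python
# Hypothetical dictionary mapping words to the concept they express.
CONCEPTS = {
    "buy": "buying", "purchase": "buying", "acquire": "buying",
    "price": "cost", "cost": "cost", "expensive": "cost",
}


def to_concepts(text, concept_map=CONCEPTS):
    """Represent a text by the set of concepts its words map to."""
    return {concept_map[w] for w in text.lower().split() if w in concept_map}


a = "Customers buy when the price drops"
b = "Clients purchase if the cost falls"

shared_words = set(a.lower().split()) & set(b.lower().split())
shared_concepts = to_concepts(a) & to_concepts(b)
```

At the word level the two sentences share only the word "the", while at the concept level both are described by the same two concepts, `buying` and `cost`, so a clustering algorithm would treat them as similar.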

CONCLUSION

The advent of the knowledge society has imposed an important change in the context of organizations. Business competitiveness is significantly affected by the availability of knowledge about the organizational processes and the external environment. The importance of the information existing in organizations as raw material to create knowledge has been recognized since the late ’80s. As a matter of fact, the use of such knowledge for leveraging the business has led to an increasing number of KDD applications. However, the majority of these applications have been designed to process structured data rather than unstructured data, which is by far the largest part of organizational information.
The existence of mature tools to develop TM applications and the amount of textual information available in organizations represent a strategic opportunity that cannot be ignored. This chapter discussed the role of TM in BI and clarified the interface between them.

KEY TERMS

Association Analysis: Use of statistical criteria to measure the proximity of two distinct objects or texts using some of their properties or attributes. A method for identifying correlations or dependencies among elements or attributes, using statistical techniques.
Classification: A systematic arrangement of objects (texts) into groups according to (pre)established criteria, or the process of allocating elements to predefined classes. Classification needs a predefined taxonomy, in contrast with the clustering technique, which works without previous knowledge. Sometimes it is also associated with the process of identifying classes, that is, discovering attributes that characterize one class and that distinguish it from others.
Cluster Analysis: The process that includes the clustering method and the analysis of its results in order to discover and understand the contents of a set of elements, texts, or objects, and the relations among them.
Cluster: A group of elements that have some characteristics or attributes in common.
Concept: An abstract or generic idea, opinion, or thought generalized from particular instances by the selection of meaningful terms. Concepts may be identified by the use of text-mining techniques and are used to explore and examine the contents of talks, texts, documents, topics, messages, and so forth. Concepts belong to the extralinguistic knowledge about the world, representing real things in formal ways.
Knowledge Discovery: A computer-supported process that uses computational algorithms and tools, such as visualization methods, to help a user discover knowledge from stored data.
Knowledge Discovery from Texts: A computer-supported process that uses computational algorithms and tools, such as visualization methods, to help a user discover knowledge from stored textual data. This process gives the user information that would not be recovered by traditional queries, since the information is not explicitly stated or declared in the textual data.
Text Mining: A computer-supported process that uses computational algorithms and tools over textual data with the objective of discovering statistical patterns. The most common methods include clustering, classification, and association analysis. Most of the time, the expression is used interchangeably with knowledge discovery from texts; however, the latter is a larger process of which the former is a part.
Text Mining by Concepts: The application of text-mining methods over textual documents that are represented or modeled by concepts instead of words.