Recently, the combination of different learning methods is increasingly being
used. A survey of these techniques can be found in [54].
An example of a rule-based information extraction system is SystemT-
IE [107, 172], the result of research carried out at the IBM Almaden Research
Center. This system is now included as part of the InfoSphere BigInsights
suite of products. SystemT-IE comes with an SQL-like declarative language
called the Annotation Query Language (AQL) [30] for specifying text analytics
extraction programs (called extractors) with rule semantics. Extractors
obtain structured information from unstructured or semistructured text. For
this, AQL extends SQL with the EXTRACT statement. Data in AQL are
stored in relations where all tuples have the same schema, analogous to
SQL relational tables. In addition, AQL includes statements for creating
tables, views, user-defined functions, and dictionaries. However, AQL does
not support advanced SQL features like correlated subqueries and recursive
queries. After the extractors are generated, an optimizer produces an efficient
execution plan for them in the form of an annotation operator graph. The
plan is finally executed by a runtime engine, which takes advantage of parallel
architectures.
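AQL syntax itself is beyond the scope of this overview, but the following Python sketch illustrates the underlying idea of a rule-based extractor: a rule (here a regular expression) is applied to raw text, and every match becomes a tuple in a relation whose tuples all share the same schema. The rule, the schema, and the sample document are assumptions made for illustration; this is not SystemT code.

import re
from dataclasses import dataclass
from typing import List

@dataclass
class PhoneMention:            # fixed schema shared by all tuples
    doc_id: int
    text: str                  # matched span
    begin: int                 # character offsets, since extracted
    end: int                   # annotations typically carry spans

# The extraction rule: a regular expression for US-style phone numbers
# (a hypothetical example, not a rule from SystemT-IE).
PHONE_RULE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def extract_phones(doc_id: int, text: str) -> List[PhoneMention]:
    """Apply the rule to the text and emit one tuple per match."""
    return [PhoneMention(doc_id, m.group(), m.start(), m.end())
            for m in PHONE_RULE.finditer(text)]

if __name__ == "__main__":
    doc = "Call 555-123-4567 or 555-987-6543 for support."
    for mention in extract_phones(1, doc):
        print(mention)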
Information extraction techniques like the one introduced above can be
used to populate a data warehouse for multidimensional text analysis using
OLAP. Typically, an ETL process will extract textual data from various
sources and, after cleansing and transformation, will load such data into
the warehouse. The phases of this process will include textual data and
metadata extraction from documents; transformation of the extracted data
through classic text retrieval techniques like cleaning texts, stemming, term
weighting, language modeling, and so on; and loading the data resulting from
the transformation phase into the data warehouse. In a recent work [92], W.H.
Inmon details, at a high level of abstraction, the tasks that an ETL process
for text data warehouses must perform.
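As a rough illustration of these phases, the following Python sketch extracts text from sample documents, transforms it with elementary retrieval techniques (cleaning, a deliberately naive suffix-stripping stemmer standing in for a real one such as Porter's, and term-frequency weighting), and loads the result into a warehouse table. The table schema, the stopword list, and the documents are assumptions made for the example.

import re
import sqlite3

STOPWORDS = {"the", "a", "of", "and", "is"}

def clean(text):
    """Cleaning: lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(term):
    """Naive suffix stripping; a real pipeline would use e.g. Porter."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def transform(doc_id, text):
    """Produce (doc_id, term, weight) tuples; weight = term frequency."""
    weights = {}
    for term in clean(text):
        if term not in STOPWORDS:
            t = stem(term)
            weights[t] = weights.get(t, 0) + 1
    return [(doc_id, t, w) for t, w in weights.items()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TermFact (DocId INT, Term TEXT, Weight REAL)")

docs = {1: "Appliances such as TVs are selling well.",
        2: "The TV is the best-selling appliance."}
for doc_id, text in docs.items():                 # extract + transform
    conn.executemany("INSERT INTO TermFact VALUES (?, ?, ?)",
                     transform(doc_id, text))     # load

for row in conn.execute("SELECT * FROM TermFact ORDER BY DocId, Term"):
    print(row)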
We next discuss some proposals in the field of text data warehouses.
In one of the earliest works in the field, Tseng and Chou [206] propose
a document warehouse for multidimensional analysis of textual data. Their
approach is to combine text processing with numeric OLAP processing. For
this, they organize unstructured documents into structured data consisting
of dimensions, hierarchies, facts, and measures. Dimensions are composed of
a hierarchy of keywords referring to a concept, which are obtained using
text mining tools. Facts include the identifiers of the documents under
analysis and the number of times that a combination of keywords appears
in such documents. For example, suppose we are analyzing documents in
order to discover products and cities appearing together. A hierarchy of
keywords referring to products can be represented in a Product dimension
with schema (ProductKey, Keyword, KeywordLevel, Parent), where ProductKey
is the surrogate key of the keyword, Keyword is a word in the document,
KeywordLevel is the level of Keyword in the hierarchy, and Parent is the
parent of the keyword in the hierarchy. For example, a hierarchy of keywords
such as TV → Appliance can be represented in this dimension, as sketched
below.
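To make the schema concrete, the following Python sketch materializes the Product dimension for this small hierarchy and derives fact tuples pairing document identifiers with occurrence counts of product and city keyword combinations. The sample documents, the city keywords, and the counting rule (the minimum of the two keyword frequencies, as a proxy for how often the combination appears) are assumptions made for illustration, not part of Tseng and Chou's proposal.

from itertools import product

# Product dimension with schema (ProductKey, Keyword, KeywordLevel,
# Parent): Appliance (level 1) is the parent of TV (level 2).
product_dim = [
    (1, "appliance", 1, None),
    (2, "tv",        2, 1),
]
city_keywords = ["paris", "london"]

documents = {
    10: "the tv market in paris is growing",
    20: "appliance sales in paris and london",
}

# Fact tuples: (DocId, ProductKey, City, Count) for each product/city
# keyword combination that appears in a document.
facts = []
for doc_id, text in documents.items():
    words = text.split()
    for (pkey, kw, _, _), city in product(product_dim, city_keywords):
        if kw in words and city in words:
            facts.append((doc_id, pkey, city,
                          min(words.count(kw), words.count(city))))

for fact in facts:
    print(fact)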