Recently, the combination of different learning methods is increasingly being
used. A survey of these techniques can be found in [54].
An example of a rule-based information extraction system is SystemT-
IE [107, 172], the result of research carried out at the IBM Almaden Research
Center. This system is now included as part of the InfoSphere BigInsights
suite of products. SystemT-IE comes with an SQL-like declarative language
called the Annotation Query Language (AQL) [30] for specifying text analytics
extraction programs (called extractors) with rule semantics. Extractors
obtain structured information from unstructured or semistructured text. For
this, AQL extends SQL with the EXTRACT statement. Data in AQL are
stored in relations where all tuples have the same schema, analogous to
SQL relational tables. In addition, AQL includes statements for creating
tables, views, user-defined functions, and dictionaries. However, AQL does
not support advanced SQL features like correlated subqueries and recursive
queries. After the extractors are generated, an optimizer produces an efficient
execution plan for them in the form of an annotation operator graph. The
plan is finally executed by a runtime engine, which takes advantage of parallel
architectures.
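AQL syntax itself is beyond the scope of this overview, but the following Python sketch illustrates the underlying idea of a rule-based extractor: a rule (here a regular expression) is applied to raw text, and every match becomes a tuple in a relation whose tuples all share the same schema. The rule, the schema, and the sample document are assumptions made for illustration; this is not SystemT code.

import re
from dataclasses import dataclass
from typing import List

@dataclass
class PhoneMention:            # fixed schema shared by all tuples
    doc_id: int
    text: str                  # matched span
    begin: int                 # character offsets, since extracted
    end: int                   # annotations typically carry spans

# The extraction rule: a regular expression for US-style phone numbers
# (a hypothetical example, not a rule from SystemT-IE).
PHONE_RULE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def extract_phones(doc_id: int, text: str) -> List[PhoneMention]:
    """Apply the rule to the text and emit one tuple per match."""
    return [PhoneMention(doc_id, m.group(), m.start(), m.end())
            for m in PHONE_RULE.finditer(text)]

if __name__ == "__main__":
    doc = "Call 555-123-4567 or 555-987-6543 for support."
    for mention in extract_phones(1, doc):
        print(mention)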
Information extraction techniques like the one introduced above can be
used to populate a data warehouse for multidimensional text analysis using
OLAP. Typically, an ETL process will extract textual data from various
sources and, after cleansing and transformation, will load such data into
the warehouse. The phases of this process will include textual data and
metadata extraction from documents; transformation of the extracted data
through classic text retrieval techniques like cleaning texts, stemming, term
weighting, language modeling, and so on; and loading the data resulting from
the transformation phase into the data warehouse. In a recent work [92], W.H.
Inmon details, at a high level of abstraction, the tasks that an ETL process
for text data warehouses must perform.
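As a rough illustration of these phases, the following Python sketch extracts text from sample documents, transforms it with elementary retrieval techniques (cleaning, a deliberately naive suffix-stripping stemmer standing in for a real one such as Porter's, and term-frequency weighting), and loads the result into a warehouse table. The table schema, the stopword list, and the documents are assumptions made for the example.

import re
import sqlite3

STOPWORDS = {"the", "a", "of", "and", "is"}

def clean(text):
    """Cleaning: lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(term):
    """Naive suffix stripping; a real pipeline would use e.g. Porter."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def transform(doc_id, text):
    """Produce (doc_id, term, weight) tuples; weight = term frequency."""
    weights = {}
    for term in clean(text):
        if term not in STOPWORDS:
            t = stem(term)
            weights[t] = weights.get(t, 0) + 1
    return [(doc_id, t, w) for t, w in weights.items()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TermFact (DocId INT, Term TEXT, Weight REAL)")

docs = {1: "Appliances such as TVs are selling well.",
        2: "The TV is the best-selling appliance."}
for doc_id, text in docs.items():                 # extract + transform
    conn.executemany("INSERT INTO TermFact VALUES (?, ?, ?)",
                     transform(doc_id, text))     # load

for row in conn.execute("SELECT * FROM TermFact ORDER BY DocId, Term"):
    print(row)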
We next discuss some proposals in the field of text data warehouses.
In one of the earliest works in the field, Tseng and Chou [206] propose
a document warehouse for multidimensional analysis of textual data. Their
approach is to combine text processing with numeric OLAP processing. For
this, they organize unstructured documents into structured data consisting
of dimensions, hierarchies, facts, and measures. Dimensions are composed of
a hierarchy of keywords referring to a concept, which are obtained using
text mining tools. Facts include the identifiers of the documents under
analysis and the number of times that a combination of keywords appears
in such documents. For example, suppose we are analyzing documents in
order to discover products and cities appearing together. A hierarchy of
keywords referring to products can be represented in a Product dimension
with schema (ProductKey, Keyword, KeywordLevel, Parent), where ProductKey
is the surrogate key of the keyword, Keyword is a word in the document,
KeywordLevel is the level of Keyword in the hierarchy, and Parent is the
parent of the keyword in the hierarchy. For example, a hierarchy of keywords
such as TV → Appliance can be represented in this dimension, as sketched
below.
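To make the schema concrete, the following Python sketch materializes the Product dimension for this small hierarchy and derives fact tuples pairing document identifiers with occurrence counts of product and city keyword combinations. The sample documents, the city keywords, and the counting rule (the minimum of the two keyword frequencies, as a proxy for how often the combination appears) are assumptions made for illustration, not part of Tseng and Chou's proposal.

from itertools import product

# Product dimension with schema (ProductKey, Keyword, KeywordLevel,
# Parent): Appliance (level 1) is the parent of TV (level 2).
product_dim = [
    (1, "appliance", 1, None),
    (2, "tv",        2, 1),
]
city_keywords = ["paris", "london"]

documents = {
    10: "the tv market in paris is growing",
    20: "appliance sales in paris and london",
}

# Fact tuples: (DocId, ProductKey, City, Count) for each product/city
# keyword combination that appears in a document.
facts = []
for doc_id, text in documents.items():
    words = text.split()
    for (pkey, kw, _, _), city in product(product_dim, city_keywords):
        if kw in words and city in words:
            facts.append((doc_id, pkey, city,
                          min(words.count(kw), words.count(city))))

for fact in facts:
    print(fact)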