Multiple language translation: the ability to read and reference text in one language and create
a database in another language.
Homographic resolution: the ability to take words that are spelled the same way and expand each
one into the properly phrased set of words that fits the context of the document.
Variable pattern recognition: the ability to read a word and recognize its word type based merely
on the structure of the word, such as an email address.
Variable symbol recognition: the ability to recognize and index certain words inside a document.
Semi-structured data recognition.
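As an illustration of variable pattern recognition, the sketch below recognizes an email address purely from the structure of the word. The regular expression is a deliberately simplified assumption, and the names EMAIL_PATTERN, classify_token, and tag_tokens are hypothetical; a production textual integration tool would recognize many more patterns.

```python
import re

# Simplified, assumed pattern for an email address: local part, "@", domain, dot, suffix.
# Real email syntax is broader than this.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def classify_token(token):
    """Return "email" if the whole token has the structure of an email address, else None."""
    if EMAIL_PATTERN.fullmatch(token):
        return "email"
    return None


def tag_tokens(text):
    """Split text on whitespace, trim surrounding punctuation, and tag each token."""
    tagged = []
    for raw in text.split():
        token = raw.strip(".,;:()<>")
        tagged.append((token, classify_token(token)))
    return tagged
```

Calling tag_tokens on a sentence returns each word paired with its recognized word type, so the recognized words could then be indexed in a database alongside the document they came from.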
Categorizations of data
An important component of textual integration is the ability to recognize external categorizations of
data. Once external categorizations of data are recognized, the text can be understood in an abstract
manner. In many ways the abstraction of raw text into external categorizations is the equivalent of
modeling data. It is through external categorization of data that query tools have the ability to access
and analyze textual data.
There are many ways to create external categorizations of data. One of the most important is the
taxonomy. Taxonomies can be built internally on a customized basis, or commercially available
taxonomies can be used (see Wand, Inc., http://www.wand-
inc.com/). A commercially available taxonomy is easy to use and is immediately available. In
addition, there is a wide variety of commercially available taxonomies.
As a rule, the analyst building the data warehouse from textual data chooses only those taxonomies
that are appropriate to the text. For example, if the text were focused on orthopedics, the analyst
would not choose an external taxonomy for obstetrics to categorize the text.
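A minimal sketch of how a taxonomy can turn raw text into external categorizations follows. The tiny orthopedics taxonomy and the function names are hypothetical and exist only for illustration; a commercial or in-house taxonomy would be far larger and typically hierarchical.

```python
# Hypothetical miniature taxonomy: category -> terms belonging to that category.
ORTHOPEDICS_TAXONOMY = {
    "bone": {"femur", "tibia", "humerus"},
    "injury": {"fracture", "sprain", "dislocation"},
    "procedure": {"arthroscopy", "casting"},
}


def build_term_index(taxonomy):
    """Invert the taxonomy so that each term maps to its category."""
    return {term: category for category, terms in taxonomy.items() for term in terms}


def categorize(text, taxonomy):
    """Return (word position, word, category) for every word found in the taxonomy."""
    index = build_term_index(taxonomy)
    rows = []
    for position, raw in enumerate(text.lower().split()):
        word = raw.strip(".,;:()")
        if word in index:
            rows.append((position, word, index[word]))
    return rows
```

Each returned row pairs a word from the text with its external category, which is the kind of abstraction a query tool can work with. Swapping in a different taxonomy changes which words are recognized, which is why the taxonomy has to match the subject of the text.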
One way to look at the value of textual integration is this: if text is merely shuffled from a
document to a database, no real textual integration has occurred and textual analytics cannot be done.
There are many benefits to doing the processing as described. The major benefit is that the text is
integrated, and once the text is integrated, analytical processing can be done against it. From a
mechanical standpoint, standard analytical tools can be used against the text once the text is placed
into a standard database. Analytical tools such as SAS, Business Objects, Cognos, MicroStrategy,
Crystal Reports, and others can be run against the textual data found in the textually based database.
Figure B.16 shows that once integrated into a database, text can be analyzed by standard tool sets.
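To make the mechanical point concrete, the sketch below loads categorized words into a standard relational table and runs an ordinary SQL query against them. It uses Python's built-in sqlite3 module as a stand-in for a standard database; the table name text_index, its columns, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical sample of integrated text: (document id, word, external category).
SAMPLE_ROWS = [
    ("note-1", "fracture", "injury"),
    ("note-1", "tibia", "bone"),
    ("note-2", "sprain", "injury"),
    ("note-3", "femur", "bone"),
    ("note-3", "fracture", "injury"),
]


def load_rows(rows):
    """Create an in-memory database and load the integrated text into one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE text_index (doc_id TEXT, word TEXT, category TEXT)")
    conn.executemany("INSERT INTO text_index VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def documents_per_category(conn):
    """Count how many distinct documents mention each category."""
    cursor = conn.execute(
        "SELECT category, COUNT(DISTINCT doc_id) FROM text_index "
        "GROUP BY category ORDER BY category"
    )
    return cursor.fetchall()
```

Because the text now sits in ordinary rows and columns, any tool that can issue SQL can analyze it in the same way as this query does.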
Given the granular nature of the textual data and the wide diversity of its sources, it is possible to
do analytical processing against the integrated text in a variety of ways. Figure B.17 shows that once
the textual data has been collected and integrated, the integrated text can be used in many ways.
One analyst can look at the information that has been gathered and integrated from the perspective
of cancer research. Another analyst can use the same data from the standpoint of heart research. Still
another analyst can analyze the same data from the standpoint of geriatric research, and so forth.
Figure B.17 shows that the textual analytical data that has been collected is very versatile.
An example
As an example of the research that can be done with the textual data, consider a research institution
that has treated heart patients for 30 years. Doctors' notes have been collected for 30 years' time as