Building the Healthcare Information Factory - Data Warehousing in the Age of Big Data

● Multiple language translation: the ability to read text in one language, reference it, and create a database in another language.

● Homographic resolution: the ability to take words that are spelled the same way and expand them into a properly phrased set of words based on the context of the document.

● Variable pattern recognition: the ability to read a word and recognize its word type from the structure of the word alone, such as an email address.

● Variable symbol recognition: the ability to recognize and index certain words inside a document.

● Semi-structured data recognition.
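Variable pattern recognition, as described in the list above, can be sketched with regular expressions. This is only an illustration under stated assumptions: the pattern names, the regular expressions, and the helper functions classify_token and classify_text are hypothetical and are not taken from any particular textual integration product.

```python
import re

# Hypothetical patterns chosen for illustration only; a production tool
# would need far more robust rules than these.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    "us_phone": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def classify_token(token: str) -> str:
    """Return the name of the first pattern that matches the whole token."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(token):
            return name
    return "word"  # no structural pattern recognized


def classify_text(text: str) -> list[tuple[str, str]]:
    """Split on whitespace, strip surrounding punctuation, classify each token."""
    tokens = (raw.strip(".,;:!?") for raw in text.split())
    return [(token, classify_token(token)) for token in tokens if token]


if __name__ == "__main__":
    for token, kind in classify_text("Contact jane.doe@example.org on 2013-04-01."):
        print(f"{token!r}: {kind}")
```

The word type is decided purely from the shape of the token, without any dictionary lookup, which is the point of recognizing a word "merely based on the structure of the word."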

Categorizations of data

An important component of textual integration is the ability to recognize external categorizations of data. Once external categorizations of data are recognized, the text can be understood in an abstract manner. In many ways, abstracting raw text into external categorizations is the equivalent of modeling data. It is through external categorization of data that query tools are able to access and analyze textual data.

There are many ways to create external categorizations of data. One of the most important is to use taxonomies. Taxonomies can be built internally on a customized basis, or they can be obtained commercially (see Wand, Inc., http://www.wandinc.com/). A commercially available taxonomy is easy to use and is immediately available. In addition, there is a wide variety of commercially available taxonomies.

As a rule, the analyst building the data warehouse from textual data chooses only those taxonomies that are appropriate to the text. For example, if the text were focused on orthopedics, the analyst would not use an external taxonomy for obstetrics to categorize the text.
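Categorizing text against a chosen taxonomy can be sketched as a simple lookup. The tiny orthopedics taxonomy below and the function names build_lookup and categorize are hypothetical, invented for illustration; a commercial taxonomy would be far larger and would also handle multi-word terms.

```python
# Hypothetical mini-taxonomy for orthopedics (category -> terms).
ORTHOPEDICS_TAXONOMY = {
    "bone": ["femur", "tibia", "humerus"],
    "procedure": ["arthroscopy", "casting"],
    "injury": ["fracture", "sprain"],
}


def build_lookup(taxonomy: dict[str, list[str]]) -> dict[str, str]:
    """Invert category -> terms into term -> category for direct lookup."""
    return {term: category for category, terms in taxonomy.items() for term in terms}


def categorize(text: str, taxonomy: dict[str, list[str]]) -> list[tuple[int, str, str]]:
    """Return (word position, word, category) for every word found in the taxonomy."""
    lookup = build_lookup(taxonomy)
    rows = []
    for position, raw in enumerate(text.lower().split()):
        word = raw.strip(".,;:!?")
        if word in lookup:
            rows.append((position, word, lookup[word]))
    return rows


if __name__ == "__main__":
    note = "Fracture of the left femur, treated by casting."
    for row in categorize(note, ORTHOPEDICS_TAXONOMY):
        print(row)
```

Running the same note against an obstetrics taxonomy would return no rows, which illustrates why the analyst picks only taxonomies that fit the subject of the text.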

One way to look at the value of textual integration is this: if text is merely shuffled from a document to a database, no real textual integration has occurred and textual analytics cannot be done.

There are many benefits to doing the processing as described. The major benefit is that the text is integrated, and once the text is integrated, analytical processing can be done against it. From a mechanical standpoint, standard analytical tools can be used against the text once the text is placed into a standard database. Analytical tools such as SAS, Business Objects, Cognos, MicroStrategy, Crystal Reports, and others can be run against the textual data found in the textually based database. Figure B.16 shows that once integrated into a database, text can be analyzed by standard tool sets.
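As a rough sketch of what "placed into a standard database" can mean in practice, the example below loads a few integrated rows into an in-memory SQLite database and runs an ordinary SQL query against them. The table name text_index, its columns, and the sample rows are assumptions made for illustration; the source does not specify a schema.

```python
import sqlite3

# Hypothetical output of textual integration:
# (document id, word found in the text, external category)
ROWS = [
    ("note-001", "fracture", "injury"),
    ("note-001", "femur", "bone"),
    ("note-002", "sprain", "injury"),
    ("note-003", "fracture", "injury"),
]


def load(rows: list[tuple[str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory database and load the integrated rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE text_index (doc_id TEXT, word TEXT, category TEXT)")
    conn.executemany("INSERT INTO text_index VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def documents_per_category(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Count the distinct documents that mention each category."""
    query = """
        SELECT category, COUNT(DISTINCT doc_id) AS docs
        FROM text_index
        GROUP BY category
        ORDER BY category
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = load(ROWS)
    for category, docs in documents_per_category(connection):
        print(category, docs)
    connection.close()
```

Because the integrated text now sits in ordinary relational rows, any tool that speaks SQL can query it in the same way.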

Given the granular nature of the textual data and the wide diversity of its sources, it is possible to do analytical processing against the integrated text in a variety of ways. Figure B.17 shows that once the textual data has been collected and integrated, it can be used in many ways.

One analyst can look at the information that has been gathered and integrated from the perspective of cancer research. Another analyst can use the same data from the standpoint of heart research. Still another analyst can analyze it from the standpoint of geriatric research, and so forth. Figure B.17 shows that the textual analytical data that has been collected is very versatile.

An example

As an example of the research that can be done with the textual data, consider a research institution that has treated heart patients for 30 years. Doctors' notes have been collected for 30 years' time as
