Differentiate Yourself with Text Analytics - Harness the Power of Big Data


goals scored from the UEFA Championships transcripts, and 30 of the 76

passages identified as goal commentaries weren't for goals at all, your

precision would be just over 60 percent (46 of 76). In summary, precision describes

what fraction of the identified passages are correctly identified.

• Recall A measure of completeness, the percentage of relevant results

that are retrieved from the text; in other words, are all the valid strings

from the original text showing up? For example, if you wanted to extract

all of the goals scored in the UEFA Championships from video, and got

60 out of 76 that would be found by a human expert, your recall would

be about 79 percent, because your application missed 21 percent of goals

scored. In summary, recall is how many matching passages are found

out of the total number of matching passages.
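
The arithmetic behind the two examples above can be sketched in a few lines of Python. This is only an illustration: the function names `precision` and `recall` and their parameters are hypothetical and not part of any toolkit described here, and the counts are taken directly from the UEFA examples in the text.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the identified passages that are correct."""
    identified = true_positives + false_positives
    if identified == 0:
        raise ValueError("no passages were identified")
    return true_positives / identified


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of the truly relevant passages that were found."""
    relevant = true_positives + false_negatives
    if relevant == 0:
        raise ValueError("there are no relevant passages")
    return true_positives / relevant


# Precision example: 76 passages identified, 30 of them were not goals.
goal_precision = precision(true_positives=76 - 30, false_positives=30)

# Recall example: 60 goals found out of the 76 a human expert would find.
goal_recall = recall(true_positives=60, false_negatives=76 - 60)

print(f"precision: {goal_precision:.1%}")  # precision: 60.5%
print(f"recall:    {goal_recall:.1%}")     # recall:    78.9%
```

Note that the two examples use different counts, so the two figures are not measured on the same run; in practice an analyst would compute both from one set of true positives, false positives, and false negatives.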

As analysts develop their extractors and applications, they iteratively

make refinements to tune their precision and recall rates. A great analogy is

an avalanche. If the avalanche didn't pick up speed and more snow as it tumbles

down a mountain slope, it wouldn't have much impact. The development

of extractors is really about adding more rules and knowledge to the

extractor itself; in short, it's about getting more powerful with each iteration.

We've found that most marketplace approaches to text analytics present

challenges for analysts, because they tend to perform poorly (in terms of both

accuracy and speed) and they are difficult to build or modify. These approaches

flow the text forward through a system of extractors and filters, with no

optimization. This technique is inflexible and inefficient, often resulting in

redundant processing, because extractors applied later in the workflow

might do work that had already been completed earlier. From what we can

tell, today's text toolkits are not only inflexible and inefficient, they're also

limited in their expressiveness (specifically, the degree of granularity that's

possible with their queries), which results in analysts having to develop custom

code. This, in turn, leads to more delays, complexity, and difficulty in

refining the accuracy of your result set (precision and recall).

The Annotated Query Language to the Rescue!

To meet these challenges, the IBM Big Data platform provides the Advanced

Text Analytics Toolkit, designed specifically to deal with the challenges inherent

in Big Data. This toolkit (originally code-named SystemT) has been under
