This approach is useful for only the simplest problems.
When text is searched, a common technique is to use a data structure called an inverted
index. Building it involves tokenizing the text and identifying terms of interest along
with their positions. The terms and their positions are then stored in the inverted index.
When a search is made for a term, it is looked up in the inverted index and the positional
information is retrieved. This is faster than searching for the term in the document
each time it is needed. This data structure is used frequently in databases, information
retrieval systems, and search engines.
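The idea can be sketched in a few lines of plain Java. This is a minimal illustration, not a production design: the class name InvertedIndexSketch, the regex-based tokenization, and the use of token positions (rather than character offsets) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, minimal positional inverted index: maps each lowercased
// term to the token positions at which it occurs in one text.
class InvertedIndexSketch {
    private final Map<String, List<Integer>> index = new HashMap<>();

    // Build the index once: split on runs of non-letter/non-digit
    // characters and record the position of every token.
    InvertedIndexSketch(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            position++;
        }
    }

    // Look up a term without rescanning the text; absent terms give an empty list.
    List<Integer> positions(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch(
                "Tim was a good neighbor. Perhaps not as good as Bob.");
        System.out.println(idx.positions("good"));   // prints [3, 8]
        System.out.println(idx.positions("Boston")); // prints []
    }
}
```

The text is scanned once when the index is built. Each later query is a single map lookup, which is where the speed advantage over repeated scanning comes from.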
More sophisticated searches might involve responding to queries such as: "Where are
good restaurants in Boston?" To answer this query, we might need to perform entity
recognition/resolution to identify the significant terms in the query, perform semantic
analysis to determine the meaning of the query, search, and then rank candidate responses.
To illustrate the process of finding names, we use a combination of a tokenizer and the
OpenNLP TokenNameFinderModel class to find names in a text. Since this technique
may throw an IOException, we will use a try-catch block to handle it. Declare
this block and an array of strings holding the sentences, as shown here:
try {
    String[] sentences = {
        "Tim was a good neighbor. Perhaps not as good a Bob " +
        "Haywood, but still pretty good. Of course Mr. Adam " +
        "took the cake!"};
    // Insert code to find the names here
} catch (IOException ex) {
    ex.printStackTrace();
}
Before the sentences can be processed, we need to tokenize the text. Declare a variable
of the Tokenizer type and assign it the shared SimpleTokenizer instance, as shown here:
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
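To get a feel for the tokens a simple tokenizer of this kind produces, the following stdlib-only sketch splits text wherever the character class (letter, digit, other) changes and discards whitespace. It is a rough stand-in written for illustration; the class name CharClassTokenizerSketch and its rules are assumptions, and the actual output of OpenNLP's SimpleTokenizer may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a character-class tokenizer.
// This is NOT OpenNLP's implementation, only an approximation of the idea.
class CharClassTokenizerSketch {
    private enum Kind { WHITESPACE, LETTER, DIGIT, OTHER }

    private static Kind kindOf(char c) {
        if (Character.isWhitespace(c)) return Kind.WHITESPACE;
        if (Character.isLetter(c)) return Kind.LETTER;
        if (Character.isDigit(c)) return Kind.DIGIT;
        return Kind.OTHER;
    }

    // Emit a token whenever the character class changes, or when two
    // different punctuation characters are adjacent; skip whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Kind kind = kindOf(c);
            if (start >= 0) {
                char prev = text.charAt(i - 1);
                boolean boundary = kind != kindOf(prev)
                        || (kind == Kind.OTHER && c != prev);
                if (boundary) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            }
            if (start < 0 && kind != Kind.WHITESPACE) {
                start = i; // a new token begins here
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start)); // flush the final token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Of, course, Mr, ., Adam, took, the, cake, !]
        System.out.println(tokenize("Of course Mr. Adam took the cake!"));
    }
}
```

Note that punctuation becomes its own token, so "Mr." yields two tokens. A name finder that consumes such tokens sees the period separately from the title.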
We will need to use a model to detect sentences. This is needed to avoid grouping terms
that may span sentence boundaries. We will use the TokenNameFinderModel class