Finding Sentences - Natural Language Processing with Java - page 117


Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
// Each sentence is returned as a list of tokens
for (List<HasWord> sentence : dp) {
    System.out.println(sentence);
}

On execution, we get the following output:

[When, determining, the, end, of, sentences, we, need, to, consider, several, factors, .]
[Sentences, may, end, with, exclamation, marks, !]
[Or, possibly, questions, marks, ?]
[Within, sentences, we, may, find, numbers, like, 3.14159, ,, abbreviations, such, as, found, in, Mr., Smith, ,, and, possibly, ellipses, either, within, a, sentence, ..., ,, or, at, the, end, of, a, sentence, ...]

By default, PTBTokenizer is used to tokenize the input. The setTokenizerFactory method can be used to specify a different tokenizer. There are several other methods that can be useful, as detailed in the following table:

Method                       Purpose
setElementDelimiter          Its argument specifies an XML element. Only the text inside those elements will be processed.
setSentenceDelimiter         The processor will assume that the string argument is the sentence delimiter.
setSentenceFinalPuncWords    Its string array argument specifies the tokens that mark the end of a sentence.
setKeepEmptySentences        When used with whitespace models, if its argument is true, empty sentences will be retained.
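To make the idea behind setSentenceFinalPuncWords more concrete, the following is a minimal standalone sketch of grouping tokens into sentences whenever a sentence-final token is seen. It uses only the Java standard library and is not how DocumentPreprocessor is implemented; the class name SentenceGrouper and the method group are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: a simplified stand-in, not the Stanford implementation.
final class SentenceGrouper {

    // Groups tokens into sentences, closing the current sentence after any
    // token contained in finalPuncWords (hypothetical helper).
    static List<List<String>> group(List<String> tokens, Set<String> finalPuncWords) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (finalPuncWords.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep trailing tokens that were never closed by a final token.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "Sentences", "may", "end", "with", "exclamation", "marks", "!",
            "Or", "possibly", "questions", "marks", "?");
        Set<String> finalPuncWords = new HashSet<>(Arrays.asList("!", "?"));
        for (List<String> sentence : group(tokens, finalPuncWords)) {
            System.out.println(sentence);
        }
    }
}
```

Changing the set passed as finalPuncWords changes where sentences are cut, which is the effect the string array argument of setSentenceFinalPuncWords is described as having.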

The class can process either plain text or XML documents.

To demonstrate how an XML file can be processed, we will create a simple XML file called XMLText.xml, containing the following data:

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl"?>

<document>
