Java Reference
In-Depth Information
dog A pet dog taking part in Christmas traditions ...
dog The majority of contemporary people with dogs describe
their ...
dog Another study of dogs' roles in families showed many
dogs have ...
dog According to statistics published by the American Pet
Products ...
dog The latest study using Magnetic resonance imaging (MRI)
...
cat Cats are common pets in Europe and North America, and
their ...
cat Although cat ownership has commonly been associated ...
cat The concept of a cat breed appeared in Britain during
...
cat Cats come in a variety of colors and patterns. These
are physical ...
cat A natural behavior in cats is to hook their front claws
periodically ...
cat Although scratching can serve cats to keep their claws
from growing ...
When creating training data, it is important to use a large enough sample size. The data we
used is not sufficient for some analysis. However, as we will see, it does a pretty good job
of identifying the categories correctly.
The
DoccatModel
class supports categorization and classification of text. A model is
trained using the
train
method based on annotated text. The
train
method uses a
string denoting the language and an
ObjectStream<DocumentSample>
instance
holding the training data. The
DocumentSample
instance holds the annotated text and
its category.
In the following example, the
en-animal.train
file is used to train the model. Its in-
put stream is used to create a
PlainTextByLineStream
instance, which is then con-
verted to an
ObjectStream<DocumentSample>
instance. The
train
method is
then applied. The code is enclosed in a try-with-resources block to handle exceptions. We
also created an output stream that we will use to persist the model:
DoccatModel model = null;
try (InputStream dataIn =