Using Mahout to Classify Text
Mahout is distributed as a collection of Java libraries that are generally used with the
Hadoop platform. However, in addition to providing Java libraries, Mahout comes
with prebuilt components that can be run from the command line. To illustrate how to
use Mahout together with Hadoop to solve a machine learning task, we will take
advantage of some of these tools.
The Leipzig Corpora Collection[1] is an effort to provide randomly collected sentences
in a common format for multiple languages. The sentences are culled from either
random public Web sites or news sources. The collection is available either as MySQL
databases or as text files. In this case, we will use the Leipzig collection's samples of
Wikipedia sentences in both French and English. Using these already categorized
sentences, we will build a training model that can determine the language of new
sentences by "learning" from the training data. In this example, we will create two
separate directories, each of which will contain either English or French sample
documents. Listing 10.1 shows an example of what our sample training data looks like.
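For example, a layout like the following would work; the directory and file names shown here are placeholders rather than names required by Mahout.
# Hypothetical layout: one subdirectory per language,
# each holding that language's sample sentences
> mkdir -p input/english input/french
> cp eng_wikipedia_sample.txt input/english/
> cp fra_wikipedia_sample.txt input/french/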
As with many distributed processing tools, it's possible to run Mahout locally
without Hadoop, which is useful for testing purposes. In order to run these examples
locally, set the MAHOUT_LOCAL environment variable to TRUE.
Listing 10.1 Using Mahout for Bayesian Classification: input files
# For testing purposes, use Mahout locally without Hadoop
> export MAHOUT_LOCAL=TRUE
# English sample sentences
It is large, and somewhat like the mid-boss of a video game.
Not all democratic elections involve political campaigning.
...
# French sample sentences
Le prince la rassura et il paya son dû.
Sa superficie est de 3 310 hectares.
...
Now that we have set up our raw training data, let's use this information to train a
classifier model. Once this training model is created, we will be able to use it to
classify test data. In order to build a classifier, we will need to tell Mahout a number
of things about our sample dataset.
First, we need to provide the location of the sample data. Next, we will place the
original text files into a format that Mahout can process. The useful seqdirectory tool
will take a list of files in a directory and create Hadoop sequence-format files that can
be used for the next steps in the classification flow (see Listing 10.2).
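Listing 10.2 shows this step in full; as a rough sketch, the invocation has the following shape. The input and output paths here are placeholders and should be adjusted to match your own layout.
# Convert the raw text directories into Hadoop sequence files
# (input and sequence_files are placeholder paths)
> mahout seqdirectory -i input -o sequence_files -c UTF-8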
[1] http://corpora.uni-leipzig.de/download.html