Using Mahout to Classify Text
Mahout is distributed as a collection of Java libraries that are generally used with the
Hadoop platform. However, in addition to providing Java libraries, Mahout comes
with prebuilt components that can be run from the command line. To illustrate how to
use Mahout together with Hadoop to solve a machine learning task, we will take
advantage of some of these tools.
The Leipzig Corpora Collection[1] is an effort to provide randomly collected sentences
in a common format for multiple languages. The sentences are culled from either
random public Web sites or news sources. The collection is available either as MySQL
databases or as text files. In this case, we will use the Leipzig collection's samples of
Wikipedia sentences in both French and English. Using these already categorized
sentences, we will build a training model that can determine the language of new
sentences by "learning" from the training data. In this example, we will create two
separate directories, each of which will contain either English or French sample
documents. Listing 10.1 shows an example of what our sample training data looks like.
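For example, a layout like the following would work; the directory and file names shown here are placeholders rather than names required by Mahout.
# Hypothetical layout: one subdirectory per language,
# each holding that language's sample sentences
> mkdir -p input/english input/french
> cp eng_wikipedia_sample.txt input/english/
> cp fra_wikipedia_sample.txt input/french/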
As with many distributed processing tools, it's possible to run Mahout locally
without Hadoop, which is useful for testing purposes. In order to run these examples
locally, set the MAHOUT_LOCAL environment variable to TRUE.
Listing 10.1 Using Mahout for Bayesian Classification: input files
# For testing purposes, use Mahout locally without Hadoop
> export MAHOUT_LOCAL=TRUE
# English sample sentences
It is large, and somewhat like the mid-boss of a video game.
Not all democratic elections involve political campaigning.
...
# French sample sentences
Le prince la rassura et il paya son dû.
Sa superficie est de 3 310 hectares.
...
Now that we have set up our raw training data, let's use this information to train a
classifier model. Once this training model is created, we will be able to use it to
classify test data. In order to build a classifier, we will need to tell Mahout a number
of things about our sample dataset.
First, we need to provide the location of the sample data. Next, we will place the
original text files into a format that Mahout can process. The useful seqdirectory tool
will take a list of files in a directory and create Hadoop sequence-format files that can
be used for the next steps in the classification flow (see Listing 10.2).
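Listing 10.2 shows this step in full; as a rough sketch, the invocation has the following shape. The input and output paths here are placeholders and should be adjusted to match your own layout.
# Convert the raw text directories into Hadoop sequence files
# (input and sequence_files are placeholder paths)
> mahout seqdirectory -i input -o sequence_files -c UTF-8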
[1] http://corpora.uni-leipzig.de/download.html