Analyzing Twitter Data - Twitter Data Analytics - page 42


4.2.1.2 LDA Calculation with MALLET

To perform the LDA computation in Java, we use the MALLET library (footnote 3). Listing 4.4 shows the computation in MALLET. As we can see, most of the work is done for us; the real effort is in preprocessing the documents. To get the documents ready for LDA, we define a preprocessing pipeline that processes each document in the following steps:

1. Lowercase - Convert every word in the document to lowercase. “No more media blackout hiding #OCCUPYWALLSTREET! :)” becomes “no more media blackout hiding #occupywallstreet! :)”.

2. Tokenize - Convert the string to a list of tokens based on whitespace. This process also removes punctuation marks from the text. This becomes the list [no, more, media, blackout, hiding, #occupywallstreet].

3. Stopword Removal - Remove “stopwords”, words so common that their presence does not tell us anything about the dataset. The list becomes [no, media, blackout, hiding, #occupywallstreet].

4. Stemming - Reduce each word to its stem, removing any prefixes or suffixes. The list becomes [no, media, blackout, hide, #occupywallstreet].

5. Vectorization - Convert the sequence of words to a vector that, instead of containing the words, contains one number for each word in the vocabulary. The value at each index is the number of times the corresponding word appears in the document.
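The five steps above can be sketched in plain Java, without MALLET, to make the intermediate results concrete. This is a minimal illustration under stated assumptions, not the book's implementation: the class name `PreprocessSketch`, the stopword list, the vocabulary, and the one-entry stem table (a stand-in for a real stemmer such as Porter's) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class PreprocessSketch {
    // Hypothetical stopword list, for illustration only.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("more", "the", "a", "of"));

    // Toy stand-in for a stemmer: a lookup table, NOT the Porter algorithm.
    static final Map<String, String> STEMS =
            Collections.singletonMap("hiding", "hide");

    // Steps 1 and 2: lowercase, split on whitespace, trim punctuation.
    // A leading '#' is kept so hashtags survive (an assumption of this sketch).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String token = raw.replaceAll("^[^\\p{L}\\p{N}#]+|[^\\p{L}\\p{N}]+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 3: drop every token that appears in the stopword list.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Step 4: replace each token by its stem when the toy table knows one.
    static List<String> stem(List<String> tokens) {
        List<String> stemmed = new ArrayList<>();
        for (String t : tokens) {
            stemmed.add(STEMS.getOrDefault(t, t));
        }
        return stemmed;
    }

    // Step 5: count how often each vocabulary word occurs in the document.
    static int[] vectorize(List<String> tokens, List<String> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String t : tokens) {
            int index = vocabulary.indexOf(t);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String tweet = "No more media blackout hiding #OCCUPYWALLSTREET! :)";
        List<String> tokens = stem(removeStopwords(tokenize(tweet)));
        System.out.println(tokens);

        // Hypothetical six-word vocabulary; "police" does not occur in the tweet.
        List<String> vocabulary = Arrays.asList(
                "blackout", "hide", "media", "no", "#occupywallstreet", "police");
        System.out.println(Arrays.toString(vectorize(tokens, vocabulary)));
    }
}
```

With the example tweet, the token list after stemming is [no, media, blackout, hide, #occupywallstreet], and the count vector has a 1 for each of those words and a 0 for the unused vocabulary entry.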

Listing 4.4 LDA computation with MALLET

...
private static final String STOP_WORDS = "stopwords.txt";
private static final int ITERATIONS = 100;
private static final int THREADS = 4;
private static final int NUM_TOPICS = 25;
private static final int NUM_WORDS_TO_ANALYZE = 25;
...
// Lowercase, tokenize, remove stopwords, stem, and convert to features
pipeList.add((Pipe) new CharSequenceLowercase());
// A token must start and end with a letter and be at least three characters long
pipeList.add((Pipe) new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
pipeList.add((Pipe) new TokenSequenceRemoveStopwords(stopwords, "UTF-8", false, false, false));
pipeList.add((Pipe) new PorterStemmer());
pipeList.add((Pipe) new TokenSequence2FeatureSequence());
// Every instance added to this list is passed through the pipes in order
InstanceList instances = new InstanceList(new SerialPipes(pipeList));
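To see what the token pattern in the listing actually keeps, the following sketch applies the same regular expression to the lowercased example tweet using only `java.util.regex`. It does not call MALLET; it assumes the tokenizing pipe emits one token per regex match, and the class name `TokenPatternDemo` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPatternDemo {
    // Same pattern as in the listing: a letter, then one or more letters or
    // punctuation characters, then a final letter.
    static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Collect every non-overlapping match of the pattern, left to right.
    static List<String> matches(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [more, media, blackout, hiding, occupywallstreet]
        System.out.println(matches("no more media blackout hiding #occupywallstreet! :)"));
    }
}
```

Under that assumption, the pattern behaves slightly differently from the idealized lists in the enumeration: two-letter words such as "no" are too short to match, and the leading "#" is dropped because a token must begin with a letter.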

3 http://mallet.cs.umass.edu/
