4.2.1.2 LDA Calculation with MALLET
To perform the LDA computation in Java, we use the MALLET library (footnote 3).
Listing 4.4 shows the computation in MALLET. As we can see, most of the work
is done for us; the real effort is in preprocessing the documents. To get the
documents ready for LDA, we define a preprocessing pipeline that processes each
document. The pipeline consists of the following steps:
1. Lowercase - Convert every word in the document to lowercase. “No more media
blackout hiding #OCCUPYWALLSTREET! :)” becomes “no more media blackout
hiding #occupywallstreet! :)”.
2. Tokenize - Convert the string to a list of tokens based on whitespace. This
process also removes punctuation marks from the text. This becomes the list
[no, more, media, blackout, hiding, #occupywallstreet].
3. Stopword Removal - Remove “stopwords”, words so common that their
presence does not tell us anything about the dataset. The list becomes
[no, media, blackout, hiding, #occupywallstreet].
4. Stemming - Reduce each word to its stem, removing any prefixes or suffixes.
The list becomes [no, media, blackout, hide, #occupywallstreet].
5. Vectorization - Convert the sequence of words to a vector that, instead of
containing the words, contains one number for each word in the vocabulary.
The value at each index is the number of times the corresponding word
appears in the document.
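The five steps above can be sketched in plain Java, without MALLET, to make the data flow concrete. This is a minimal illustration under stated assumptions, not the book's implementation: the class PreprocessSketch, the four-word stopword set, the six-word vocabulary, and the toy stemmer (which handles only the "-ing" suffix) are all hypothetical stand-ins chosen so that the example tweet follows the walk-through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PreprocessSketch {
    // Tiny illustrative stopword list (hypothetical; not the book's stopwords.txt).
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("more", "the", "a", "of"));

    // Steps 1-2: lowercase, split on whitespace, trim punctuation from token edges.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Keep a leading '#' so hashtags survive; drop other edge punctuation.
            String token = raw.replaceAll("^[^\\p{L}#]+|[^\\p{L}]+$", "");
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    static boolean isVowel(char c) { return "aeiou".indexOf(c) >= 0; }

    // Step 4: toy stemmer handling only "-ing" (a stand-in for a real Porter stemmer).
    static String stem(String word) {
        if (word.length() <= 4 || !word.endsWith("ing")) return word;
        String base = word.substring(0, word.length() - 3);
        boolean shortCvc = base.length() == 3 && !isVowel(base.charAt(0))
                && isVowel(base.charAt(1)) && !isVowel(base.charAt(2));
        return shortCvc ? base + "e" : base;  // "hiding" -> "hid" -> "hide"
    }

    // Steps 1-4 chained together.
    static List<String> preprocess(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOPWORDS.contains(token)) out.add(stem(token));  // step 3, then step 4
        }
        return out;
    }

    // Step 5: count how often each vocabulary word occurs in the document.
    static int[] vectorize(List<String> tokens, List<String> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String token : tokens) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) counts[index]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> stems = preprocess("No more media blackout hiding #OCCUPYWALLSTREET! :)");
        List<String> vocabulary = Arrays.asList(
                "no", "media", "blackout", "hide", "#occupywallstreet", "protest");
        System.out.println(stems);
        System.out.println(Arrays.toString(vectorize(stems, vocabulary)));
    }
}
```

Run on the example tweet, the sketch prints [no, media, blackout, hide, #occupywallstreet] followed by [1, 1, 1, 1, 1, 0]; the trailing 0 is the count for the vocabulary word "protest", which does not occur in the tweet.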
Listing 4.4 LDA computation with MALLET
...
private static final String STOP_WORDS = "stopwords.txt";
private static final int ITERATIONS = 100;
private static final int THREADS = 4;
private static final int NUM_TOPICS = 25;
private static final int NUM_WORDS_TO_ANALYZE = 25;
...
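// Lowercase, tokenize, remove stopwords, stem, and convert to features.
pipeList.add(new CharSequenceLowercase());
// A token starts and ends with a letter; letters and punctuation may appear in between.
pipeList.add(new CharSequence2TokenSequence(
        Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
// stopwords is not declared in the lines shown; this MALLET constructor
// expects a java.io.File pointing at the stopword list.
pipeList.add(new TokenSequenceRemoveStopwords(
        stopwords, "UTF-8", false, false, false));
// PorterStemmer does not appear to be one of MALLET's bundled pipes;
// presumably it is a custom Pipe subclass defined elsewhere.
pipeList.add(new PorterStemmer());
pipeList.add(new TokenSequence2FeatureSequence());
InstanceList instances = new InstanceList(new SerialPipes(pipeList));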
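To see what the token pattern in Listing 4.4 actually accepts, the following sketch applies the same regular expression with java.util.regex alone. It assumes that CharSequence2TokenSequence emits each successive match of the pattern as a token; the class name TokenPatternSketch and the helper matches are hypothetical. Because the pattern requires a letter at both ends and at least one character in between, tokens shorter than three characters and a leading "#" are not matched, so the result differs slightly from the idealized walk-through of the preprocessing steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPatternSketch {
    // Same expression as the one passed to CharSequence2TokenSequence in Listing 4.4.
    static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Lowercase the text, then collect every successive match of the pattern.
    static List<String> matches(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [more, media, blackout, hiding, occupywallstreet]
        System.out.println(matches("No more media blackout hiding #OCCUPYWALLSTREET! :)"));
    }
}
```

The two-letter word "no" is dropped and the hashtag loses its "#", while punctuation inside a word (as in "don't") is kept as part of the token.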
3 http://mallet.cs.umass.edu/
 