Table 4.1 An example of the most significant words from one topic from LDA

Word              Probability (%)
ows               8.0
nypd              2.0
occupywallstreet  1.0
park              1.0
occupi            1.0
protest           1.0
nyc               1.0
evict             1.0
citi              1.0
polic             1.0
...               ...
4.2.1 Finding Topics in the Text
The data we collect from Twitter quickly grows to immense proportions. In fact,
it grows so large that attempting to read each individual Tweet becomes a
hopeless cause. A more attainable goal is to get a high-level understanding of
what our users are talking about. One way to do this is to understand the topics
the users are discussing in their Tweets. In this section we discuss the automatic
discovery of topics in the text through "topic modeling" with latent Dirichlet
allocation (LDA), a popular topic modeling algorithm.
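To make this concrete, the following sketch shows one way such a model might be fit in Python using the gensim library; gensim is one popular implementation of LDA, not necessarily the tool used elsewhere in this book. The sample Tweets, the whitespace tokenization, and the choice of ten topics are illustrative assumptions.

from gensim import corpora, models

# A handful of illustrative Tweets; in practice this is the full collection.
tweets = [
    "nypd evicts protesters from zuccotti park #ows",
    "march through nyc today #occupywallstreet",
]

# Tokenize each Tweet into lowercase words.
texts = [tweet.lower().split() for tweet in tweets]

# Map each word to an integer id, then represent each Tweet as a bag of words.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit LDA; the number of topics is chosen by the user (10 here is arbitrary).
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10)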
4.2.1.1 What Is a Topic?
Every topic in LDA is a distribution over words. Each topic assigns to every word
in the corpus a probability of that word belonging to the topic. So, while every
topic contains the same words, the weights those words are given differ between
topics. For example, we may find a topic related to sports that is made up of 40%
"basketball", 35% "football", 15% "baseball", ..., 0.02% "congress", and 0.01%
"Obama". Another topic related to politics could be made up of 35% "congress",
30% "Obama", ..., 1% "football", 0.1% "baseball", and 0.1% "basketball". Because
each topic contains every word, we will only view the top words when inspecting a
topic.
LDA finds the most probable words for a topic; associating each topic with a
theme is left to the user. An example topic from the Occupy Wall Street data is
shown in Table 4.1.
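A listing like Table 4.1 can be read directly off a fitted model. The sketch below assumes lda is the gensim model from the previous sketch; show_topic returns the highest-probability (word, probability) pairs for one topic.

# Print the ten most probable words for topic 0, with probabilities as percentages.
for word, prob in lda.show_topic(0, topn=10):
    print(f"{word}\t{prob * 100:.1f}%")

Labeling the resulting topic (for instance, as the Occupy Wall Street eviction) remains a manual step, as noted above.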