Table 4.1 An example of the most significant words from one topic from LDA

Word              Probability (%)
ows               8.0
nypd              2.0
occupywallstreet  1.0
park              1.0
occupi            1.0
protest           1.0
nyc               1.0
evict             1.0
citi              1.0
polic             1.0
...               ...
4.2.1 Finding Topics in the Text
The data we collect from Twitter quickly grows to immense proportions. In fact,
it grows so large that attempting to read each individual Tweet becomes a
hopeless cause. A more attainable goal is to get a high-level understanding of
what our users are talking about. One way to do this is to understand the topics
the users are discussing in their Tweets. In this section we discuss the automatic
discovery of topics in the text through "topic modeling" with latent Dirichlet
allocation (LDA), a popular topic modeling algorithm.
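To make this concrete, the following sketch shows one way such a model might be fit in Python using the gensim library; gensim is one popular implementation of LDA, not necessarily the tool used elsewhere in this book. The sample Tweets, the whitespace tokenization, and the choice of ten topics are illustrative assumptions.

from gensim import corpora, models

# A handful of illustrative Tweets; in practice this is the full collection.
tweets = [
    "nypd evicts protesters from zuccotti park #ows",
    "march through nyc today #occupywallstreet",
]

# Tokenize each Tweet into lowercase words.
texts = [tweet.lower().split() for tweet in tweets]

# Map each word to an integer id, then represent each Tweet as a bag of words.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit LDA; the number of topics is chosen by the user (10 here is arbitrary).
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10)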
4.2.1.1 What Is a Topic?
Every topic in LDA is a distribution over words. Each topic assigns to every word
in the corpus a probability of that word belonging to the topic. So, while every
topic contains the same words, the weights those words are given differ between
topics. For example, we may find a topic related to sports that is made up of 40%
"basketball", 35% "football", 15% "baseball", ..., 0.02% "congress", and 0.01%
"Obama". Another topic related to politics could be made up of 35% "congress",
30% "Obama", ..., 1% "football", 0.1% "baseball", and 0.1% "basketball". Because
each topic contains every word, we will only view the top words when inspecting a
topic.
LDA finds the most probable words for a topic; associating each topic with a
theme is left to the user. An example topic from the Occupy Wall Street data is
shown in Table 4.1.
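A listing like Table 4.1 can be read directly off a fitted model. The sketch below assumes lda is the gensim model from the previous sketch; show_topic returns the highest-probability (word, probability) pairs for one topic.

# Print the ten most probable words for topic 0, with probabilities as percentages.
for word, prob in lda.show_topic(0, topn=10):
    print(f"{word}\t{prob * 100:.1f}%")

Labeling the resulting topic (for instance, as the Occupy Wall Street eviction) remains a manual step, as noted above.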