Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics

Exercises

1. What are the main challenges of text analysis?

2. What is a corpus?

3. What are common words (such as a, and, of) called?

4. Why can't we use TF alone to measure the usefulness of the words?

5. What is a caveat of IDF? How does TFIDF address the problem?

6. Name three benefits of using TFIDF.

7. What methods can be used for sentiment analysis?

8. What is the definition of topic in topic models?

9. Explain the trade-offs for precision and recall.

10. Perform LDA topic modeling on the Reuters-21578 corpus using Python. NLTK already includes the Reuters-21578 corpus. To import this corpus, enter the following command at the Python prompt:

from nltk.corpus import reuters

LDA has already been implemented by several Python libraries, such as gensim [45]. Either use one such library or implement your own LDA to perform topic modeling on the Reuters-21578 corpus.

11. Choose a topic that interests you, such as a movie, a celebrity, or any buzzword. Then collect 100 tweets related to this topic. Hand-tag them as positive, neutral, or negative. Next, split them into a training set of 80 tweets and a testing set of the remaining 20. Run one or more classifiers over these tweets to perform sentiment analysis. What are the precision and recall of these classifiers? Which classifier performs better than the others?
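For readers who take the "implement your own LDA" route in Exercise 10, the following is a minimal sketch of one common approach, collapsed Gibbs sampling, written in plain Python. The function name lda_gibbs, the hyperparameter defaults, and the toy documents are illustrative assumptions, not part of the exercise; a library such as gensim would normally be faster and more robust on the full corpus.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.

    alpha and beta are symmetric Dirichlet priors (assumed values).
    Returns the top words per topic and the per-document topic counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # count of word v in topic k
    topic_total = [0] * num_topics                               # tokens assigned to topic k
    assignments = []

    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    top_words = []
    for k in range(num_topics):
        ranked = sorted(range(vocab_size), key=lambda v: topic_word[k][v], reverse=True)
        top_words.append([vocab[v] for v in ranked[:5]])
    return top_words, doc_topic


# Invented toy documents, used only to show the call.
toy_docs = [
    "oil crude barrel opec oil price".split(),
    "crude oil price barrel export".split(),
    "wheat grain corn harvest wheat".split(),
    "corn grain harvest tonnes wheat".split(),
]
top_words, doc_topic = lda_gibbs(toy_docs, num_topics=2)
for k, words in enumerate(top_words):
    print(k, words)

# With NLTK installed and the corpus downloaded, the Reuters documents
# could be tokenized along these lines instead of using toy_docs:
# from nltk.corpus import reuters
# docs = [[w.lower() for w in reuters.words(f) if w.isalpha()]
#         for f in reuters.fileids()]
```

A pure-Python sampler like this is slow on all of Reuters-21578, so it may be worth trying it on a subset of documents first and removing stop words before sampling.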
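For Exercise 11, per-class precision and recall can be computed directly from the hand-tagged labels and a classifier's predictions on the 20 test tweets. The sketch below assumes the labels are plain strings; the function name precision_recall and the six example labels are made up for illustration and are not real results.

```python
def precision_recall(gold, predicted, labels):
    """Return {label: (precision, recall)} for each class label."""
    scores = {}
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)  # true positives
        fp = sum(1 for g, p in pairs if g != label and p == label)  # false positives
        fn = sum(1 for g, p in pairs if g == label and p != label)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores


# Invented hand tags and classifier output, for illustration only.
gold = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
predicted = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]

scores = precision_recall(gold, predicted, ["positive", "neutral", "negative"])
for label, (p, r) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Running the same function on each classifier's predictions gives a like-for-like comparison; with only 20 test tweets, small differences between classifiers should be read cautiously.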
