Exercises
1. What are the main challenges of text analysis?
2. What is a corpus?
3. What are common words (such as a, and, of) called?
4. Why can't we use TF alone to measure the usefulness of the words?
5. What is a caveat of IDF? How does TFIDF address the problem?
6. Name three benefits of using TFIDF.
7. What methods can be used for sentiment analysis?
8. What is the definition of topic in topic models?
9. Explain the trade-offs for precision and recall.
10. Perform LDA topic modeling on the Reuters-21578 corpus using Python.
NLTK already includes the Reuters-21578 corpus. To import this corpus,
enter the following command at the Python prompt:
from nltk.corpus import reuters
LDA has already been implemented in several Python libraries, such as
gensim [45]. Either use one such library or implement your own LDA to
perform topic modeling on the Reuters-21578 corpus.
11. Choose a topic that interests you, such as a movie, a celebrity, or any
buzzword. Then collect 100 tweets related to this topic. Hand-tag them as
positive, neutral, or negative. Next, split them into 80 tweets as the training
set and the remaining 20 as the testing set. Run one or more classifiers over
these tweets to perform sentiment analysis. What are the precision and
recall of these classifiers? Which classifier performs better than the others?
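For the precision and recall questions in exercises 9 and 11, the following is a minimal sketch of how per-class precision and recall might be computed from hand-tagged labels and classifier output. The function name precision_recall and the five example labels are hypothetical and exist only for illustration; they are not data from the exercises, and a library routine could be used instead.

```python
def precision_recall(gold, predicted, label):
    """Return (precision, recall) for one class label.

    gold      -- hand-tagged labels, one per tweet
    predicted -- classifier output, in the same order as gold
    label     -- the class being scored, e.g. "positive"
    """
    pairs = list(zip(gold, predicted))
    # True positives: tagged as label and predicted as label.
    tp = sum(1 for g, p in pairs if g == label and p == label)
    # False positives: predicted as label but tagged otherwise.
    fp = sum(1 for g, p in pairs if g != label and p == label)
    # False negatives: tagged as label but predicted otherwise.
    fn = sum(1 for g, p in pairs if g == label and p != label)
    # Guard against division by zero when a class is never predicted or never tagged.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical hand-tagged labels and classifier predictions (illustration only).
gold = ["positive", "negative", "neutral", "positive", "negative"]
predicted = ["positive", "positive", "neutral", "negative", "negative"]

for label in ("positive", "neutral", "negative"):
    p, r = precision_recall(gold, predicted, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

With three classes, precision and recall are reported separately for each class; comparing classifiers on the 20-tweet testing set then means comparing these per-class figures or an average of them.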