the frequency and commonality of strong words or n-grams across groups of documents. This can
reveal trends in the text, such as which topics are most important to the author(s), or what message
should be taken away when reading the documents.
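The idea can be sketched outside of RapidMiner as well. The following is a minimal, illustrative Python example (using scikit-learn's CountVectorizer as a stand-in for RapidMiner's tokenization operators; the sample documents and group names are invented for the sketch) that counts tokens and 2-grams in two groups of documents and compares their frequencies.

# A minimal sketch: compare token/n-gram frequencies across two groups of documents.
# The documents here are made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer

group_a = ["shipping was slow and the package arrived damaged",
           "slow shipping and poor packaging"]
group_b = ["great product and fast shipping",
           "fast delivery and great support"]

# Count single tokens and 2-grams, lowercasing everything (case-insensitive).
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
counts = vectorizer.fit_transform(group_a + group_b)

terms = vectorizer.get_feature_names_out()
freq_a = counts[:len(group_a)].sum(axis=0).A1   # totals within group A
freq_b = counts[len(group_a):].sum(axis=0).A1   # totals within group B

# Terms that occur far more often in one group hint at that group's dominant topics.
for term, fa, fb in sorted(zip(terms, freq_a, freq_b),
                           key=lambda t: -(t[1] + t[2]))[:10]:
    print(f"{term:20s}  groupA={fa}  groupB={fb}")

Terms that dominate one group but not the other point to the topics or messages that distinguish that group of documents.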
Further, once the documents' tokens are organized into attributes, the documents can be modeled,
just as other, more structured data sets can be modeled. Multiple documents can be handled by a
single Process Documents operator in RapidMiner, which applies the same set of tokenization
and token-handling operators to all documents at once through the sub-process stream. After a model has
been applied to a set of documents, additional documents can be added to the stream, passed
through the same document processor, and run through the model to yield better-trained, more
specific results.
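The same workflow can be approximated in code. Below is a minimal sketch in Python/scikit-learn, not RapidMiner itself: one text-processing step (tokenization and vectorization) is fit once, and both the original documents and any later documents are pushed through it before modeling. The sample documents and the choice of k-Means with two clusters are assumptions made for the illustration.

# A minimal sketch: fit one text processor, model the result, then send new
# documents through the same processor and model. All data is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["battery drains too fast",
             "screen cracked after one drop",
             "love the battery life",
             "excellent screen quality"]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
features = vectorizer.fit_transform(documents)       # tokenize and weight once

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(features)                                   # model the attribute table

# Later documents are run through the *same* processor, then through the model,
# so they are scored on the attributes the model was trained with.
new_docs = ["battery dies overnight", "screen looks fantastic"]
new_features = vectorizer.transform(new_docs)
print(model.predict(new_features))                    # cluster assignment per document

The key design point mirrors the RapidMiner stream: new documents must pass through the identical processing step, so their attributes line up with those the model expects.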
REVIEW QUESTIONS
1) What are some of the benefits of text mining as opposed to the other models you've
learned in this topic?
2) What are some ways that text-based data can be imported into RapidMiner?
3) What is a sub-process and when do you use one in RapidMiner?
4) Define the following terms: token, stem, n-gram, case-sensitive.
5) How does tokenization enable the application of data mining models to text-based data?
6) How do you view a k-Means cluster's details?
EXERCISE
For this chapter's exercise, you will mine text for common complaints against a company or
industry. Complete the following steps.