the frequency and commonality of strong words or n-grams across groups of documents. This can
reveal trends in the text, such as which topics are most important to the author(s), or what message
should be taken away when reading the documents.
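The idea can be sketched outside of RapidMiner as well. The following is a minimal, illustrative Python example (using scikit-learn's CountVectorizer as a stand-in for RapidMiner's tokenization operators; the sample documents and group names are invented for the sketch) that counts tokens and 2-grams in two groups of documents and compares their frequencies.

# A minimal sketch: compare token/n-gram frequencies across two groups of documents.
# The documents here are made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer

group_a = ["shipping was slow and the package arrived damaged",
           "slow shipping and poor packaging"]
group_b = ["great product and fast shipping",
           "fast delivery and great support"]

# Count single tokens and 2-grams, lowercasing everything (case-insensitive).
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
counts = vectorizer.fit_transform(group_a + group_b)

terms = vectorizer.get_feature_names_out()
freq_a = counts[:len(group_a)].sum(axis=0).A1   # totals within group A
freq_b = counts[len(group_a):].sum(axis=0).A1   # totals within group B

# Terms that occur far more often in one group hint at that group's dominant topics.
for term, fa, fb in sorted(zip(terms, freq_a, freq_b),
                           key=lambda t: -(t[1] + t[2]))[:10]:
    print(f"{term:20s}  groupA={fa}  groupB={fb}")

Terms that dominate one group but not the other point to the topics or messages that distinguish that group of documents.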
Further, once the documents' tokens are organized into attributes, the documents can be modeled,
just as other, more structured data sets can be modeled. Multiple documents can be handled by a
single Process Documents operator in RapidMiner, which applies the same set of tokenization
and token-handling operators to all documents at once through the sub-process stream. After a model has
been applied to a set of documents, additional documents can be added to the stream, passed
through the same document processor, and run through the model to yield better-trained, more
specific results.
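The same workflow can be approximated in code. Below is a minimal sketch in Python/scikit-learn, not RapidMiner itself: one text-processing step (tokenization and vectorization) is fit once, and both the original documents and any later documents are pushed through it before modeling. The sample documents and the choice of k-Means with two clusters are assumptions made for the illustration.

# A minimal sketch: fit one text processor, model the result, then send new
# documents through the same processor and model. All data is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["battery drains too fast",
             "screen cracked after one drop",
             "love the battery life",
             "excellent screen quality"]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
features = vectorizer.fit_transform(documents)       # tokenize and weight once

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(features)                                   # model the attribute table

# Later documents are run through the *same* processor, then through the model,
# so they are scored on the attributes the model was trained with.
new_docs = ["battery dies overnight", "screen looks fantastic"]
new_features = vectorizer.transform(new_docs)
print(model.predict(new_features))                    # cluster assignment per document

The key design point mirrors the RapidMiner stream: new documents must pass through the identical processing step, so their attributes line up with those the model expects.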
REVIEW QUESTIONS
1) What are some of the benefits of text mining as opposed to the other models you've
learned in this topic?
2) What are some ways that text-based data can be imported into RapidMiner?
3) What is a sub-process and when do you use one in RapidMiner?
4) Define the following terms: token, stem, n-gram, case-sensitive.
5) How does tokenization enable the application of data mining models to text-based data?
6) How do you view a k-Means cluster's details?
EXERCISE
For this chapter's exercise, you will mine text for common complaints against a company or
industry. Complete the following steps.