Text Mining - Data Mining for the Masses

12) Switch back to design perspective. You will see that we return to the sub-process from where we ran the model. We've put the words from our documents into attributes through tokenization, but further processing is needed to make sense of the value of the words in relation to one another. For one thing, some words in our data set don't carry much meaning. These are the conjunctions and articles that make the text readable in English but tell us little about meaning or authorship. Such words are called stopwords, and RapidMiner has built-in dictionaries in several languages to find and filter them out. We should remove these words. In the Operators search field, search for 'Stop', then add the Filter Stopwords (English) operator to the sub-process stream.

Figure 12-12. Removing stopwords such as 'and', 'or', 'the', etc. from our model.
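
Outside of RapidMiner, the idea behind stopword filtering can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Filter Stopwords (English) operator is implemented. The STOPWORDS set and the filter_stopwords function are hypothetical names, and the list is deliberately tiny; RapidMiner's built-in English dictionary is far larger. The sketch also assumes that tokens are compared in lowercase.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
# RapidMiner's built-in English dictionary is much larger.
STOPWORDS = {"a", "an", "and", "or", "the", "of", "to", "in"}


def filter_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not stopwords, preserving their order.

    Assumption: each token is lowercased before the lookup, so 'The'
    is treated the same as 'the'.
    """
    return [token for token in tokens if token.lower() not in stopwords]


tokens = ["the", "masses", "and", "the", "data"]
print(filter_stopwords(tokens))  # prints ['masses', 'data']
```

Only the words that might say something about meaning or authorship survive the filter.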

13) In some instances, uppercase letters will not match the same letters in lowercase. When text mining, this could be a problem because 'Data' might be interpreted differently from 'data'. This is known as case sensitivity. We can address this matter by adding a Transform Cases operator to our sub-process stream. Search for this operator in the Operators tab and drag it into your stream, as shown in Figure 12-13.
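
To see why case sensitivity matters, here is a minimal Python sketch of the same idea. It is not the Transform Cases operator itself. The function name transform_cases is hypothetical, and converting to lowercase is an assumption made for this illustration.

```python
from collections import Counter


def transform_cases(tokens):
    """Convert every token to lowercase so that case variants match."""
    return [token.lower() for token in tokens]


tokens = ["Data", "data", "DATA", "Mining"]

# Without case transformation, each spelling is counted separately.
raw_counts = Counter(tokens)

# After case transformation, all three variants count as one word.
normalized_counts = Counter(transform_cases(tokens))

print(raw_counts["data"])         # prints 1
print(normalized_counts["data"])  # prints 3
```

Before the transformation, 'Data', 'data', and 'DATA' would each become a separate attribute; afterward they collapse into one.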
