12) Switch back to design perspective. You will see that we return to the sub-process from
where we ran the model. Tokenization has put the words from our documents into attributes,
but further processing is needed to make sense of the value of the words in relation to
one another. For one thing, some words in our data set carry very little meaning. These
are the conjunctions and articles that make the text readable in English, but that tell
us little about meaning or authorship. We should remove them. Words of this type are
called stopwords, and RapidMiner has built-in dictionaries in several languages to find
and filter them out. In the Operators search field, search for the word 'Stop', then add
the Filter Stopwords (English) operator to the sub-process stream.
Figure 12-12. Removing stopwords such as 'and', 'or', 'the', etc. from our model.
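Outside of RapidMiner, the idea behind stopword filtering can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Filter Stopwords (English) operator is implemented: the STOPWORDS set below is a tiny hypothetical list made up for this example (RapidMiner's built-in English dictionary is far larger), and the function name filter_stopwords is our own.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
# RapidMiner's built-in English dictionary contains many more words.
STOPWORDS = {"and", "or", "the", "a", "an", "of", "to", "in"}


def filter_stopwords(tokens, stopwords=STOPWORDS):
    """Return only the tokens that do not appear in the stopword set."""
    return [token for token in tokens if token not in stopwords]


# Tokens as they might look after tokenization.
tokens = ["the", "value", "of", "the", "words", "in",
          "relation", "to", "one", "another"]

filtered = filter_stopwords(tokens)
print(filtered)
# prints ['value', 'words', 'relation', 'one', 'another']
```

The words that survive the filter are the ones that can actually tell us something about meaning or authorship.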
13) Uppercase letters do not always match the same letters in lowercase. When text mining,
this can be a problem because 'Data' might be interpreted differently from 'data'. This is
known as case sensitivity. We can address this matter by adding a Transform Cases operator
to our sub-process stream. Search for this operator in the Operators tab and drag it into
your stream, as shown in Figure 12-13.
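To see why case matters, here is a minimal Python sketch of the same idea. It is an illustration under our own assumptions, not the Transform Cases operator itself: the function name transform_cases and the sample tokens are invented for this example, and we assume the goal is to convert everything to lowercase.

```python
def transform_cases(tokens):
    """Convert every token to lowercase so that case variants match."""
    return [token.lower() for token in tokens]


# Three case variants of the same word, plus one other word.
tokens = ["Data", "data", "DATA", "Mining"]

# Before the transformation, each variant counts as a distinct word.
distinct_before = set(tokens)
print(len(distinct_before))
# prints 4

# After the transformation, the three variants collapse into one.
lowered = transform_cases(tokens)
distinct_after = sorted(set(lowered))
print(distinct_after)
# prints ['data', 'mining']
```

Without this step, a word frequency count would split 'Data', 'data' and 'DATA' into three separate attributes.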