realistic. Perhaps you don't need to throw out the data mining model entirely, but before the next
run of that model you should change it either to remove observations with missing values or to
use a more appropriate replacement value based on what you have learned. Even if you used
your data mining results and had excellent outcomes, remember that your business is constantly
moving, and through the day-to-day operations of your organization you are gathering more data.
Be sure to add this data to your training data sets, compare actual outcomes to predictions, and
tune your data mining models in accordance with your experience and the expertise you are developing.
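The two options for missing values mentioned above can be sketched outside of RapidMiner in a few
lines of Python. This is only an illustration under stated assumptions: the attribute names
(home_id, insulation, heating_oil), the sample values, and the choice of the median as the
replacement value are all hypothetical and do not come from the book's data sets.

```python
from statistics import median

# Hypothetical observations; None marks a missing value.
rows = [
    {"home_id": 1, "insulation": 6.0, "heating_oil": 250.0},
    {"home_id": 2, "insulation": None, "heating_oil": 180.0},
    {"home_id": 3, "insulation": 4.0, "heating_oil": 310.0},
    {"home_id": 4, "insulation": 8.0, "heating_oil": 140.0},
]


def drop_missing(rows, field):
    """Option 1: remove observations whose `field` is missing."""
    return [r for r in rows if r[field] is not None]


def fill_missing(rows, field, replacement):
    """Option 2: return copies of the rows with missing `field` values replaced."""
    return [
        {**r, field: replacement if r[field] is None else r[field]}
        for r in rows
    ]


# Use the median of the observed values as one possible replacement value.
observed = [r["insulation"] for r in rows if r["insulation"] is not None]
dropped = drop_missing(rows, "insulation")
filled = fill_missing(rows, "insulation", median(observed))
```

Both functions return new lists and leave the original rows untouched, so the two treatments can be
compared side by side before deciding which one the next run of the model should use.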
Consider Sarah, our hypothetical sales manager from Chapters 4 and 8. Now that we've helped her
predict heating oil usage by home through a linear regression model, Sarah can track these homes'
actual heating oil orders to see how well actual use matches our predictions. Once these customers
have established several months or years of actual heating oil consumption, their data can be fed
into the training data set for Sarah's model, helping it to be even more accurate in its
predictions.
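One simple way to see how well actual use matches predictions is to summarize the differences
between the two as an average error. The sketch below computes the mean absolute error and the
root mean squared error; the function name prediction_errors and the sample figures are
hypothetical, chosen only to illustrate the comparison.

```python
from math import sqrt


def prediction_errors(predicted, actual):
    """Return (mean absolute error, root mean squared error) for paired values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two equal-length, non-empty sequences")
    diffs = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Hypothetical predicted versus actual heating oil orders for three homes.
predicted = [250.0, 180.0, 310.0]
actual = [240.0, 200.0, 310.0]
mae, rmse = prediction_errors(predicted, actual)
```

If these error figures grow over time, that is a signal that the model should be retrained with
the newer observations added to its training data.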
One of the benefits of connecting RapidMiner to a database or data warehouse, rather than
importing data via a file (CSV, etc.), is that data can be added to the data sets in real time and
fed straight into the RapidMiner models. If you were to acquire some new training data, as Sarah
could in the scenario just described, it could be incorporated into the RapidMiner model
immediately if the data were in a connected database. With a CSV file, the new training data
would have to be added to the file, and the file would then have to be re-imported into the
RapidMiner repository.
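The difference between a connected database and a file import can be illustrated with a small
Python sketch. It uses an in-memory SQLite database as a stand-in for a real database or data
warehouse; the table name training, its columns, and the helper load_training are assumptions made
for this example and have nothing to do with how RapidMiner itself connects to a database.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a connected database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training (home_id INTEGER, insulation REAL, heating_oil REAL)"
)
conn.executemany(
    "INSERT INTO training VALUES (?, ?, ?)",
    [(1, 6.0, 250.0), (2, 4.0, 310.0)],
)
conn.commit()


def load_training(conn):
    """Read the current training set straight from the database."""
    query = "SELECT home_id, insulation, heating_oil FROM training ORDER BY home_id"
    return conn.execute(query).fetchall()


before = load_training(conn)

# A new observation arrives through day-to-day operations.
conn.execute("INSERT INTO training VALUES (?, ?, ?)", (3, 8.0, 140.0))
conn.commit()

# The next read sees the new row with no export or re-import step.
after = load_training(conn)
```

Because the training data is queried each time it is needed, the new row appears in the very next
read, whereas a file-based workflow would need the extra edit and re-import steps described above.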
As we tune and hone our models, they perform better for us. In addition to using our growing
expertise and adding more training data, there are some built-in ways that we can check a model's
performance in RapidMiner.
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
- Explain what cross-validation is, and discuss its role in the Evaluation and Deployment
  phases of CRISP-DM.
- Define false positives and explain why their existence is not all bad in data mining.
- Perform a cross-validation on a training data set in RapidMiner.
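
As a preview of the idea behind cross-validation, the following Python sketch splits a training
data set into k folds, trains on all but one fold, and tests on the fold that was held out. It is
a minimal illustration and not RapidMiner's implementation: the one-attribute least-squares line,
the function names fit_line and k_fold_mae, the sample data, and the simple index-based fold
assignment are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mae(xs, ys, k):
    """Return the mean absolute error on each of k held-out folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Observation i belongs to fold i % k; real data should be shuffled first.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        errors = [abs(slope * xs[i] + intercept - ys[i]) for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return fold_errors


# Hypothetical training data: insulation rating versus heating oil used.
insulation = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
heating_oil = [345.0, 310.0, 285.0, 250.0, 215.0, 195.0, 160.0, 130.0]

fold_errors = k_fold_mae(insulation, heating_oil, k=4)
average_error = sum(fold_errors) / len(fold_errors)
```

Every observation is used for testing exactly once and for training in the other folds, so the
average error gives a fairer picture of how the model may behave on data it has not seen than an
error measured on the training data alone.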
 