Evaluation and Deployment - Data Mining for the Masses

realistic. Perhaps you don't need to throw out the data mining model entirely, but for the next run

of that model you should be sure to change it to either remove observations with missing values,

or use a more appropriate replacement value based upon what you have learned. Even if you used

your data mining results and had excellent outcomes, remember that your business is constantly

moving, and through the day-to-day operations of your organization, you are gathering more data.

Be sure to add this data to training data sets, compare actual outcomes to predictions, and tune

your data mining models in accordance with your experience and the expertise you are developing.

Consider Sarah, our hypothetical sales manager from Chapters 4 and 8. Certainly now that we've

helped her predict heating oil usage by home through a linear regression model, Sarah can track

these homes' actual heating oil orders to see how well their actual use matches our predictions.

Once these customers have established several months or years of actual heating oil consumption,

their data can be added to the training data set for Sarah's model, helping it to be even more accurate in

its predictions.
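
As a rough illustration of this feedback loop, the sketch below fits a one-predictor linear regression, compares its predictions with later actual usage, and then refits once the new observations are added to the training data. It is written in Python rather than RapidMiner, and every number and name in it (the insulation ratings, the oil figures, `fit_line`) is hypothetical and not taken from Sarah's data set.

```python
# Hypothetical data: one predictor (insulation rating) and heating oil used.
# None of these values come from the book's data set.

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(actual, predicted):
    """Average size of the gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# 1. Train on the original observations.
train_x = [2.0, 4.0, 6.0, 8.0]
train_y = [260.0, 220.0, 180.0, 140.0]
slope, intercept = fit_line(train_x, train_y)

# 2. Predict for new homes, then compare with what they actually ordered.
new_x = [3.0, 5.0, 7.0]
predicted = [slope * x + intercept for x in new_x]
actual = [250.0, 195.0, 170.0]
error = mean_absolute_error(actual, predicted)

# 3. Add the observed outcomes to the training data and refit.
new_slope, new_intercept = fit_line(train_x + new_x, train_y + actual)

print(f"mean absolute error on new homes: {error:.2f}")
print(f"refit model: oil = {new_slope:.2f} * insulation + {new_intercept:.2f}")
```

The error measure used here (mean absolute error) is only one reasonable choice; the point is simply that actual outcomes are compared with predictions before being added to the training data.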

One of the benefits of connecting RapidMiner to a database or data warehouse, rather than

importing data via a file (CSV, etc.), is that data can be added to the data sets in real time and fed

straight into the RapidMiner models. If you were to acquire some new training data, as Sarah

could in the scenario just described, it could be immediately

incorporated into the RapidMiner model if the data were in a connected database. With a CSV file,

the new training data would have to be added into the file, and then re-imported into the

RapidMiner repository.
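
The difference between a live database connection and a one-time file import can be sketched outside RapidMiner. In the Python example below, an in-memory SQLite database stands in for the organization's database, and a copied list stands in for an imported CSV file. The table name `heating_oil`, its columns, and the helper `load_training_rows` are assumptions made for illustration only.

```python
import sqlite3

# An in-memory SQLite database stands in for a connected database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heating_oil (insulation REAL, oil_used REAL)")
conn.executemany(
    "INSERT INTO heating_oil VALUES (?, ?)",
    [(2.0, 260.0), (4.0, 220.0)],
)

def load_training_rows(connection):
    """Query the table at run time, so each run sees the rows present right now."""
    cursor = connection.execute(
        "SELECT insulation, oil_used FROM heating_oil ORDER BY rowid"
    )
    return cursor.fetchall()

before = load_training_rows(conn)

# A file import behaves like a snapshot: it holds only what existed at import time.
imported_snapshot = list(before)

# Day-to-day operations add a new observation to the database.
conn.execute("INSERT INTO heating_oil VALUES (?, ?)", (6.0, 180.0))

after = load_training_rows(conn)

print("rows seen through the database connection:", len(after))
print("rows in the earlier import:", len(imported_snapshot))
```

The query picks up the new row the next time it runs, while the snapshot stays as it was until someone repeats the import.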

As we tune and hone our models, they perform better for us. In addition to using our growing

expertise and adding more training data, there are some built-in ways that we can check a model's

performance in RapidMiner.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:

• Explain what cross-validation is, and discuss its role in the Evaluation and Deployment phases of CRISP-DM.

• Define false positives and explain why their existence is not all bad in data mining.

• Perform a cross-validation on a training data set in RapidMiner.
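
To make the first two objectives concrete before working in RapidMiner, here is a minimal Python sketch of k-fold cross-validation that also counts false positives. It does not show how RapidMiner implements cross-validation. The "model" is a toy threshold rule chosen for brevity, and the data, function names, and fold-splitting scheme are all assumptions.

```python
# Hypothetical labeled data: (feature value, class label), with label 1 = positive.
rows = [
    (1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 0),
    (5.0, 1), (7.0, 1), (8.0, 1), (9.0, 1), (10.0, 1),
]

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds by striding through them."""
    return [list(range(i, n, k)) for i in range(k)]

def train_threshold(training_rows):
    """Toy 'model': the midpoint between the mean feature value of each class."""
    pos = [x for x, label in training_rows if label == 1]
    neg = [x for x, label in training_rows if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def cross_validate(data, k):
    """Train on k-1 folds, test on the held-out fold, and tally all outcomes."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        training_rows = [r for i, r in enumerate(data) if i not in held_out]
        threshold = train_threshold(training_rows)
        for i in fold:
            x, label = data[i]
            predicted = 1 if x >= threshold else 0
            if predicted == 1 and label == 1:
                counts["tp"] += 1
            elif predicted == 1 and label == 0:
                counts["fp"] += 1  # false positive: predicted positive, actually negative
            elif predicted == 0 and label == 0:
                counts["tn"] += 1
            else:
                counts["fn"] += 1
    return counts

counts = cross_validate(rows, k=5)
accuracy = (counts["tp"] + counts["tn"]) / len(rows)
print(counts, f"accuracy={accuracy:.2f}")
```

Every observation is used for testing exactly once and is never scored by a model that was trained on it, which is what makes the resulting counts a fairer estimate of performance than scoring the training data directly.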
