was there a standard method of collection? What do the various columns and rows of data mean?
Are there acronyms or abbreviations that are unknown or unclear? You may need to do some
research in the Data Preparation phase of your data mining activities. Sometimes you will need to
meet with subject matter experts in various departments to unravel where certain data came from,
how they were collected, and how they have been coded and stored. It is critically important that
you verify the accuracy and reliability of the data as well. The old adage “It's better than nothing”
does not apply in data mining. Inaccurate or incomplete data could be worse than nothing in a
data mining activity, because decisions based upon partial or wrong data are likely to be partial or
wrong decisions. Once you have gathered, identified and understood your data assets, then you
may engage in…
CRISP-DM Step 3: Data Preparation
Data come in many shapes and formats. Some data are numeric, some are in paragraphs of text,
and others are in picture form such as charts, graphs and maps. Some data are anecdotal or
narrative, such as comments on a customer satisfaction survey or the transcript of a witness's
testimony. Data that aren't in rows or columns of numbers shouldn't be dismissed though—
sometimes non-traditional data formats can be the most information rich. We'll talk in this book
about approaches to formatting data, beginning in Chapter 2. Although rows and columns will be
one of our most common layouts, we'll also get into text mining where paragraphs can be fed into
RapidMiner and analyzed for patterns as well.
Data Preparation involves a number of activities. These may include joining two or more data
sets together, reducing data sets to only those variables that are interesting in a given data mining
exercise, scrubbing data clean of anomalies such as outlier observations or missing data, or
reformatting data for consistency purposes. For example, you may have seen a spreadsheet or
database that held phone numbers in many different formats:
(555) 555-5555
555/555-5555
555-555-5555
555.555.5555
555 555 5555
5555555555
Each of these represents the same phone number, stored in a different format. A data
mining exercise is most likely to yield good, useful results when the underlying data are as
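
As an illustration of reformatting for consistency, here is a minimal Python sketch that reduces the six phone number formats listed above to a single layout. It is not taken from the text, which works in RapidMiner; the function name normalize_phone, the choice of 555-555-5555 as the target layout, and the rule that a valid number has exactly ten digits are all assumptions made for this example.

```python
import re


def normalize_phone(raw: str) -> str:
    """Rewrite a 10-digit phone number as 555-555-5555.

    Hypothetical helper for illustration: it assumes every valid
    number has exactly ten digits and no country code or extension.
    """
    # Remove every character that is not a digit (spaces, dots, slashes, parentheses).
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        # Reject rather than guess: an inaccurate value is worse than a flagged one.
        raise ValueError(f"expected 10 digits, got {len(digits)}: {raw!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


# The six formats shown in the list above.
samples = [
    "(555) 555-5555",
    "555/555-5555",
    "555-555-5555",
    "555.555.5555",
    "555 555 5555",
    "5555555555",
]

# Collect the normalized values in a set to show they all collapse to one number.
normalized = {normalize_phone(s) for s in samples}
print(normalized)  # {'555-555-5555'}
```

Raising an error on anything that is not ten digits, instead of silently passing it through, keeps malformed entries visible so they can be checked by hand.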