years of search. The same is true with data mining. It takes work, but hopefully
not months or years.
In this topic, we present a methodology that VisMiner is designed to support
and streamline. The methodology consists of four steps:
Initial data exploration - conduct an initial exploration of the data to gain
an overall understanding of its size and characteristics, looking for clues that
should be explored in more depth.
Dataset preparation - prepare the data for analysis.
Algorithm application - select and apply data mining algorithms to the
dataset.
Results evaluation - evaluate the results of the algorithm applications,
assessing how well the algorithm results fit the data (“goodness of fit”) and
assessing the nature and strength of each input's contribution to the
algorithm outputs.
These steps are not necessarily sequential. Treat them as an iterative process
progressing towards the end result: a complete and thorough analysis. Some of
the steps may even be completed in parallel, as is the case for “initial data
exploration” and “dataset preparation”. In VisMiner, for example, interactive
visualizations designed primarily for initial data exploration also support
some of the dataset preparation tasks.
In the sections that follow, we elaborate on the tasks to be completed in each
of the steps. In later chapters, problems and exercises guide you through
completion of these tasks using VisMiner. Throughout the topic, reference is
made back to the task descriptions introduced here. As you work through the
problems and exercises, refer back to this list and use it as a reminder of
what has and has not been completed.
Initial data exploration
The primary objective of initial data exploration is to help the analyst gain an
overall understanding of the dataset. This includes:
Dataset size and format - Determine the number of observations in the
dataset. How much space does it occupy? In what format is it stored?
Possible formats include tab or comma delimited text files, fixed field text
files, tables in a relational database, and pages in a spreadsheet. Since most
datasets stored in a relational database are encoded in the proprietary format
of the database management system used to store the data, check that you
have access to software that can retrieve and manipulate the content. Look
also at the number of tables containing data of interest. If the data is spread
across multiple tables, determine how they are linked and how they might be
joined.
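
Outside of VisMiner, the size and format questions above can be answered with a few lines of script. The following is a minimal sketch in Python, not part of VisMiner itself. It assumes a tab or comma delimited text file with a header row; the function name profile_delimited_file and the sample data are hypothetical and used for illustration only.

```python
import csv
import os
import tempfile


def profile_delimited_file(path):
    """Report size on disk, likely delimiter, column names, and observation count."""
    size_bytes = os.path.getsize(path)
    with open(path, newline="", encoding="utf-8") as handle:
        header_line = handle.readline()
        # Assumption: the file is either tab or comma delimited and starts with
        # a header row, so the more frequent of the two characters is the delimiter.
        delimiter = "\t" if header_line.count("\t") > header_line.count(",") else ","
        handle.seek(0)
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        # Count non-empty rows after the header as observations.
        observations = sum(1 for row in reader if row)
    return {
        "size_bytes": size_bytes,
        "delimiter": "tab" if delimiter == "\t" else "comma",
        "columns": header,
        "observations": observations,
    }


# Hypothetical sample data, written to a temporary file for illustration only.
sample = "id\tage\tincome\n1\t34\t52000\n2\t45\t61000\n3\t29\t48000\n"
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "w", encoding="utf-8", newline="") as out:
        out.write(sample)
    profile = profile_delimited_file(sample_path)

print(profile)
```

A profile like this gives the number of observations, the space occupied, and the storage format before any analysis begins. Fixed field text files, spreadsheets, and relational tables would each need a different reader.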