Introduction to Decision Trees - Data Mining with Decision Trees: Theory and Applications


steps (note that some of the methods here are similar to Data Mining

algorithms, but these are used in the preprocessing context).

2. Creating a dataset on which discovery will be performed.

Having defined the goals, the data that will be used for the knowledge

discovery should be determined. This step includes finding out

what data is available, obtaining additional necessary data and then

integrating all the data for the knowledge discovery into one dataset,

including the attributes that will be considered for the process. This

process is very important because Data Mining learns and discovers

new patterns from the available data. This is the evidence base for

constructing the models. If some important attributes are missing, then

the entire study may fail. For a successful process, it is good to consider

as many attributes as possible at this stage. However, collecting,

organizing and operating complex data repositories is expensive.
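
The integration described in this step can be sketched as follows. This is only a minimal illustration, assuming each data source is a list of records that share a common key; the function name `integrate` and the attributes `customer_id`, `age` and `total_spent` are hypothetical and do not come from the text.

```python
def integrate(sources, key):
    """Merge several lists of records into one dataset, joined on `key`.

    Attributes that a source does not provide are left as None, so the
    gaps stay visible for the later preprocessing and cleansing step.
    """
    merged = {}
    attributes = set()
    for source in sources:
        for record in source:
            # Collect every attribute seen for this key into one row.
            merged.setdefault(record[key], {}).update(record)
            attributes.update(record)
    columns = sorted(attributes)
    return [
        {col: row.get(col) for col in columns}
        for _, row in sorted(merged.items())
    ]


# Hypothetical sources: one with demographics, one with purchase totals.
customers = [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 51}]
purchases = [{"customer_id": 1, "total_spent": 120.5}]

dataset = integrate([customers, purchases], key="customer_id")
```

In this sketch the second customer has no purchase record, so `total_spent` is None for that row rather than being silently dropped.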

3. Preprocessing and cleansing. At this stage, data reliability is

enhanced. It includes data cleaning, such as handling missing values and

removing noise or outliers. It may involve complex statistical methods,

or the use of a specific Data Mining algorithm in this context. For example,

if one suspects that a certain attribute is not reliable enough or has

too much missing data, then this attribute could become the goal of a

supervised data mining algorithm. A prediction model for this attribute

is developed, and the missing values can then be replaced with

the predicted values. The extent to which one pays attention to this

stage depends on many factors. Regardless, studying these aspects is

important and often yields insight into enterprise information systems.
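
The idea of treating an unreliable attribute as the prediction goal can be sketched as follows. This is a minimal illustration, not the method prescribed by the text: it assumes numeric predictors and uses k-nearest-neighbour regression as the supervised model, and the function name `knn_impute` and the toy attributes (`age`, `years_employed`, `income`) are hypothetical.

```python
import math


def knn_impute(records, target, predictors, k=3):
    """Fill missing `target` values (None) with a k-NN prediction.

    Records where `target` is present serve as the training set; the
    attribute with missing values is treated as the prediction goal.
    Returns new records and leaves the input untouched.
    """
    train = [r for r in records if r[target] is not None]
    if not train:
        raise ValueError("no complete records to learn from")
    filled = []
    for r in records:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        # Rank complete records by Euclidean distance over the predictors.
        nearest = sorted(
            train,
            key=lambda t: math.dist(
                [t[p] for p in predictors], [r[p] for p in predictors]
            ),
        )[:k]
        # Predict the missing value as the mean of the nearest neighbours.
        prediction = sum(t[target] for t in nearest) / len(nearest)
        filled.append({**r, target: prediction})
    return filled


# Hypothetical toy records; "income" is the attribute with a missing value.
records = [
    {"age": 25, "years_employed": 2, "income": 30.0},
    {"age": 30, "years_employed": 5, "income": 40.0},
    {"age": 45, "years_employed": 20, "income": 80.0},
    {"age": 50, "years_employed": 25, "income": 90.0},
    {"age": 28, "years_employed": 4, "income": None},
]

completed = knn_impute(records, "income", ["age", "years_employed"], k=2)
```

With `k=2`, the incomplete record is closest to the first two records, so its income is filled with their mean. In practice the predictors would usually be scaled first, since unscaled Euclidean distance lets attributes with large ranges dominate.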

4. Data transformation. At this stage, better data for

data mining is prepared and developed. One of the methods that

can be used here is dimension reduction, such as feature selection and

extraction as well as record sampling. Another method that one could

use at this stage is attribute transformation, such as discretization of

numerical attributes and functional transformation. This step is often

crucial for the success of the entire project, but it is usually very

project-specific. For example, in medical examinations, it is often not the

individual characteristics that make the difference; rather, it

is the quotient of attributes that is considered to be the most

important factor. In marketing, we may need to consider effects beyond

our control, as well as efforts and temporal issues, such as studying the

effect of advertising accumulation. However, even if we do not use the
