A Case-Based Data Mining Platform - Data Mining: Theory, Methodology, Techniques, and Applications

From the discussion above, a data mining case consists of five parts: the task, the data, the operator, the model, and the processing flow. Here we define the contents of each part in detail. As shown in Figure 2, a data mining case is defined as a tree structure with several levels. The first level contains the task, data, operator, model, and processing flow parts. The following paragraphs describe the lower levels.

As mentioned before, the data mining task includes the elements industry type, problem type, business objective, data mining goal, company name, and department name. The first four elements are used for similarity assessment, while the latter two are used for case grouping.
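
The split between similarity elements and grouping elements can be illustrated with a minimal Python sketch. The class name `Task`, the field names, and the exact-match similarity measure are assumptions made for illustration; the text does not say how similarity is actually computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Elements used for similarity assessment.
    industry_type: str
    problem_type: str
    business_objective: str
    data_mining_goal: str
    # Elements used for case grouping.
    company_name: str
    department_name: str

    def similarity_key(self):
        return (self.industry_type, self.problem_type,
                self.business_objective, self.data_mining_goal)

    def group_key(self):
        return (self.company_name, self.department_name)


def task_similarity(a, b):
    """Fraction of the four similarity elements on which two tasks agree.

    Exact string matching is a placeholder measure, not the platform's method.
    """
    matches = sum(x == y for x, y in zip(a.similarity_key(), b.similarity_key()))
    return matches / 4


# Hypothetical example values.
a = Task("telecom", "classification", "reduce churn", "predict churners",
         "Acme", "Marketing")
b = Task("telecom", "classification", "reduce churn", "rank customers",
         "Other", "Sales")
print(task_similarity(a, b))  # 0.75
```

Company and department names stay out of the similarity computation, so two cases from different organizations can still be judged similar.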

For the data in a data mining case, we record information about data storage and metadata. Typically the data are stored in a database or a data warehouse; the data comprise many tables, and each table contains many fields. Accordingly, we describe the data with three further levels: the first corresponds to the data (a set of tables), the second to the table, and the third to the field. Each level has further elements, such as the name and the type. Both the original data and the intermediate data generated during the data mining process are stored, so a data mining case contains several data description parts.
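
The three description levels (data, table, field) map naturally onto nested records. The following sketch assumes hypothetical class and attribute names; only the three levels and the "name" and "type" elements come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldDescription:
    """Third level: one field (column) of a table."""
    name: str
    type: str


@dataclass
class TableDescription:
    """Second level: one table and its fields."""
    name: str
    fields: List[FieldDescription] = field(default_factory=list)


@dataclass
class DataDescription:
    """First level: a set of tables plus storage information."""
    name: str
    storage: str  # assumed free-text value, e.g. "database" or "data warehouse"
    tables: List[TableDescription] = field(default_factory=list)

    def field_count(self):
        return sum(len(t.fields) for t in self.tables)


# A case holds several descriptions: the original data and each
# intermediate data set produced during mining (hypothetical values).
original = DataDescription("customers_raw", "database", [
    TableDescription("customer", [FieldDescription("id", "integer"),
                                  FieldDescription("age", "integer")]),
])
intermediate = DataDescription("customers_cleaned", "database", [
    TableDescription("customer_clean", [FieldDescription("id", "integer")]),
])
data_parts = [original, intermediate]
```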

The operator in a data mining case has elements such as its path, name, category, function, input, parameters, output, and guideline. The operator guideline records reusable knowledge about the context of an operator, answering questions such as why the operator is required. Furthermore, different operators have different parameters, so we separate the operator parameters from the operator itself and define them as next-level elements. An operator parameter includes elements such as its name, type, and value type. Among these elements, the parameter guideline is an important part: it records reusable knowledge about the internal issues of an operator, such as which parameters are required and how to set their values under certain conditions.
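
As a rough illustration, the operator and its next-level parameter elements could be modelled as two linked records. The class names, the string-typed attributes, and the example operator are assumptions, not part of the platform described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OperatorParameter:
    name: str
    type: str
    value_type: str
    # Reusable knowledge about when the parameter is needed and how to set it.
    guideline: str = ""


@dataclass
class Operator:
    path: str
    name: str
    category: str
    function: str
    input: str
    output: str
    # Reusable knowledge about why this operator is required in context.
    guideline: str = ""
    parameters: List[OperatorParameter] = field(default_factory=list)

    def parameter_guidelines(self):
        """Map each parameter name to its guideline, skipping empty ones."""
        return {p.name: p.guideline for p in self.parameters if p.guideline}


# Hypothetical operator, for illustration only.
discretize = Operator(
    path="operators/preprocess", name="Discretize", category="preprocessing",
    function="bin a numeric field", input="D1", output="D2",
    guideline="Required because the chosen model expects categorical input.",
    parameters=[
        OperatorParameter("bins", "numeric", "integer",
                          "Use fewer bins when the data set is small."),
        OperatorParameter("field", "selection", "string"),
    ],
)
```

Keeping the two guideline levels separate mirrors the distinction in the text: the operator guideline explains why the step exists, while the parameter guideline explains how to configure it.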

For the model generated in the data mining process, we do not define our own representation format; we simply use the Predictive Model Markup Language (PMML) [5], which has become an industry standard, to represent the model. In a data mining case, the model therefore includes the elements model type, model parameter, PMML code path, and PMML code name.

Finally, the processing flow describes how the data, the operators, and the model(s) are connected. The numbers of data, operators, and models are therefore included as elements of the processing flow. The most important part of the processing flow is its connections: each connection has an input ID, an operator ID, and an output ID. A data mining case has exactly one processing flow, and each processing flow belongs to exactly one data mining case.
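
Because each connection is an (input ID, operator ID, output ID) triple, the processing flow behaves like a small directed graph. The sketch below assumes hypothetical names and string IDs, and adds a traversal helper that the text does not describe.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Connection:
    input_id: str
    operator_id: str
    output_id: str


@dataclass
class ProcessingFlow:
    data_count: int
    operator_count: int
    model_count: int
    connections: List[Connection]

    def reachable(self, start_id):
        """IDs reachable from start_id by following connections,
        listed in breadth-first discovery order."""
        seen, queue = [], [start_id]
        while queue:
            current = queue.pop(0)
            for c in self.connections:
                if c.input_id == current and c.output_id not in seen:
                    seen.append(c.output_id)
                    queue.append(c.output_id)
        return seen


# Hypothetical flow: raw data D1 is cleaned into D2, which trains model M1.
flow = ProcessingFlow(
    data_count=2, operator_count=2, model_count=1,
    connections=[Connection("D1", "O1", "D2"),
                 Connection("D2", "O2", "M1")],
)
print(flow.reachable("D1"))  # ['D2', 'M1']
```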

We use XML to represent a data mining case, because XML is easy to extend and exchange. In our work, we have defined a corresponding data mining case representation language (DMCRL). The XML-based DMCRL can easily be extended to represent all kinds of data mining cases and is easy to integrate with PMML.
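
The DMCRL schema itself is not given in this section, so the following is only a sketch of what an XML case skeleton with the five first-level parts might look like. All element and attribute names (`DataMiningCase`, `Task`, `PMMLCodePath`, and so on) and the example path are invented for illustration; the code uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def build_case_xml(case_id, task, pmml_path):
    """Build a case skeleton with the five first-level parts.

    Element names are hypothetical, not the actual DMCRL vocabulary.
    Keys of `task` must be valid XML element names.
    """
    root = ET.Element("DataMiningCase", {"id": case_id})
    task_el = ET.SubElement(root, "Task")
    for key, value in task.items():
        ET.SubElement(task_el, key).text = value
    ET.SubElement(root, "Data")
    ET.SubElement(root, "Operator")
    model_el = ET.SubElement(root, "Model")
    # The model itself lives in a separate PMML document; the case
    # only records where to find it.
    ET.SubElement(model_el, "PMMLCodePath").text = pmml_path
    ET.SubElement(root, "ProcessingFlow")
    return root


case = build_case_xml(
    "case-1",
    {"IndustryType": "telecom", "ProblemType": "classification"},
    "models/churn.pmml",  # hypothetical path
)
xml_text = ET.tostring(case, encoding="unicode")
```

Storing only a path to the PMML document keeps the case format independent of PMML, which is one plausible way to achieve the easy integration mentioned above.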
