From the above discussion, we can see that a data mining case consists of five parts:
the task, the data, the operator, the model, and the processing flow. Here, we define
the contents of each part in detail. As shown in Figure 2, a data mining case is
defined as a tree structure with several levels. The first level comprises the task,
the data, the operator, the model, and the processing flow. The following paragraphs
describe the contents of the other levels.
The data mining task, as mentioned before, includes the elements of industry type,
problem type, business objective, data mining goal, company name, and department
name. The first four elements are used for similarity assessment, while the latter
two are used for case grouping.
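The split between similarity elements and grouping elements can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `Task`, the function `task_similarity`, and the matching-fraction measure are all assumptions, since the text does not specify how similarity is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Elements used for similarity assessment
    industry_type: str
    problem_type: str
    business_objective: str
    data_mining_goal: str
    # Elements used for case grouping
    company_name: str
    department_name: str

    def similarity_features(self):
        """Return the four elements compared during similarity assessment."""
        return (self.industry_type, self.problem_type,
                self.business_objective, self.data_mining_goal)

    def group_key(self):
        """Return the two elements used to group cases."""
        return (self.company_name, self.department_name)


def task_similarity(a: Task, b: Task) -> float:
    """Hypothetical measure: fraction of similarity elements that match exactly."""
    fa, fb = a.similarity_features(), b.similarity_features()
    matches = sum(x == y for x, y in zip(fa, fb))
    return matches / len(fa)
```

A real system would likely weight the elements or use a taxonomy-based distance; exact matching is used here only to keep the sketch short.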
For the data, a data mining case includes information about data storage and
metadata. Typically, the data are stored in a database or a data warehouse; the data
comprise many tables, and each table contains many fields. Based on this, we describe
the data with three further levels: the first level corresponds to the data (a set of
tables), the second level corresponds to the table, and the third level corresponds to
the field. Each level has further elements, such as the name and the type. A data
mining case stores both the original data and the intermediate data generated during
the data mining process, so a case contains several data description parts.
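The three-level data description (data, table, field) maps naturally onto nested records. The sketch below is illustrative only: the class names, the `storage` attribute, and the `intermediate` flag are assumptions introduced here, and the real element set is the one defined in the authors' Figure 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldDesc:
    """Third level: one field (column) of a table."""
    name: str
    type: str


@dataclass
class TableDesc:
    """Second level: one table and its fields."""
    name: str
    fields: List[FieldDesc] = field(default_factory=list)


@dataclass
class DataDesc:
    """First level: a set of tables held in a database or data warehouse."""
    name: str
    storage: str                 # hypothetical: e.g. "database" or "data warehouse"
    tables: List[TableDesc] = field(default_factory=list)
    intermediate: bool = False   # hypothetical flag: True for data produced mid-process

    def field_count(self) -> int:
        """Total number of fields across all tables."""
        return sum(len(t.fields) for t in self.tables)
```

A case would then hold a list of `DataDesc` objects, one for the original data and one for each intermediate result.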
The operator in a data mining case has elements such as its path, name, category,
function, input, parameters, output, and guideline. The operator guideline records
reusable knowledge about the context of an operator, answering questions such as why
the operator is required. Furthermore, different operators have different parameters.
We therefore separate the operator parameter from the operator itself and define it
as a next-level element. An operator parameter includes elements such as its name,
type, and value type, among others. Of these, the parameter guideline is an important
part: it records reusable knowledge about the internal issues of an operator,
answering questions such as which parameters are required and how to set their values
under certain conditions.
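The separation of operator and operator parameter, each carrying its own guideline, could look like the following sketch. The class and method names are assumptions made for illustration; only the element names come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OperatorParameter:
    """Next-level element: one parameter of an operator."""
    name: str
    type: str
    value_type: str
    guideline: str = ""   # which values to use, and under what conditions


@dataclass
class Operator:
    path: str
    name: str
    category: str
    function: str
    input: str
    output: str
    guideline: str = ""   # why this operator is required in this context
    parameters: List[OperatorParameter] = field(default_factory=list)

    def parameter_guidelines(self) -> Dict[str, str]:
        """Map parameter name to its guideline, skipping empty guidelines."""
        return {p.name: p.guideline for p in self.parameters if p.guideline}
```

Keeping the two guideline fields apart mirrors the distinction in the text: the operator guideline covers context, and the parameter guideline covers internal settings.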
For the model generated in the data mining process, we do not define our own
representation format. Instead, we use the PMML language [5], which has become an
industry standard, to represent the model. In a data mining case, the model therefore
includes the elements of model type, model parameter, PMML code path, and PMML code name.
Finally, the processing flow describes how the data, the operators, and the model(s)
are connected. The numbers of data, operators, and models are therefore included as
elements of the processing flow. The most important part of the processing flow is
its connections: each connection has an input ID, an operator ID, and an output ID.
A data mining case has exactly one processing flow, and each processing flow
corresponds to exactly one data mining case.
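Because each connection is an (input ID, operator ID, output ID) triple, the order in which operators were applied can be recovered by following outputs to the next input. The sketch below assumes a simple linear chain in which each data ID feeds at most one connection; the names `Connection` and `operators_after` are hypothetical, and real flows may branch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Connection:
    input_id: str
    operator_id: str
    output_id: str


def operators_after(connections: List[Connection], data_id: str) -> List[str]:
    """List operator IDs applied in sequence, starting from data_id.

    Assumes a linear chain: each data ID is the input of at most one connection.
    """
    by_input = {c.input_id: c for c in connections}
    ops = []
    seen = set()
    # Follow output -> input links until the chain ends or repeats.
    while data_id in by_input and data_id not in seen:
        seen.add(data_id)
        conn = by_input[data_id]
        ops.append(conn.operator_id)
        data_id = conn.output_id
    return ops
```

For example, with connections `d1 -> op1 -> d2` and `d2 -> op2 -> m1`, calling `operators_after(connections, "d1")` yields the operators in the order `op1`, `op2`.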
For case representation, we use XML, which is easy to extend and exchange. In our
work, we have defined a corresponding data mining case representation language
(DMCRL). The XML-based DMCRL is easy to extend to represent all kinds of data mining
cases and easy to integrate with PMML.
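To give a feel for an XML-based case representation, the sketch below builds a skeleton case with the five first-level parts using Python's standard `xml.etree.ElementTree`. The element and attribute names (`DataMiningCase`, `Connection`, `inputID`, and so on) are invented for illustration; the actual DMCRL schema is not given in the text and may differ.

```python
import xml.etree.ElementTree as ET


def build_case_xml(task, connections):
    """Build a skeleton case element with the five first-level parts.

    task: dict mapping (hypothetical) element names to text values.
    connections: iterable of (input_id, operator_id, output_id) triples.
    """
    case = ET.Element("DataMiningCase")
    task_el = ET.SubElement(case, "Task")
    for key, value in task.items():
        ET.SubElement(task_el, key).text = value
    # Left empty in this sketch; a full case would nest their elements here.
    for part in ("Data", "Operator", "Model"):
        ET.SubElement(case, part)
    flow = ET.SubElement(case, "ProcessingFlow")
    for input_id, operator_id, output_id in connections:
        ET.SubElement(flow, "Connection", inputID=input_id,
                      operatorID=operator_id, outputID=output_id)
    return case


# Made-up example values, for illustration only.
case = build_case_xml(
    {"IndustryType": "telecom", "ProblemType": "churn prediction"},
    [("d1", "op1", "d2")],
)
xml_text = ET.tostring(case, encoding="unicode")
```

Since PMML is itself XML, a case document of this kind could reference a PMML file by the path and name stored in the model part.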