3.4 The Role of Databases and Data Warehouses in Data Mining
An abundant source of data for mining is found in relational data-
bases and data warehouses. Historically, data mining tools have
focused on data contained in flat files. Flat files, however, can be dif-
ficult to manage and control. Database management systems
(DBMSs) offer query capabilities, security, correctness guarantees,
and access control, among other features. In addition, DBMSs, like
some other tools, provide metadata control for data tables, explicitly
capturing column names, data types, and comments as part of the
table definition. Today, virtually all data mining tools are capable of
accessing data stored in commercial relational databases, such as
Oracle Database, IBM DB2, and Microsoft SQL Server. Querying and
transformations, using languages like SQL, facilitate analyzing data
and preparing it for mining. Multiple tables can be joined easily to
produce a single table—often a key step in data preparation.
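The join step described above can be sketched in plain Java. This is a minimal illustration, not a prescribed method: the CUSTOMERS and PURCHASES tables, their columns, and the class and method names are all hypothetical, and the join is done in memory so the example runs without a database. In practice the SQL shown in the comment would be issued against the DBMS itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tables: CUSTOMERS(cust_id, age) and PURCHASES(cust_id, amount).
// Roughly equivalent SQL (illustrative only):
//   SELECT c.cust_id, c.age,
//          COUNT(p.cust_id)           AS num_purchases,
//          COALESCE(SUM(p.amount), 0) AS total_spent
//   FROM customers c LEFT OUTER JOIN purchases p ON c.cust_id = p.cust_id
//   GROUP BY c.cust_id, c.age
class JoinSketch {
    record Customer(int custId, int age) {}
    record Purchase(int custId, double amount) {}
    // One row per customer: the single table handed to a mining algorithm.
    record MiningRow(int custId, int age, int numPurchases, double totalSpent) {}

    static List<MiningRow> buildMiningTable(List<Customer> customers,
                                            List<Purchase> purchases) {
        // Aggregate purchases per customer: [0] = count, [1] = sum of amounts.
        Map<Integer, double[]> totals = new LinkedHashMap<>();
        for (Purchase p : purchases) {
            double[] t = totals.computeIfAbsent(p.custId(), k -> new double[2]);
            t[0] += 1;
            t[1] += p.amount();
        }
        // Outer join: customers without purchases still get a row with zeros.
        List<MiningRow> rows = new ArrayList<>();
        for (Customer c : customers) {
            double[] t = totals.getOrDefault(c.custId(), new double[2]);
            rows.add(new MiningRow(c.custId(), c.age(), (int) t[0], t[1]));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Customer> customers = List.of(new Customer(1, 34), new Customer(2, 51));
        List<Purchase> purchases = List.of(new Purchase(1, 20.0), new Purchase(1, 5.5));
        for (MiningRow row : buildMiningTable(customers, purchases)) {
            System.out.println(row);
        }
    }
}
```

The outer join is a deliberate choice in this sketch: an inner join would silently drop customers with no purchases, which is often exactly the group a mining model needs to see.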
Whereas individual databases readily support data mining
needs, large organizations often have dozens, hundreds, or even
thousands of databases spread across geographical locations and
numerous operational systems. The designers of these databases
may have adopted local conventions for schema design, naming,
values used in columns, and so on. As a result, integrating data from
multiple databases to obtain a global picture becomes quite a
challenge. A data warehouse [Inmon 1995, Ponniah 2001] is designed to
provide a common view of all the relevant data in an organization.
Here, relevant is defined as whatever is needed to run the business
more effectively or provide management insight into the health of
the business. For example, businesses implementing a customer-
oriented data warehouse strive to get a “360-degree” view of their
customers (i.e., being able to see all aspects of a customer's
interaction with the business). Designing a data warehouse with advanced
analytics and data mining requirements in mind can greatly
increase the value of the data warehouse and reduce the effort needed to
extract knowledge from it once it is completed.
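To make the integration problem concrete, the following sketch shows one way local naming and value conventions might be mapped onto a common warehouse view. Everything here is an assumption made for illustration: the ConventionMapper class, the column names (CUST_NO, SEX, customer_id, gender), and the value codes are hypothetical, and real ETL tooling handles far more than simple renaming and recoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: translates a row from one operational database into
// the common column names and value codes used by the warehouse.
class ConventionMapper {
    // Local column name -> warehouse column name.
    private final Map<String, String> columnNames;
    // Warehouse column name -> (local value code -> common value code).
    private final Map<String, Map<String, String>> valueCodes;

    ConventionMapper(Map<String, String> columnNames,
                     Map<String, Map<String, String>> valueCodes) {
        this.columnNames = columnNames;
        this.valueCodes = valueCodes;
    }

    Map<String, String> toWarehouseRow(Map<String, String> localRow) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : localRow.entrySet()) {
            // Unmapped columns keep their local name.
            String column = columnNames.getOrDefault(e.getKey(), e.getKey());
            String value = e.getValue();
            Map<String, String> codes = valueCodes.get(column);
            if (value != null && codes != null) {
                // Unmapped values pass through unchanged.
                value = codes.getOrDefault(value, value);
            }
            out.put(column, value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Source A stores gender as "1"/"2" in a column named SEX.
        ConventionMapper sourceA = new ConventionMapper(
            Map.of("CUST_NO", "customer_id", "SEX", "gender"),
            Map.of("gender", Map.of("1", "M", "2", "F")));

        Map<String, String> localRow = new LinkedHashMap<>();
        localRow.put("CUST_NO", "17");
        localRow.put("SEX", "1");
        System.out.println(sourceA.toWarehouseRow(localRow));
    }
}
```

Each source database would get its own mapper instance, so that rows from every system arrive in the warehouse with the same column names and codes.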
Most database vendors provide ETL (extract, transform, load) tools for
collecting and cleaning data when building a data warehouse. Data mining
can complement these tools during warehouse creation by identifying data
quality issues or populating missing values. To populate missing values,
the user can build predictive models from the other associated
attributes, a technique called value imputation. Data mining can also
be used to determine what data should be given special attention for
accuracy due to its value in model building and scoring. Further, it