Database Reference
Coordinate (UTC) time is also meaningless. Somehow, Web server time must be adjusted to
the time zone of the customer.
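The adjustment can be sketched in a few lines of Python. This is only an illustration: it assumes that each customer record carries an IANA time zone name such as "America/Los_Angeles" (an assumption, not something stated above), and the function name to_customer_local is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_customer_local(utc_timestamp: datetime, customer_tz: str) -> datetime:
    """Convert a Web server timestamp recorded in UTC to the customer's local time."""
    # Treat a naive timestamp as UTC, since that is how the server is assumed to log it.
    if utc_timestamp.tzinfo is None:
        utc_timestamp = utc_timestamp.replace(tzinfo=timezone.utc)
    return utc_timestamp.astimezone(ZoneInfo(customer_tz))


# An order logged at 02:30 UTC was placed in the early evening of the previous
# day for a customer in Los Angeles.
order_time_utc = datetime(2024, 1, 15, 2, 30, tzinfo=timezone.utc)
local_time = to_customer_local(order_time_utc, "America/Los_Angeles")
print(local_time.isoformat())
```

Using named time zones rather than fixed hour offsets lets the library account for daylight saving time.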
Another problem is nonintegrated data. Suppose, for example, that an organization
wants to report on customer orders and payment behavior. Unfortunately, order data are
stored in a Microsoft Dynamics CRM system, whereas payment data are recorded in an Oracle
PeopleSoft financial management database. To perform the analysis, the data must somehow
be integrated.
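As a rough sketch of what such integration involves, the following Python fragment combines order rows and payment rows once both have been extracted from their source systems. The field names (customer_id, order_total, days_to_pay) and the sample values are invented for illustration. The sketch also assumes that both systems identify customers with the same key, which in practice is often the hardest part of the problem.

```python
def integrate(orders, payments):
    """Attach each customer's average days-to-pay to that customer's orders."""
    # Group payment history by the shared customer key.
    days_by_customer = {}
    for payment in payments:
        days_by_customer.setdefault(payment["customer_id"], []).append(
            payment["days_to_pay"]
        )

    combined = []
    for order in orders:
        days = days_by_customer.get(order["customer_id"], [])
        # None marks customers with no payment history in the financial system.
        average = sum(days) / len(days) if days else None
        combined.append({**order, "avg_days_to_pay": average})
    return combined


# Hypothetical extracts from the CRM system and the financial system.
orders = [
    {"customer_id": 1, "order_id": "A100", "order_total": 250.0},
    {"customer_id": 2, "order_id": "A101", "order_total": 75.0},
    {"customer_id": 3, "order_id": "A102", "order_total": 120.0},
]
payments = [
    {"customer_id": 1, "days_to_pay": 10},
    {"customer_id": 1, "days_to_pay": 20},
    {"customer_id": 2, "days_to_pay": 45},
]

for row in integrate(orders, payments):
    print(row)
```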
The next problem is that data can be inappropriately formatted. First, data can be too
fine. For example, suppose that we want to analyze the placement of graphics and controls
on an order entry Web page. It is possible to capture the customers' clicking behavior in what
is termed click-stream data. However, click-stream data include everything the customer
does. In the middle of the order stream, there may be data for clicks on the news, e-mail,
instant chat, and the weather. Although all of these data might be useful for a study of
consumer computer behavior, they will be overwhelming if all we want to know is how customers
respond to an ad located on the screen. Because the data are too fine, the data analysts must
throw millions and millions of clicks away before they can proceed.
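Throwing away the irrelevant clicks amounts to a filter. The sketch below assumes, purely for illustration, that each click record names the page element that was clicked and that elements on the order entry page share the prefix "order_form."; neither convention comes from a real click-stream format.

```python
def order_page_clicks(clicks, prefix="order_form."):
    """Keep only the clicks whose target belongs to the order entry page."""
    return [click for click in clicks if click["target"].startswith(prefix)]


# Hypothetical click-stream records for one customer session.
clicks = [
    {"session": "s1", "target": "order_form.quantity"},
    {"session": "s1", "target": "news.headline"},
    {"session": "s1", "target": "weather.forecast"},
    {"session": "s1", "target": "order_form.promo_ad"},
    {"session": "s1", "target": "chat.open"},
]

relevant = order_page_clicks(clicks)
print(len(clicks), "clicks captured;", len(relevant), "kept for the analysis")
```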
Data can also be too coarse. A file of order totals cannot be used for a market basket
analysis, which identifies items that are commonly purchased together. Market basket analyses
require item-level data; we need to know which items were purchased with which others. This
doesn't mean the order total data are useless; they can be adequate for other analyses, but they
just won't do for a market basket analysis.
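To see why item-level data are required, consider a minimal sketch of the first step of a market basket analysis: counting how often each pair of items appears in the same order. The item names are invented, and a full analysis would go on to compute measures such as support and confidence. Nothing in this computation could be derived from order totals alone.

```python
from collections import Counter
from itertools import combinations


def pair_counts(orders):
    """Count how many orders contain each unordered pair of items."""
    counts = Counter()
    for items in orders:
        # Sort the distinct items so that each pair is always keyed the same way.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


# Hypothetical item-level order data: one list of items per order.
orders = [
    ["mask", "fins"],
    ["mask", "fins", "snorkel"],
    ["tank"],
]

counts = pair_counts(orders)
print(counts.most_common(1))
```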
If the data are too fine, they can be made coarser by summing and combining, work that an
analyst with a computer can readily do. If the data are too coarse, however, they
cannot be separated into their constituent parts.
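The asymmetry is easy to see in code. The following sketch, with invented field names, sums item-level rows into order totals. There is no corresponding function that could recover the line items from the totals, because many different combinations of items produce the same total.

```python
from collections import defaultdict


def order_totals(line_items):
    """Coarsen item-level rows into one total per order."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item["order_id"]] += item["quantity"] * item["unit_price"]
    return dict(totals)


# Hypothetical item-level data.
line_items = [
    {"order_id": "A100", "item": "mask", "quantity": 2, "unit_price": 25.0},
    {"order_id": "A100", "item": "snorkel", "quantity": 1, "unit_price": 10.0},
    {"order_id": "A101", "item": "fins", "quantity": 1, "unit_price": 40.0},
]

totals = order_totals(line_items)
print(totals)
```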
The final problem listed in Figure 12-5 concerns the issue of too much data. We can have
an excess of columns, rows, or both. To illustrate the problem of too many columns (a
synonym for attributes), suppose that we want to know the attributes that influence customers'
responses to a promotion. Between customer data stored within the organization and
customer data that can be purchased, we might have a hundred or more different attributes, or
columns, to consider. How do we select among them? Because of a phenomenon called the
curse of dimensionality, the more attributes there are, the easier it is to build a model that
fits the sample data but that is worthless as a predictor. For this and other reasons, the number
of attributes should be reduced, and one of the major activities in data mining concerns the
efficient and effective selection of variables.
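A toy experiment can make the phenomenon concrete. In the sketch below, every attribute and every response is generated at random, so by construction no attribute has any predictive value. The "model" is a simple lookup table that remembers the majority response for each combination of attribute values; this is an assumption chosen for illustration and is not a technique described in this chapter. With only 2 attributes the table cannot fit the sample well, but with 40 attributes nearly every sample row has a unique combination of values, so the table fits the sample perfectly while doing no better than chance on fresh data.

```python
import random


def make_rows(n_rows, n_attrs, rng):
    """Generate rows of random yes/no attributes with a random yes/no response."""
    return [
        (tuple(rng.randint(0, 1) for _ in range(n_attrs)), rng.randint(0, 1))
        for _ in range(n_rows)
    ]


def fit_lookup(rows):
    """Remember the majority response for each attribute combination seen."""
    seen = {}
    for attrs, label in rows:
        seen.setdefault(attrs, []).append(label)
    return {
        attrs: 1 if 2 * sum(labels) >= len(labels) else 0
        for attrs, labels in seen.items()
    }


def accuracy(model, rows, default=0):
    """Fraction of rows whose response the model reproduces."""
    hits = sum(1 for attrs, label in rows if model.get(attrs, default) == label)
    return hits / len(rows)


def experiment(n_attrs, n_rows=100, seed=0):
    """Return (accuracy on the sample, accuracy on fresh data)."""
    rng = random.Random(seed)
    sample = make_rows(n_rows, n_attrs, rng)
    fresh = make_rows(n_rows, n_attrs, rng)
    model = fit_lookup(sample)
    return accuracy(model, sample), accuracy(model, fresh)


for n_attrs in (2, 40):
    on_sample, on_fresh = experiment(n_attrs)
    print(f"{n_attrs} attributes: sample {on_sample:.2f}, fresh data {on_fresh:.2f}")
```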
Finally, we may have too many instances, or rows, of data. Suppose that we want to
analyze click-stream data on CNN.com. How many clicks does this site receive per month?
Millions upon millions! To meaningfully analyze such data, we need to reduce the number of
instances. A good solution to this problem is statistical sampling. However, developing a
reliable sample requires specialized expertise and information system tools.
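One standard technique for drawing a simple random sample from a stream that is too large to hold in memory is reservoir sampling, sketched below. It is offered as an example of the mechanics only. It does not address the harder questions of sample design, such as what the sample should represent and how large it must be, which are where the specialized expertise comes in.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Draw k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample


# A stand-in for a month of click records; here each click is just a number.
clicks = range(1_000_000)
sample = reservoir_sample(clicks, k=1000, seed=42)
print(len(sample), "clicks sampled")
```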
Purchasing Data from Vendors
Data warehouses often include data that are purchased from outside sources. A typical
example is customer credit data. Figure 12-6 lists some of the consumer data that can be purchased
from the KBM Group in their AmeriLINK database of consumer data
(www.kbmg.com/services-expertise/data/data-sourcing/datacard-search-and-listings/). An amazing, and from a privacy
standpoint frightening, amount of data is available just from this one vendor.
Data Warehouses Versus Data Marts
You can think of a data warehouse as a distributor in a supply chain. The data warehouse
takes data from the data manufacturers (operational systems and purchased data), cleans
and processes them, and locates the data on the shelves, so to speak, of the data warehouse.
The people who work in a data warehouse are experts at data management, data cleaning,
data transformation, and the like. However, they are not usually experts in a given business
function.
A data mart is a collection of data that is smaller than that in the data warehouse and
that addresses a particular component or functional area of the business. A data mart is like
a retail store in a supply chain. Users in the data mart obtain data that pertain to a particular
 