Big Data, Data Warehouses, and Business Intelligence Systems - Database Processing: Fundamentals, Design, and Implementation

Coordinated Universal Time (UTC) is also meaningless. Somehow, Web server time must be adjusted to the time zone of the customer.
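As a minimal sketch of that adjustment, the following Python fragment shifts a UTC server timestamp into a customer's local time. It assumes the customer's offset from UTC is already known and stored in hours; the function name and the sample timestamp are hypothetical. A fixed offset ignores daylight saving time, so a fuller version would likely store a named time zone for each customer instead.

```python
from datetime import datetime, timedelta, timezone

def to_customer_time(server_time_utc, utc_offset_hours):
    """Shift a timezone-aware UTC timestamp into the customer's local time.

    Assumption: the customer's UTC offset is known and fixed (no DST handling).
    """
    customer_zone = timezone(timedelta(hours=utc_offset_hours))
    return server_time_utc.astimezone(customer_zone)

# Hypothetical order logged by the Web server at 03:30 UTC.
server_time = datetime(2024, 1, 15, 3, 30, tzinfo=timezone.utc)

# For a customer eight hours behind UTC, the order falls on the previous evening.
customer_time = to_customer_time(server_time, -8)
print(customer_time.isoformat())
```

The point of the example is that the same click can fall on a different calendar day, and a different part of the day, once it is expressed in the customer's own time.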

Another problem is nonintegrated data. Suppose, for example, that an organization wants to report on customer orders and payment behavior. Unfortunately, order data are stored in a Microsoft Dynamics CRM system, whereas payment data are recorded in an Oracle PeopleSoft financial management database. To perform the analysis, the data must somehow be integrated.
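The sketch below illustrates one simple form of integration, assuming both systems have already been exported to lists of records and that they share a common customer identifier. The field names (`customer_id`, `amount`, `paid`) and the sample rows are hypothetical; in practice the two systems often use different keys, and matching them is a large part of the work.

```python
from collections import defaultdict

# Hypothetical extract from the CRM system (orders).
crm_orders = [
    {"customer_id": "C001", "order_id": 1, "amount": 250.0},
    {"customer_id": "C002", "order_id": 2, "amount": 100.0},
    {"customer_id": "C001", "order_id": 3, "amount": 50.0},
]

# Hypothetical extract from the financial system (payments).
finance_payments = [
    {"customer_id": "C001", "paid": 250.0},
    {"customer_id": "C002", "paid": 100.0},
]

def integrate(orders, payments):
    """Combine order and payment records into one summary row per customer."""
    summary = defaultdict(lambda: {"ordered": 0.0, "paid": 0.0})
    for order in orders:
        summary[order["customer_id"]]["ordered"] += order["amount"]
    for payment in payments:
        summary[payment["customer_id"]]["paid"] += payment["paid"]
    # Add the figure neither source system can report on its own.
    return {
        customer: {**totals, "outstanding": totals["ordered"] - totals["paid"]}
        for customer, totals in summary.items()
    }

report = integrate(crm_orders, finance_payments)
for customer, row in sorted(report.items()):
    print(customer, row)
```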

The next problem is that data can be inappropriately formatted. First, data can be too fine. For example, suppose that we want to analyze the placement of graphics and controls on an order entry Web page. It is possible to capture the customers' clicking behavior in what is termed click-stream data. However, click-stream data include everything the customer does. In the middle of the order stream, there may be data for clicks on the news, e-mail, instant chat, and the weather. Although all of these data might be useful for a study of consumer computer behavior, they will be overwhelming if all we want to know is how customers respond to an ad located on the screen. Because the data are too fine, the data analysts must throw millions and millions of clicks away before they can proceed.
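A minimal sketch of that discarding step appears below. It assumes each click has been recorded with the URL of the page and the name of the element clicked; the record layout, host names, and element names are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical click-stream records: everything the customer did.
clicks = [
    {"session": "s1", "url": "https://shop.example.com/order/entry", "target": "promo_ad"},
    {"session": "s1", "url": "https://news.example.org/headlines", "target": "story_link"},
    {"session": "s1", "url": "https://shop.example.com/order/entry", "target": "submit_button"},
    {"session": "s2", "url": "https://weather.example.net/today", "target": "radar_map"},
]

def clicks_on_page(events, host, path_prefix):
    """Keep only the clicks made on pages of one site under one path."""
    kept = []
    for event in events:
        parts = urlparse(event["url"])
        if parts.hostname == host and parts.path.startswith(path_prefix):
            kept.append(event)
    return kept

# Discard news, weather, and other unrelated clicks.
order_page_clicks = clicks_on_page(clicks, "shop.example.com", "/order/")

# Of the remaining clicks, find those on the ad under study.
ad_clicks = [e for e in order_page_clicks if e["target"] == "promo_ad"]
print(len(order_page_clicks), len(ad_clicks))
```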

Data can also be too coarse. A file of order totals cannot be used for a market basket analysis, which identifies items that are commonly purchased together. Market basket analyses require item-level data; we need to know which items were purchased with which others. This doesn't mean the order total data are useless; they can be adequate for other analyses, but they just won't do for a market basket analysis.
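To see why item-level data are required, consider the following sketch, which counts how often each pair of items appears in the same order. The baskets are hypothetical, and pair counting is only the first step of a market basket analysis, but the step cannot even begin from order totals alone.

```python
from collections import Counter
from itertools import combinations

# Hypothetical item-level order data: one set of items per order.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

def pair_counts(orders):
    """Count how many orders contain each pair of items."""
    counts = Counter()
    for items in orders:
        # Sorting makes ("bread", "milk") and ("milk", "bread") the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
for pair, n in counts.most_common():
    print(pair, n)
```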

If the data are too fine, they can be made coarser by summing and combining. An analyst and a computer can sum and combine such data. If the data are too coarse, however, they cannot be separated into their constituent parts.
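The asymmetry can be illustrated with a short sketch. Hypothetical line items (order number, item, price) are summed into order totals; nothing in the resulting totals allows the individual items to be recovered.

```python
from collections import defaultdict

# Hypothetical fine-grained data: one row per item on each order.
line_items = [
    (1001, "bread", 3.50),
    (1001, "milk", 2.25),
    (1002, "eggs", 4.00),
]

def order_totals(items):
    """Make fine data coarser by summing item prices for each order."""
    totals = defaultdict(float)
    for order_id, _item, price in items:
        totals[order_id] += price
    return dict(totals)

totals = order_totals(line_items)
print(totals)
# The reverse is not possible: a total of 5.75 does not reveal
# which items, or how many, made up the order.
```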

The final problem listed in Figure 12-5 concerns the issue of too much data. We can have an excess of columns, rows, or both. To illustrate the problem of too many columns (a synonym for attributes), suppose that we want to know the attributes that influence customers' responses to a promotion. Between customer data stored within the organization and customer data that can be purchased, we might have a hundred or more different attributes, or columns, to consider. How do we select among them? Because of a phenomenon called the curse of dimensionality, the more attributes there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor. For this and other reasons, the number of attributes should be reduced, and one of the major activities in data mining concerns the efficient and effective selection of variables.
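One facet of this phenomenon can be demonstrated with a small simulation, offered here only as an illustrative sketch. Every attribute below is pure random noise with no relationship to the response, yet when enough attributes are tried against a small sample, one of them will usually appear to be strongly correlated. The function names, sample size, and attribute counts are arbitrary assumptions.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def best_spurious_fit(n_rows, n_attributes, seed=0):
    """Return the strongest correlation found among purely random attributes."""
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_attributes):
        attribute = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(attribute, response)))
    return best

# Same 10 sample rows, but more and more noise attributes to choose from.
few = best_spurious_fit(n_rows=10, n_attributes=5)
many = best_spurious_fit(n_rows=10, n_attributes=200)
print(f"best of 5 attributes:   {few:.2f}")
print(f"best of 200 attributes: {many:.2f}")
```

Because the attributes are noise, any apparent fit would vanish on new data, which is the sense in which such a model is worthless as a predictor.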

Finally, we may have too many instances, or rows, of data. Suppose that we want to analyze click-stream data on CNN.com. How many clicks does this site receive per month? Millions upon millions! To meaningfully analyze such data, we need to reduce the number of instances. A good solution to this problem is statistical sampling. However, developing a reliable sample requires specialized expertise and information system tools.
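As one illustration, the sketch below draws a simple random sample of fixed size from a stream of clicks without holding the whole stream in memory, using the technique known as reservoir sampling. It is a minimal example under the assumption that a simple random sample is appropriate; real studies may call for stratified or other sampling designs, which is where the specialized expertise comes in. The function name and the stand-in data are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Stand-in for a month of click records: one million click numbers.
click_stream = range(1_000_000)
sample = reservoir_sample(click_stream, k=1000, seed=42)
print(len(sample))
```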

Purchasing Data from Vendors

Data warehouses often include data that are purchased from outside sources. A typical example is customer credit data. Figure 12-6 lists some of the consumer data that can be purchased from the KBM Group in their AmeriLINK database of consumer data (www.kbmg.com/services-expertise/data/data-sourcing/datacard-search-and-listings/). An amazing, and from a privacy standpoint frightening, amount of data is available just from this one vendor.

Data Warehouses Versus Data Marts

You can think of a data warehouse as a distributor in a supply chain. The data warehouse takes data from the data manufacturers (operational systems and purchased data), cleans and processes them, and locates the data on the shelves, so to speak, of the data warehouse. The people who work in a data warehouse are experts at data management, data cleaning, data transformation, and the like. However, they are not usually experts in a given business function.

A data mart is a collection of data that is smaller than that in the data warehouse and that addresses a particular component or functional area of the business. A data mart is like a retail store in a supply chain. Users in the data mart obtain data that pertain to a particular
