identified using the techniques discussed in Chapter 6. Many of the shortfalls
of the current data are found during data profiling and/or detailed data
analysis performed while data modeling. If the data is not captured anywhere,
then it cannot be included in the database. Additional data challenges are
discovered during the development of the ETL system.
It takes a lot of detailed, tedious work to track down and resolve all of these
individual data issues. It is important to ensure that issues are well understood
so that decisions can be made about how to deal with problems that arise. In
some cases, getting to the bottom of the problem itself may take a lot more
research. A decision must be made whether to work on the problem or to
postpone it for the future. This must be a joint business and technical decision.
Some problems can be put off with little or no immediate impact, but some
data issues must be resolved in order to meet the overall objectives of the
project.
For example, suppose the organization has been collecting customer
demographic data for years. When customers call in, they are asked if they are
willing to complete a short survey. This short survey collects additional
demographics about each customer household. While it sounds interesting to use
for analysis, most customers did not participate, so only 15% of the customers
have any data. To make matters worse, the entry screens required the answers
to be keyed in, rather than using a set list of options, so the data that has been
collected has many different values and will require a lot of cleaning to make
it useful. The question at hand is whether this is worth the effort.
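One way to put numbers behind that question is a quick profile of the survey
column. The sketch below is illustrative only: the function name, the sample
values, and the trim-and-lowercase normalization are assumptions, not part of
any particular profiling tool. It reports how many customers have an answer
and how many distinct values remain before and after a crude cleanup.

```python
from collections import Counter


def profile_column(values):
    """Summarize fill rate and value variety for one free-text column."""
    total = len(values)
    # Treat None and whitespace-only strings as missing answers.
    filled = [v for v in values if v is not None and v.strip()]
    raw_counts = Counter(filled)
    # Crude normalization (an assumption): trim whitespace and lowercase.
    normalized_counts = Counter(v.strip().lower() for v in filled)
    return {
        "total": total,
        "filled": len(filled),
        "fill_rate": len(filled) / total if total else 0.0,
        "distinct_raw": len(raw_counts),
        "distinct_normalized": len(normalized_counts),
        "top_values": normalized_counts.most_common(3),
    }


# Hypothetical sample: 20 customers, 3 of whom answered the survey.
sample = [None] * 17 + ["Own", "own ", "Rent"]
report = profile_column(sample)
print(report)
```

A large gap between the raw and normalized distinct counts suggests that much
of the cleanup is mechanical. A small gap suggests the values differ in
substance and will need manual mapping, which raises the cost of the effort.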
Because the demographic analysis is not an immediate priority, and the
work required is significant, the cleanup is postponed to a subsequent iteration.
In the meantime, a better data solution is to modify the survey entry screen
to capture pre-set options so that the data is consistent. In addition, the top
five most important questions need to be included in the initial conversation
with the customer, rather than as an optional survey. These decisions need
to be based on a cost-benefit analysis — not a multi-week effort, but simply a
checkpoint to ensure that resources are used wisely to deliver the most value
in a timely manner.
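The pre-set options idea can be sketched as a small validation step at entry
time. Everything named here is hypothetical: the option list, the question it
belongs to, and the function name are assumptions chosen for illustration, not
details of the entry screen described above.

```python
# Hypothetical set list for one survey question.
HOME_OWNERSHIP_OPTIONS = ("Own", "Rent", "Other")


def validate_answer(answer, options=HOME_OWNERSHIP_OPTIONS):
    """Return the canonical option, or raise ValueError if it is not listed."""
    if answer is None:
        raise ValueError("an answer is required")
    cleaned = answer.strip().lower()
    for option in options:
        # Compare case-insensitively, but always store the canonical spelling.
        if option.lower() == cleaned:
            return option
    raise ValueError(f"{answer!r} is not one of {options}")


print(validate_answer(" own "))
```

Rejecting unlisted answers when they are keyed in keeps the stored values
consistent, so later iterations do not inherit another cleanup effort.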
Often, the data problems identified when working on a data warehouse
project are data quality problems in the underlying source systems and/or
business processes. It is important to dig down to find out the root cause of
data quality problems. Then decisions can be made to prevent the problems
from recurring.
Discovering the Flaws in Your Current Systems
As a by-product of the detailed work that is done to extract and then transform
the data, many anomalies and unusual data handling and storage techniques
are uncovered. Sometimes fundamental flaws are identified regarding how