Database Reference
In-Depth Information
modeling of data for warehousing is a specialized skill that requires a great deal of
training and domain knowledge. The main point here is that it takes a lot of effort to
gather and process operational data into a form that satisfies the dimensional hierarchy
of the data warehouse. Furthermore, these structures must be well planned, as there
can be a significant cost to changes in the schemas used in data warehousing software.
Data Warehousing's Magic Words: Extract, Transform, and Load
Facilitating the movement, processing, and storage from operational databases into a
data warehouse often involves a collection of tasks known as Extract, Transform, and
Load or, more commonly, ETL . Like most of the terms in this chapter, this phrase
belies the underlying complexity of the actual process.
ETL processes encompass more than just copying data from one place to another.
Data must be extracted out of existing data stores. Once the raw data is available, it
must be cleaned and validated, meaning that data must meet a standard of integrity.
Erroneous data must be corrected or deleted. The application of business rules is
important as well. Is there private customer information that must be obfuscated or
deleted before moving it from one place to another? Once all of these possibly com-
plex steps are taken, data is then loaded into tables in the data warehousing application.
Generally, these processes are automated, but in large organizations a great deal of
effort must be applied to ensure that data integrity is consistent across each step in an
ETL pipeline.
Transforming and stuffing the data necessary to make business decisions often runs
into a problem when data grows large. As we've noted before, moving data across the
network can be slow, and as data sources grow, ETL processes can become increasingly
complicated.
Data warehousing is also not necessarily a great tool for asking questions quickly.
First of all, the entire ETL pipeline must be run before data can be queried. What
happens when a new dimension of the table is required? Because data warehous-
ing is so front heavy, changing the data model used may require a lot of effort. The
ETL process doesn't lend itself to constant change. Data is pulled from a variety of
sources—a large mainframe server, an online relational database, and even data created
in spreadsheets from the accounting department. If these data sources are interrupted
or changed, the automated ETL processes could fail.
Hadoop: The Elephant in the Warehouse
Traditional data warehousing technologies were born in a pre-Internet world. The
growth of the Internet has resulted in the exponential growth of several generations'
worth of Web, mobile, and sensor data. If the concepts of data warehousing seem to
represent an ideal but unattainable goal, then new technologies in data analytics are
often portrayed as disruptive to the concepts of the data warehouse.
 
 
Search WWH ::




Custom Search