Strategies for Dealing with Data Silos - Data Just Right: Introduction to Large-Scale Data and Analytics

Database Reference

In-Depth Information

of aggregated information, used primarily for reporting and analysis. Although the

various operational data systems are dealing with real-time transactions, such as pur-

chases and customer updates, a data warehousing solution is designed as a way to over-

come the problems created by these data silos.

Data warehouses are often described as subject oriented , because the design of the

data structures used is centered around asking questions about a particular subject or

concept. In other words, an operational database is designed to make sure that a cus-

tomer's contact information is correct and up to date. A data warehouse is concerned

with questions such as, “Over a certain period of time, which customers have made

in-store purchases and online purchases?”

Data warehouse tools are not meant to handle day-to-day updates of customer data.

An operational database might be consistently updated at any time. Meanwhile, a data

warehouse might be used to append a daily snapshot of data for analysis. As an exam-

ple, think of an airline reservation system. As customers make changes to their plans,

the reservation system is updated in real time accordingly. However, at the end of

each business day, a process might be defined that appends a snapshot of this data into

a warehouse application. Other data, perhaps from the airline's separate loyalty card

database, might be appended into the data warehouse as well. The next day, queries

can be run across this aggregated data to inform the airline about trends in frequent-

f lyer points over time.

Another basic requirement of a data warehouse is obviously to provide the ability to

build reports and ask questions about data. In order to pull in data from many different

sources and to quickly run queries, the data structure used in modern large-scale data

warehousing software requires careful modeling. One of the simplest ways to represent

data in a warehouse system is by using a star schema , so named because the diagram

of this schema resembles the points of a star. A record of data in a star schema is some-

what denormalized , meaning that data is replicated in various tables. In its simplest

form, a star schema represents data by using two types of tables: a fact table and one or

more dimension tables. This schema is best illustrated with an example. Let's consider

a single record of a customer transaction. In a star schema, this type of data might be

represented by a single fact table containing records of the customer ID, a product

ID, a cost, and a time of purchase. Within an organization, there's more to the data

than just this basic information. Each customer will also have associated contact infor-

mation, including an address, a phone number, and an email address. The product sold

has other metadata associated with it, as well, including its supplier information. These

additional types of data would be stored in dimension tables , with a key correspond-

ing to the fact table information. The additional information found in the dimension

tables completes the entire picture of data that encompasses the subject of a customer

purchase. The use of star schemas, and more complex data-modeling structures like

them, is meant to make querying as efficient as possible without the excessive joins

that might be found in a standard relational database.

The design and implementation of this type of schema, as well as more compli-

cated versions of this concept, goes well beyond the scope of this topic. In practice,

Data Just Right: Introduction to Large-Scale Data and Analytics

Search WWH ::

Custom Search

Home