Database Reference
In-Depth Information
for analytics. There is not one particular tool—or class of tools—that solves the prob-
lem of aggregating all the data in a company, and every company has unique problems
that make a general solution difficult. Nonetheless, there exists a huge industry of
solution providers and the requisite tech media to discuss the pros and cons of various
attempts at providing solutions.
Although the term “data warehouse” might mean different things in different con-
texts, the goal of providing a single unified place for organizational data is a common
use case. Organizations need ways to ask questions about large datasets that are afford-
able and timely and often need to be able to deal with data from a variety of sources,
each with different schemas and growth patterns. A common source of data comes
from operational data stores, which are often relational databases that live on the
frontlines of the organization, collecting and processing customer data at all times.
These databases are optimized to handle transactions such as the creation and updat-
ing of individual records but not to run large queries over the entire database. Opera-
tional data stores are often the canonical source of data for an organization, a source of
truth that changes from moment to moment. Moving snapshots of this data to a data
warehouse can provide insight into sales and advertising trends. Operational data isn't
the only source organizations are interested in analyzing. There is an ever-growing
amount of unstructured organizational data, such as the text from social media, blog
posts, and emails.
Analysts need to understand how data from all of these different sources is related.
For example, you might want to explore possible statistical relationships between sig-
nals in social media and sales of a particular item. One way to accomplish this goal
is to pipe all organizational data into a central relational database such as MySQL,
Microsoft SQL Server, Oracle, or any of the available commercial or open-source
products. This secondary data store can be maintained indefinitely as an organization
data record. In other cases, it might be used simply as a means to analyze data that has
been aggregated.
The process of taking data from the operational databases and converting it into a
form useful for aggregation in a data warehouse is often called ETL , which stands for
“extract, transform, and load.” Some data needs to be normalized , meaning variables
and some data types require conversion into other types. Building ETL pipelines can
be a complicated task, and there are many products available that are specialized for
this purpose.
There are a few drawbacks to this “traditional” approach to data analytics. The first
is simply cost. The market for many commercial data warehousing solutions is often
large enterprise customers, and can be expensive for many smaller organizations. Once
data is aggregated, analysts want to be able to ask questions about data in the ware-
house. However, relational databases are usually not the fastest way to query when data
sizes grow. When data sizes are relatively small, aggregate queries run on well-tuned
relational databases with proper indexing can be screamingly fast. However, no matter
how powerful the machine, relational databases by nature are ultimately challenging to
deal with as data sizes grow very large. Running complicated aggregate queries over
 
Search WWH ::




Custom Search