Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets - Data Just Right: Introduction to Large-Scale Data and Analytics

Database Reference

In-Depth Information

for analytics. There is not one particular tool—or class of tools—that solves the prob-

lem of aggregating all the data in a company, and every company has unique problems

that make a general solution difficult. Nonetheless, there exists a huge industry of

solution providers and the requisite tech media to discuss the pros and cons of various

attempts at providing solutions.

Although the term “data warehouse” might mean different things in different con-

texts, the goal of providing a single unified place for organizational data is a common

use case. Organizations need ways to ask questions about large datasets that are afford-

able and timely and often need to be able to deal with data from a variety of sources,

each with different schemas and growth patterns. A common source of data comes

from operational data stores, which are often relational databases that live on the

frontlines of the organization, collecting and processing customer data at all times.

These databases are optimized to handle transactions such as the creation and updat-

ing of individual records but not to run large queries over the entire database. Opera-

tional data stores are often the canonical source of data for an organization, a source of

truth that changes from moment to moment. Moving snapshots of this data to a data

warehouse can provide insight into sales and advertising trends. Operational data isn't

the only source organizations are interested in analyzing. There is an ever-growing

amount of unstructured organizational data, such as the text from social media, blog

posts, and emails.

Analysts need to understand how data from all of these different sources is related.

For example, you might want to explore possible statistical relationships between sig-

nals in social media and sales of a particular item. One way to accomplish this goal

is to pipe all organizational data into a central relational database such as MySQL,

Microsoft SQL Server, Oracle, or any of the available commercial or open-source

products. This secondary data store can be maintained indefinitely as an organization

data record. In other cases, it might be used simply as a means to analyze data that has

been aggregated.

The process of taking data from the operational databases and converting it into a

form useful for aggregation in a data warehouse is often called ETL , which stands for

“extract, transform, and load.” Some data needs to be normalized , meaning variables

and some data types require conversion into other types. Building ETL pipelines can

be a complicated task, and there are many products available that are specialized for

this purpose.

There are a few drawbacks to this “traditional” approach to data analytics. The first

is simply cost. The market for many commercial data warehousing solutions is often

large enterprise customers, and can be expensive for many smaller organizations. Once

data is aggregated, analysts want to be able to ask questions about data in the ware-

house. However, relational databases are usually not the fastest way to query when data

sizes grow. When data sizes are relatively small, aggregate queries run on well-tuned

relational databases with proper indexing can be screamingly fast. However, no matter

how powerful the machine, relational databases by nature are ultimately challenging to

deal with as data sizes grow very large. Running complicated aggregate queries over

Data Just Right: Introduction to Large-Scale Data and Analytics

Search WWH ::

Custom Search

Home