Advanced Analytics—Technology and Tools: In-Database Analytics - Data Science and Big Data Analytics


The concept of Magnetic/Agile/Deep (MAD) analysis skills was introduced in a

2009 paper by Cohen et al. [3]. That paper describes the components of MAD as

follows:

• Magnetic: Traditional Enterprise Data Warehouse (EDW) approaches

“repel” new data sources, discouraging their incorporation until they are

carefully cleansed and integrated. Given the ubiquity of data in modern

organizations, a data warehouse can keep pace today only by being

“magnetic”: attracting all the data sources that crop up within an

organization regardless of data quality niceties.

• Agile: Data Warehousing orthodoxy is based on long-range and careful

design and planning. Given growing numbers of data sources and

increasingly sophisticated and mission-critical data analyses, a modern

warehouse must instead allow analysts to easily ingest, digest, produce,

and adapt data rapidly. This requires a database whose physical and

logical contents can be in continuous rapid evolution.

• Deep: Modern data analyses involve increasingly sophisticated statistical

methods that go well beyond the rollups and drilldowns of traditional

business intelligence (BI). Moreover, analysts often need to see both the

forest and the trees in running these algorithms; they want to study

enormous datasets without resorting to samples and extracts. The modern

data warehouse should serve both as a deep data repository and as a

sophisticated algorithmic runtime engine.

In response to the inability of a traditional EDW to readily accommodate new

data sources, the concept of a data lake has emerged. A data lake represents an

environment that collects and stores large volumes of structured and unstructured

datasets, typically in their original, unaltered forms. More than a data depository,

the data lake architecture enables the various users and data science teams to

conduct data exploration and related analytical activities. Apache Hadoop is often

considered a key component of building a data lake [4].

Because MADlib is designed and built for massively parallel processing (MPP)

of data, it is well suited to Big Data in-database analytics. MADlib supports the

open-source database PostgreSQL as well as the Pivotal Greenplum Database and

Pivotal HAWQ. HAWQ is a SQL query engine for data stored in the Hadoop

Distributed File System (HDFS). Apache Hadoop and the Pivotal products were

described in Chapter 10, “Advanced Analytics—Technology and Tools: MapReduce

and Hadoop.”
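
The core idea of in-database analytics, running the computation where the data lives instead of exporting samples or extracts, can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL or Greenplum, it does not call MADlib, and the table name `sales`, its columns, and the helper `fit_line_in_database` are hypothetical. The database computes the aggregates over every row, and only five numbers leave it.

```python
import sqlite3


def fit_line_in_database(conn, table, x_col, y_col):
    """Fit y = intercept + slope * x by ordinary least squares.

    The sums are computed by the database engine over the full table;
    the client only combines the five aggregate values it receives.
    Identifiers are interpolated directly, so they must come from
    trusted code, never from user input.
    """
    query = f"""
        SELECT COUNT(*),
               SUM({x_col}),
               SUM({y_col}),
               SUM({x_col} * {x_col}),
               SUM({x_col} * {y_col})
        FROM {table}
    """
    n, sx, sy, sxx, sxy = conn.execute(query).fetchone()
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data: revenue = 3 + 2 * ad_spend, stored in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(float(x), 3.0 + 2.0 * x) for x in range(1, 101)],
)

slope, intercept = fit_line_in_database(conn, "sales", "ad_spend", "revenue")
print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

On an MPP database the same pattern would let each segment compute partial sums in parallel before they are combined, which is the kind of workload a library such as MADlib is described as targeting.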
