The concept of Magnetic/Agile/Deep (MAD) analysis skills was introduced in a
2009 paper by Cohen et al. [3]. The paper describes the components of MAD as
follows:
Magnetic: Traditional Enterprise Data Warehouse (EDW) approaches
“repel” new data sources, discouraging their incorporation until they are
carefully cleansed and integrated. Given the ubiquity of data in modern
organizations, a data warehouse can keep pace today only by being
“magnetic”: attracting all the data sources that crop up within an
organization regardless of data quality niceties.
Agile: Data Warehousing orthodoxy is based on long-range and careful
design and planning. Given growing numbers of data sources and
increasingly sophisticated and mission-critical data analyses, a modern
warehouse must instead allow analysts to easily ingest, digest, produce,
and adapt data rapidly. This requires a database whose physical and
logical contents can be in continuous rapid evolution.
Deep: Modern data analyses involve increasingly sophisticated statistical
methods that go well beyond the rollups and drilldowns of traditional
business intelligence (BI). Moreover, analysts often need to see both the
forest and the trees in running these algorithms; they want to study
enormous datasets without resorting to samples and extracts. The modern
data warehouse should serve both as a deep data repository and as a
sophisticated algorithmic runtime engine.
In response to the inability of a traditional EDW to readily accommodate new
data sources, the concept of a data lake has emerged. A data lake represents an
environment that collects and stores large volumes of structured and unstructured
datasets, typically in their original, unaltered forms. More than a data depository,
the data lake architecture enables the various users and data science teams to
conduct data exploration and related analytical activities. Apache Hadoop is often
considered a key component of building a data lake [4].
Because MADlib is designed and built for massively parallel processing of data,
it is well suited to Big Data in-database analytics. MADlib supports the
open-source database PostgreSQL as well as the Pivotal Greenplum Database and
Pivotal HAWQ. HAWQ is a SQL query engine for data stored in the Hadoop
Distributed File System (HDFS). Apache Hadoop and the Pivotal products were
described in Chapter 10, “Advanced Analytics—Technology and Tools: MapReduce
and Hadoop.”
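
The idea behind in-database analytics, running the computation where the data lives rather than extracting it first, can be illustrated with a minimal sketch. This is not MADlib's API: it uses Python's built-in sqlite3 module as a stand-in database, and the table name points and the function fit_line_in_database are hypothetical names chosen for illustration. The SQL statement computes the sufficient statistics for a simple linear regression inside the database, so only five numbers ever leave it.

```python
import sqlite3


def fit_line_in_database(conn):
    """Fit y = intercept + slope * x from SQL aggregates.

    The database scans the rows and returns five summary values;
    the raw data is never pulled into the client.
    """
    n, sx, sy, sxy, sxx = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x) FROM points"
    ).fetchone()
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical sample data lying exactly on y = 2x + 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
)

slope, intercept = fit_line_in_database(conn)
print(slope, intercept)
```

In a massively parallel database, aggregates of this kind can be computed on each data segment and then combined, which is one reason such a design scales to large tables.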