Some developers believe that new technologies, such as Hadoop, can be
used for a multitude of tasks. From a batch and integration perspective, the
Big Data world is characterized by a wide range of approaches and disciplines
built around Big Data technologies. This leads to a “build mentality,” which assumes
that everything can be built around the new technology. If you think back
to when the data warehouse industry was in its infancy, many IT professionals
attempted to build in-house integration capabilities. Few would do that
today, because mature information integration technologies exist. The same
pattern is playing out in Hadoop, with some believing that it should be the
sole component for integration or transformation workloads.
For example, some folks propose using Hadoop alone to prepare data for a
data warehouse; this is generally referred to as extract, transform, and load
(ETL). But there's a huge gap between a general-purpose tool and a
purpose-built one, and integration involves many aspects beyond the
transformation of data, such as extraction, discovery, profiling, metadata,
data quality, and delivery. Organizations shouldn't use Hadoop solely for
integration; rather, they should leverage mature data integration technologies to help
speed their deployments of Big Data. New technologies such as Hadoop
will be adopted into data integration; for example, in an ELT-style integration
(where the T may be performed by stored procedures in a data warehouse),
organizations may look to use Hadoop for the transformation processing, as
sketched below. We think you'll find the need to use Hadoop engines as part of
an ETL/ELT strategy, but you will also greatly benefit from the flexibility of
a fit-for-purpose transformation engine, a massively parallel integration
engine that supports multiple transformation and load requirements, integration
into common run-time environments, and a common design palette, all
provided by a product such as InfoSphere Information Server (IIS). In
fact, this product's parallel processing engine and end-to-end integration
and quality capabilities yield a significant total cost of ownership advantage
over alternative approaches.
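To make the ELT pattern concrete, the following is a minimal Python sketch; it assumes data has already been extracted and loaded into a Hive table on a Hadoop cluster, and that the pyhive package is available. The host name and the raw_orders/orders_clean table and column names are hypothetical illustrations, not details from IIS or this text.

# ELT sketch: the data is already Loaded into Hadoop (Hive), and the
# Transform step is pushed to the cluster as a query instead of being
# executed row by row in an external ETL engine.
# Assumes the pyhive package; host and table/column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hadoop-gateway.example.com", port=10000)
cursor = conn.cursor()

# The "T" of ELT runs inside Hadoop: cleanse and aggregate in place,
# producing a warehouse-ready table rather than streaming raw rows out.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS orders_clean AS
    SELECT customer_id,
           SUM(amount)   AS total_amount,
           MAX(order_ts) AS last_order_ts
    FROM raw_orders
    WHERE amount IS NOT NULL
    GROUP BY customer_id
""")

The point of the sketch is that only the query travels to the cluster; the heavy lifting happens where the data lives, which is exactly the division of labor the ELT pattern exploits.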
For example, if the transformation is full of SQL operations, IIS can push
those operations down into an IBM PureData System for Analytics appliance
(formerly known as Netezza). Your integration platform should be able
not just to automatically generate jobs to run on a Hadoop infrastructure or
an ETL parallel engine as required, but also to manage them with a common job
sequencer. IIS includes connectors into Hadoop and a Big Data file stage.