Application Architectures for Big Data and Analytics - Big Data Imperatives


language designed specifically for Hadoop. Hive is an open source data warehouse

originally developed by Facebook that allows for analytic modeling within Hadoop.

Hadoop: The Pros and Cons

The main benefit of Hadoop is that it allows enterprises to process and analyze large

volumes of unstructured and semi-structured data, heretofore inaccessible to them,

in a cost- and time-effective manner. Because Hadoop clusters can scale to petabytes

and even exabytes of data, enterprises no longer need to rely on sample data sets but can

process and analyze all relevant data. Data scientists can apply an iterative approach to

analysis, continually refining and testing queries to uncover previously unknown insights.

It is also inexpensive to get started with Hadoop. Developers can download the Apache

Hadoop distribution for free and begin experimenting with Hadoop in less than a day.

The downside to Hadoop and its myriad components is that they are immature

and still developing. As with any young technology, implementing and managing

Hadoop clusters and performing advanced analytics on large volumes of unstructured

data requires significant expertise, skill, and training. Unfortunately, there is currently

a dearth of Hadoop developers and data scientists available, making it impractical for

many enterprises to maintain and take advantage of complex Hadoop clusters. Further,

as the community improves upon Hadoop's myriad components and new components

are created, there is, as with any immature open source technology/approach, a risk of

forking. Finally, Hadoop is a batch-oriented framework, meaning it does not support

real-time data processing and analysis.
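To make "batch-oriented" concrete, the following is a minimal sketch of the map, shuffle, and reduce phases that Hadoop's classic MapReduce model applies to a complete input set. It is plain Python, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, run_batch_job) are hypothetical and chosen only for illustration. The point it shows is that no result is available until the whole input has passed through every phase.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def run_batch_job(records):
    """Run the full job; output exists only after all input is consumed."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(run_batch_job(["big data", "Big Hadoop"]))
```

Because the reduce step needs every value for a key before it can produce a total, a record that arrives after the job starts is not reflected until the next run, which is why a batch framework is a poor fit for real-time analysis.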

The Hadoop Suitability Test

The Hadoop ecosystem plays an integral part in any big data implementation. Despite

this preference, not every enterprise use case requires Hadoop. How can we

objectively assess the suitability of Hadoop for a business problem?

Below we have outlined a few guiding principles to help assess the

appropriateness of a Hadoop implementation for a given business problem.

• Data Volume Consideration: historical as well as incremental, on a

scale of GB, TB, PB, or EB

• Data Type Consideration: structured, semi-structured,

unstructured

• Data Integration and Interaction Mode Consideration: batch,

near real-time, real-time

• Data Ingestion Pattern Consideration: streaming, non-event-driven

• Data Design Consideration: local, distributed, centralized

• Data Modeling Consideration: ER, normalized, de-normalized

• Data Access and Data Manipulation Consideration: SQL,

NoSQL
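The considerations above can be captured as a simple checklist. The sketch below is one possible way to record them, not a method from the text: the WorkloadProfile class, its field names, the hadoop_fit_signals function, and the 1 TB volume cutoff are all assumptions made for illustration. Only three considerations are scored, because those are the ones the preceding discussion ties directly to Hadoop's strengths and limits (large volumes, unstructured or semi-structured data, batch processing); the remaining fields are recorded but left for human judgment.

```python
from dataclasses import dataclass

# Assumed cutoff for "large" data; the text gives no specific threshold.
VOLUME_THRESHOLD_TB = 1.0


@dataclass
class WorkloadProfile:
    """Hypothetical record of one business problem, one field per consideration."""
    volume_tb: float                      # historical plus incremental volume, in TB
    data_types: set                       # e.g. {"structured", "unstructured"}
    interaction_mode: str                 # "batch", "near real-time", or "real-time"
    ingestion: str = "non-event-driven"   # or "streaming"
    design: str = "distributed"           # or "local", "centralized"
    modeling: str = "de-normalized"       # or "ER", "normalized"
    access: str = "NoSQL"                 # or "SQL"


def hadoop_fit_signals(profile):
    """Return, per scored consideration, whether it points toward Hadoop."""
    return {
        "volume": profile.volume_tb >= VOLUME_THRESHOLD_TB,
        "data_type": bool(
            profile.data_types & {"semi-structured", "unstructured"}
        ),
        "interaction_mode": profile.interaction_mode == "batch",
    }


if __name__ == "__main__":
    clickstream = WorkloadProfile(500.0, {"unstructured"}, "batch")
    print(hadoop_fit_signals(clickstream))
```

A result with mixed signals, for example large unstructured data that must be served in real time, is a prompt to look more closely rather than a verdict either way.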
