Databases Reference
In-Depth Information
language designed specifically for Hadoop. Hive is an open source data warehouse
originally developed by Facebook that allows for analytic modeling within Hadoop.
Hadoop: The Pros and Cons
The main benefit of Hadoop is that it allows enterprises to process and analyze large
volumes of unstructured and semi-structured data, heretofore inaccessible to them,
in a cost- and time-effective manner. Because Hadoop clusters can scale to petabytes
and even exabytes of data, enterprises no longer must rely on sample data sets but can
process and analyze all relevant data. Data scientists can apply an iterative approach to
analysis, continually refining and testing queries to uncover previously unknown insights.
It is also inexpensive to get started with Hadoop. Developers can download the Apache
Hadoop distribution for free and begin experimenting with Hadoop in less than a day.
The downside to Hadoop and its myriad components is that they are immature
and still developing. As with any young technology, implementing and managing
Hadoop clusters and performing advanced analytics on large volumes of unstructured
data requires significant expertise, skill, and training. Unfortunately, there is currently
a dearth of Hadoop developers and data scientists available, making it impractical for
many enterprises to maintain and take advantage of complex Hadoop clusters. Further,
as the community improves upon Hadoop's myriad components and new components
are created, there is, as with any immature open source technology/approach, a risk of
forking. Finally, Hadoop is a batch-oriented framework, meaning it does not support
real-time data processing and analysis.
The Hadoop Suitability Test
Hadoop ecosystem plays an integral part in any big data implementation. Despite the
preference, not all enterprise use cases necessitate Hadoop as a must. How can we
objectively assess the suitability of Hadoop to a business problem?
Below we have outlined few guiding principles to help in assessing the
appropriateness of a Hadoop implementation with respect to the business problem.
Data Volume Consideration: Historical as well Incremental in a
scale of GB, TB, PB, EB
Data Type Consideration: structured, semi-structured,
unstructured
Data Integration and Interaction Mode Consideration: batch,
near real-time, real-time
Data Ingestion Pattern Consideration: streaming, non-event-driven
Data Design Consideration: local, distributed, centralized
Data Modeling Consideration: ER, normalized, de-normalized
Data Access and Data Manipulation Consideration: SQL,
NoSQL
 
Search WWH ::




Custom Search