CHAPTER 4
Introducing Big Data Technologies
The first rule of any technology used in a business is that automation applied to an efficient operation will
magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the
inefficiency.
—Bill Gates
INTRODUCTION
The first three chapters introduced Big Data, the complexities associated with it, and the techniques
for processing it. This chapter focuses on the technologies available today that have been architected
and developed to process Big Data, and on the different architectures that can be adopted for
processing vast amounts of data. No single technology is covered in depth; instead, we provide concise
overviews of the different technologies, distributed data processing, and the Big Data processing
requirements you need to understand in order to select and implement the most appropriate Big Data
technologies and architecture for your organization. This chapter draws on several whitepapers and
the Apache Software Foundation website, as well as discussions with Hadoop teams at Cloudera and
Hortonworks. The author thanks all those who provided time for these discussions.
Processing Big Data involves complexities at several abstracted layers, as discussed in Chapter 3. We
can frame the challenge as a three-dimensional problem, the dimensions being the volume of the data
produced, the variety of formats, and the velocity of data generation. Handling any of these
dimensions in a traditional data processing architecture is not feasible. The problem did not
originate in the last decade; architects, researchers, and organizations have been working on it for
years. A simplified approach to large data processing was to create distributed data processing
architectures and manage the coordination through programming language techniques. This approach,
while solving the volume requirement,
could not handle the other two dimensions. With the advent of the Internet and search engines,
handling complex and diverse data became a necessity rather than a one-off requirement. It was during
this period, in the late 1990s and early 2000s, that a slew of distributed data processing papers and
associated algorithms and techniques were published by Google; Stanford University; Dr. Stonebraker;
Eric Brewer; Doug Cutting (Nutch search engine); and Yahoo, among others.
Today, the various architectures and papers contributed by these and other developers across the
world have culminated in several open-source projects under the Apache Software Foundation and the
NoSQL movement. These technologies, including Hadoop, Hive, HBase, Cassandra, and MapReduce, have
been identified as Big Data processing platforms. NoSQL platforms