Introducing Big Data Technologies - Data Warehousing in the Age of Big Data


CHAPTER 4

Introducing Big Data Technologies

The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency.

—Bill Gates

INTRODUCTION

The first three chapters introduced Big Data, the complexities associated with it, and the techniques for processing it. This chapter focuses on technologies that are available today and have been architected and developed to process Big Data, and on the different architectures that can be adopted for processing vast amounts of data. While no single technology is covered in depth, we have attempted to provide concise overviews of the different technologies, distributed data processing, and the Big Data processing requirements needed to select and implement the most appropriate Big Data technologies and architecture for your organization. We have referenced several whitepapers and the Apache Software Foundation website, in addition to discussions with Hadoop teams at Cloudera and Hortonworks. The author thanks all those who provided time for these discussions.

As discussed in Chapter 3, processing Big Data involves complexities at several abstracted layers. We can reduce these to a three-dimensional problem: the volume of data produced, the variety of formats, and the velocity of data generation. Handling any of these dimensions in a traditional data processing architecture is not feasible. The problem did not originate in the last decade; architects, researchers, and organizations have been working on it for years. A simplified approach to large-scale data processing was to create distributed data processing architectures and manage the coordination through programming language techniques. This approach solved the volume requirement but could not handle the other two dimensions. With the advent of the Internet and search engines, handling complex and diverse data became a necessity rather than a one-off requirement. It was during this period, from the late 1990s into the early 2000s, that a slew of distributed data processing papers and associated algorithms and techniques were published by Google; Stanford University; Dr. Stonebraker; Eric Brewer; Doug Cutting (Nutch Search Engine); and Yahoo, among others.

Today, the various architectures and papers contributed by these and other developers across the world have culminated in several open-source projects under the Apache Software Foundation and the NoSQL movement. These technologies, including Hadoop, Hive, HBase, Cassandra, and MapReduce, have been identified as Big Data processing platforms. NoSQL platforms
