Using Hadoop
Hadoop is for data processing. You may ask, "There are already MATLAB, R, Octave, Python (with NLTK and many other libraries for data analysis), and SAS, so why Hadoop?" They are great tools, but they are good for data that fits in memory. That means you can churn through a couple of gigabytes, maybe tens of gigabytes, and the rate of processing is bounded by the CPU of that single machine, perhaps 16 cores. This is a big restriction. At Internet scale, data is no longer measured in gigabytes. In the age of billions of mobile phones (there were an estimated 7.7 billion mobile subscriptions at the end of 2014, source: http://mobithinking.com/mobile-marketing-tools/latest-mobile-stats/a#subscribers), we generate humongous amounts of data every second (Twitter reports 143,199 tweets per second, source: http://dazeinfo.com/2014/04/29/7-7-billion-mobile-devices-among-7-1-billion-world-population-end-2014/) by checking in at places, tagging photos, uploading videos, commenting, messaging, purchasing, dining, running (fitness apps monitor our activities), and many other activities; we literally record these events somewhere. It does not stop at organic data generation.
A lot of data, far more than the organic kind, is generated by machines (http://en.wikipedia.org/wiki/Machine-generated_data). Web logs, financial market data, data from various sensors (including the ones in your cell phone), machine part data, and many others are examples. Health, genomics, and medical science have some of the most interesting big data corpora ready to be analyzed and inferred from. To get a sense of how big genetic data can be, consider the 1000 Genomes Project (http://www.1000genomes.org/). Its data is available for free (apart from storage charges) to be used by anyone. The genome data for (only) 1,700 individuals makes a corpus of 200 terabytes. It is doubtful that any conventional in-memory computation tool such as R or MATLAB can handle that. Hadoop helps you process data of that scale.
Hadoop is an example of distributed computing, so you can scale beyond a single computer. Hadoop virtualizes the storage and the processors: you can roughly treat a 10-machine Hadoop cluster as one machine with 10 times the processing power and 10 times the storage capacity of a single node. With multiple machines processing the data in parallel, Hadoop is a good fit for large unstructured datasets. It can help you clean data (data munging) and perform data transformations too, while HDFS provides redundant distributed data storage. Effectively, it can work as your extract, transform, and load (ETL) platform.
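To make the ETL idea concrete, here is a minimal sketch of a Hadoop Streaming job written in Python that cleans web log lines and counts requests per URL. The field position, file names, and HDFS paths are assumptions made for illustration; Hadoop Streaming itself only requires that the mapper and reducer read from standard input and write tab-separated key-value pairs to standard output.

    #!/usr/bin/env python
    # mapper.py -- emit one (url, 1) pair per valid log line.
    # Assumes a whitespace-separated log where the 7th field is the requested URL.
    import sys

    for line in sys.stdin:
        fields = line.strip().split()
        if len(fields) < 7:          # skip malformed lines (data munging)
            continue
        url = fields[6]
        print("%s\t%d" % (url, 1))

    #!/usr/bin/env python
    # reducer.py -- sum the counts for each URL.
    # Hadoop sorts mapper output by key, so identical URLs arrive consecutively.
    import sys

    current_url, current_count = None, 0
    for line in sys.stdin:
        url, count = line.rstrip("\n").split("\t", 1)
        if url == current_url:
            current_count += int(count)
        else:
            if current_url is not None:
                print("%s\t%d" % (current_url, current_count))
            current_url, current_count = url, int(count)
    if current_url is not None:
        print("%s\t%d" % (current_url, current_count))

A job like this would be submitted with the streaming JAR that ships with Hadoop, along the lines of: hadoop jar hadoop-streaming.jar -input /logs/raw -output /logs/url_counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py. Here /logs/raw and /logs/url_counts are hypothetical HDFS directories; the cluster splits the input across machines, runs the mapper on each split in parallel, and writes the reduced output back into HDFS.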