Chapter 6. Data Transfer
Data transfer deals with three important questions:
▪ How do you get data into a Hadoop cluster?
▪ How do you get data out of a Hadoop cluster?
▪ How do you move data from one Hadoop cluster to another?
In general, Hadoop is not a transactional engine, where data is loaded in small, discrete,
related bits of information, as it would be in an airline reservation system. Instead, data is
bulk loaded from external sources: flat files from sensors, bulk downloads from sources like
http://www.data.gov for U.S. federal government data, log files, or transfers from relational
systems.
The Hadoop ecosystem contains a variety of great tools for working with your data.
However, it's rare for your data to start or end in Hadoop. It's much more common to have a
workflow that starts with data from external systems, such as logs from your web servers,
and ends with analytics hosted on a business intelligence (BI) system.
Data transfer tools help move data between those systems. More specifically, data transfer
tools provide three basic capabilities:
File transfer
Tools like Flume (described here) and DistCp (described here) help move files and flat
text, such as log entries, into your Hadoop cluster (sketched briefly below).
Database transfer
Tools like Sqoop (described next) provide a simple mechanism for moving data between
traditional relational databases, such as Oracle or SQL Server, and your Hadoop cluster
(also sketched below).
Data triage
Tools like Storm (described here) can be used to quickly evaluate and categorize new
data as it arrives on your Hadoop cluster.
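
To give a flavor of the first two capabilities, here is a minimal sketch of a cluster-to-cluster
copy with DistCp and a table import with Sqoop. The host names, database, table, and user
(namenode1, namenode2, dbhost, sales, orders, and loader) are hypothetical placeholders for
your own environment:

    # Copy a directory of log files from one cluster to another with DistCp
    hadoop distcp hdfs://namenode1:8020/logs hdfs://namenode2:8020/logs

    # Import a relational table into HDFS with Sqoop
    # (-P prompts for the database password instead of putting it on the command line)
    sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username loader \
        -P \
        --table orders \
        --target-dir /data/orders

Both commands run from a machine with the Hadoop and Sqoop client tools installed; each
tool is described in more detail in the sections referenced above.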