Chapter 6. Data Transfer
Data transfer deals with three important questions:
▪ How do you get data into a Hadoop cluster?
▪ How do you get data out of a Hadoop cluster?
▪ How do you move data from one Hadoop cluster to another?
In general, Hadoop is not a transactional engine, where data is loaded in small, discrete,
related bits of information, as it would be in an airline reservation system. Instead, data is
bulk loaded from external sources: flat files from sensors, bulk downloads from sources like
http://www.data.gov for U.S. federal government data, log files, or transfers from relational
systems.
The Hadoop ecosystem contains a variety of great tools for working with your data.
However, it's rare for your data to start or end in Hadoop. It's much more common to have a
workflow that starts with data from external systems, such as logs from your web servers,
and ends with analytics hosted on a business intelligence (BI) system.
Data transfer tools help move data between those systems. More specifically, data transfer
tools provide three basic capabilities:
File transfer
Tools like Flume (described here) and DistCp (described here) help move files and flat
text, such as log entries, into your Hadoop cluster (sketched briefly below).
Database transfer
Tools like Sqoop (described next) provide a simple mechanism for moving data between
traditional relational databases, such as Oracle or SQL Server, and your Hadoop cluster
(also sketched below).
Data triage
Tools like Storm (described here) can be used to quickly evaluate and categorize new
data as it arrives on your Hadoop cluster.
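
To give a flavor of the first two capabilities, here is a minimal sketch of a cluster-to-cluster
copy with DistCp and a table import with Sqoop. The host names, database, table, and user
(namenode1, namenode2, dbhost, sales, orders, and loader) are hypothetical placeholders for
your own environment:

    # Copy a directory of log files from one cluster to another with DistCp
    hadoop distcp hdfs://namenode1:8020/logs hdfs://namenode2:8020/logs

    # Import a relational table into HDFS with Sqoop
    # (-P prompts for the database password instead of putting it on the command line)
    sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username loader \
        -P \
        --table orders \
        --target-dir /data/orders

Both commands run from a machine with the Hadoop and Sqoop client tools installed; each
tool is described in more detail in the sections referenced above.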