Chapter 6
Moving Data
The tools and methods you use to move big data within the Hadoop sphere depend on the type of data to be
processed. This is a large category with many data sources, such as relational databases, log data, binary data, and
real-time data, among others. This chapter focuses on a few common data types and discusses some of the tools you
can use to process them. For instance, in this chapter you will learn to use Sqoop to process relational database data,
Flume to process log data, and Storm to process stream data.
You will also learn how this software can be sourced, installed, and used. Finally, I will show how a sample data
source can be processed and how all of these tools connect to Hadoop. But I begin with an explanation of the Hadoop
file system commands.
Moving File System Data
You can use Hadoop file system commands to move file-based data into and out of HDFS. In all the examples in this
book that employ Hadoop file system commands, I have used either a simple file name (myfile.txt) or a file name with
a path (/tmp/myfile.txt). However, a file may also be specified as a Uniform Resource Identifier (URI). The URI contains
a scheme identifying the data source, the server name, and the file's path and name. For instance, of the two URIs that
follow, the first refers to a file on HDFS while the second refers to a file on the Linux file system. Both show that the files in question are located on the server hc1nn:
hdfs://hc1nn/user/hadoop/oozie_wf/fuel/pigwf/manufacturer.pig
file://hc1nn/tmp/manufacturer.pig
To indicate the data's source and destination for the move, each command accepts one or more URIs.
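For example, the put and get commands copy data into and out of HDFS, taking a source and a destination. The following sketch (the paths are illustrative and assume that the HDFS directory /user/hadoop/oozie_wf/fuel/pigwf already exists on hc1nn) copies manufacturer.pig from the Linux /tmp directory into HDFS and then retrieves it again:
hdfs dfs -put /tmp/manufacturer.pig hdfs://hc1nn/user/hadoop/oozie_wf/fuel/pigwf/
hdfs dfs -get hdfs://hc1nn/user/hadoop/oozie_wf/fuel/pigwf/manufacturer.pig /tmp/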
The Hadoop file system cat command (below) dumps the contents of the HDFS-based file manufacturer.pig to STDOUT (the standard output stream on Linux). The URI is the text in the line that starts at the string "hdfs" and ends with the file type (.pig):
hdfs dfs -cat hdfs://hc1nn/user/hadoop/oozie_wf/fuel/pigwf/manufacturer.pig
On the other hand, the cat command below dumps the Linux file system file flume_exec.sh to STDOUT (the standard output stream):
hdfs dfs -cat file:///home/hadoop/flume/flume_exec.sh
Although the scheme (file or hdfs) and the server name can be specified in the URI, they are optional. In this chapter I use only
file names and paths.
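For example, assuming HDFS is configured as the cluster's default file system, the earlier cat command can be written with just a path; this short form is equivalent to the full hdfs:// URI shown above:
hdfs dfs -cat /user/hadoop/oozie_wf/fuel/pigwf/manufacturer.pig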
Now, let's take a closer look at some of the most useful system commands.