Chapter 6
Moving Data
The tools and methods you use to move big data within the Hadoop sphere depend on the type of data to be
processed. This is a large category with many data sources, such as relational databases, log data, binary data, and
real-time data, among others. This chapter focuses on a few common data types and discusses some of the tools you
can use to process them. For instance, in this chapter you will learn to use Sqoop to process relational database data,
Flume to process log data, and Storm to process stream data.
You will also learn how this software can be sourced, installed, and used. Finally, I will show how a sample data
source can be processed and how all of these tools connect to Hadoop. But I begin with an explanation of the Hadoop
file system commands.
Moving File System Data
You can use Hadoop file system commands to move file-based data into and out of HDFS. In all the examples in this
book that employ Hadoop file system commands, I have used either a simple file name (myfile.txt) or a file name with
a path (/tmp/myfile.txt). However, a file may also be specified as a Uniform Resource Identifier (URI). The URI contains
a scheme identifying the data source, the server name, and the file's path and name. For instance, of the two URIs that
follow, the first refers to a file on HDFS while the second refers to a file on the Linux file system. Both show that the files in question are located on the server hc1nn:
hdfs://hc1nn/user/hadoop/oozie_wf/fuel/pigwf/manufacturer.pig
file://hc1nn/tmp/manufacturer.pig
To indicate the data's source and destination for the move, each command accepts one or more URIs.
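For example, the put and get commands copy data into and out of HDFS, taking a source and a destination. The following sketch (the paths are illustrative and assume that the HDFS directory /user/hadoop/oozie_wf/fuel/pigwf already exists on hc1nn) copies manufacturer.pig from the Linux /tmp directory into HDFS and then retrieves it again:
hdfs dfs -put /tmp/manufacturer.pig hdfs://hc1nn/user/hadoop/oozie_wf/fuel/pigwf/
hdfs dfs -get hdfs://hc1nn/user/hadoop/oozie_wf/fuel/pigwf/manufacturer.pig /tmp/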
The Hadoop file system cat command (below) dumps the contents of the HDFS-based file manufacturer.pig to STDOUT (the standard output stream on Linux). The URI is the text in the line that starts at the string "hdfs" and ends with the file type (.pig):
hdfs dfs -cat hdfs://hc1nn/user/hadoop/oozie_wf/fuel/pigwf/manufacturer.pig
On the other hand, the cat command below dumps the Linux file system file flume_exec.sh to STDOUT (the standard output stream):
hdfs dfs -cat file:///home/hadoop/flume/flume_exec.sh
Although the scheme (file or hdfs) and the server name can be specified in the URI, they are optional. In this chapter I use only
file names and paths.
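For example, assuming HDFS is configured as the cluster's default file system, the earlier cat command can be written with just a path; this short form is equivalent to the full hdfs:// URI shown above:
hdfs dfs -cat /user/hadoop/oozie_wf/fuel/pigwf/manufacturer.pig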
Now, let's take a closer look at some of the most useful system commands.