Snapshots
Snapshots are a handy feature of HDFS that stores a copy of the data as it existed at a particular instant in time. Common use cases include data backup, protection against user errors, and disaster recovery.
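As a quick sketch of the workflow (the directory path and snapshot name are hypothetical placeholders), a directory is first marked snapshottable by an administrator; snapshots are then created on demand and exposed under a read-only .snapshot subdirectory:
# Allow snapshots on a directory (administrator command).
$ hdfs dfsadmin -allowSnapshot /user/myuser/important_data
# Take a named snapshot of the directory's current contents.
$ hdfs dfs -createSnapshot /user/myuser/important_data before_cleanup
# Each snapshot appears under the read-only .snapshot path.
$ hdfs dfs -ls /user/myuser/important_data/.snapshot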
Details on HDFS snapshots are available on the Apache HDFS snapshots page:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
More technical details about the design of HDFS can be found on the Apache HDFS design web page:
http://hadoop.apache.org/docs/r0.18.3/hdfs_design.html
Loading and Processing Archived Data with HDFS and Hive
Sometimes companies archive some of their data to save space. Instead of discarding old and rarely used data, companies generally prefer archiving it in case they need the data in the future. Data can be archived using compression tools such as GNU Zip (GZIP). Loading compressed (deflated) or decompressed (inflated) data into HDFS is trivial. However, Hadoop cannot run multiple map tasks against a single GZIP file, because a GZIP file is treated as one indivisible object. During the deflation phase, GZIP replaces repeated text with back-references, so no part of the file can be decompressed without the data that precedes it; partitioning this data for processing would therefore produce invalid results, and Hadoop instead treats it as one file. Imagine the old file being on the order of hundreds of gigabytes: Hadoop or not, processing it as a single unit will take ages.
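Loading the archive itself really is a one-liner; the file name below is a hypothetical example:
# Copy the local GZIP archive into HDFS as-is (no decompression needed).
$ hadoop fs -put call_list.gz /user/myuser/old_data/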
Hadoop's recommended approach is a two-step process: (1) load the compressed data into HDFS, and (2) reload it from its HDFS source into a sequence file format. The recommended practice is to insert the data into a new Hive table that is stored as a SequenceFile. Hive (http://hive.apache.org/) is a distributed warehouse technology from Apache. It uses HDFS as its underlying storage system and provides a SQL-like query language, known as Hive-QL, through a command-line interface (CLI).
A SequenceFile can be split by Hadoop and distributed across multiple nodes (one split per Map task), whereas a GZIP file cannot be split. Let us walk through the steps required to accomplish this:
1. First, create an HDFS directory that our Hive table will use to store the compressed file. This is done using the Hadoop command:
$ hadoop fs -mkdir /user/myuser/old_data/calllist
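From here, the remaining steps take roughly the following shape in Hive-QL (the table names, columns, and field delimiter are hypothetical placeholders, not a prescribed schema): lay an external table over the directory holding the GZIP file, declare a second table with SequenceFile storage, and reload the data with an INSERT ... SELECT.
-- External table over the uploaded GZIP file; Hive reads gzipped
-- text transparently, but only as a single, unsplittable stream.
CREATE EXTERNAL TABLE call_list_gz (caller STRING, callee STRING, duration INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/myuser/old_data/calllist';

-- Target table backed by splittable SequenceFiles.
CREATE TABLE call_list_seq (caller STRING, callee STRING, duration INT)
STORED AS SEQUENCEFILE;

-- Reload; the output can now be processed by multiple Map tasks.
INSERT OVERWRITE TABLE call_list_seq SELECT * FROM call_list_gz;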