Snapshots
Snapshots are a handy feature of HDFS that stores a copy of the data as it existed at a particular instant in time. Common use cases include data backup, protection against user errors, and disaster recovery.
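As a quick sketch of the workflow (the directory path and snapshot name are hypothetical placeholders), a directory is first marked snapshottable by an administrator; snapshots are then created on demand and exposed under a read-only .snapshot subdirectory:
# Allow snapshots on a directory (administrator command).
$ hdfs dfsadmin -allowSnapshot /user/myuser/important_data
# Take a named snapshot of the directory's current contents.
$ hdfs dfs -createSnapshot /user/myuser/important_data before_cleanup
# Each snapshot appears under the read-only .snapshot path.
$ hdfs dfs -ls /user/myuser/important_data/.snapshot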
Details on HDFS snapshots are available on the Apache HDFS snapshots page:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
More technical details about the design of HDFS can be found on the Apache HDFS design web page:
http://hadoop.apache.org/docs/r0.18.3/hdfs_design.html
Loading and Processing Archived Data with HDFS and Hive
Sometimes companies archive some of their data to save space. Instead of discarding old and rarely used data, companies generally prefer archiving it in case they need the data in the future. Data can be archived using compression tools such as GNU Zip (GZIP). Loading compressed (deflated) or decompressed (inflated) data into HDFS is trivial. However, Hadoop cannot run multiple map tasks against a single GZIP file, because a GZIP file is treated as one indivisible object. During the deflation phase, GZIP replaces repeated text with back-references, so no part of the file can be decompressed without the data that precedes it; partitioning this data for processing would therefore produce invalid results, and Hadoop instead treats it as one file. Imagine the old file being on the order of hundreds of gigabytes: Hadoop or not, processing it as a single unit will take ages.
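Loading the archive itself really is a one-liner; the file name below is a hypothetical example:
# Copy the local GZIP archive into HDFS as-is (no decompression needed).
$ hadoop fs -put call_list.gz /user/myuser/old_data/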
Hadoop's recommended approach is a two-step process: (1) load the compressed data into HDFS, and (2) reload it from its HDFS source into a sequence file format. The recommended practice is to insert the data into a new Hive table that is stored as a SequenceFile. Hive (http://hive.apache.org/) is a distributed warehouse technology from Apache. It uses HDFS as its underlying storage system and provides a SQL-like query language, known as Hive-QL, through a command-line interface (CLI).
A SequenceFile can be split by Hadoop and distributed across multiple nodes (one split per Map task), whereas a GZIP file cannot be split. Let us walk through the steps required to accomplish this:
1. First, create an HDFS directory that our Hive table will use to store the compressed file. This is done using the Hadoop command:
$ hadoop fs -mkdir /user/myuser/old_data/calllist
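From here, the remaining steps take roughly the following shape in Hive-QL (the table names, columns, and field delimiter are hypothetical placeholders, not a prescribed schema): lay an external table over the directory holding the GZIP file, declare a second table with SequenceFile storage, and reload the data with an INSERT ... SELECT.
-- External table over the uploaded GZIP file; Hive reads gzipped
-- text transparently, but only as a single, unsplittable stream.
CREATE EXTERNAL TABLE call_list_gz (caller STRING, callee STRING, duration INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/myuser/old_data/calllist';

-- Target table backed by splittable SequenceFiles.
CREATE TABLE call_list_seq (caller STRING, callee STRING, duration INT)
STORED AS SEQUENCEFILE;

-- Reload; the output can now be processed by multiple Map tasks.
INSERT OVERWRITE TABLE call_list_seq SELECT * FROM call_list_gz;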