The Importance of Cloud Backup and Task Documentation
On a development team, a common user account in the cloud is often shared by multiple
developers working on the same project. Apart from the obvious advantage that everyone
can execute and monitor jobs in the cloud, sharing can have unwelcome consequences,
even with expert users and careful habits.
Accidentally deleting important processed data is one such consequence of a shared
account. Yet it is not always feasible to copy processed data somewhere else as a backup,
given the sheer size of some datasets and the network time required to complete the copy.
You can face exactly this situation when transforming data and loading it into a Hive
table. Hive is a distributed data warehouse technology from Apache; it is based on the
MapReduce framework and uses the Hadoop Distributed File System (HDFS) as its
underlying file system.
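As a minimal sketch of that workflow (all table names and paths here are hypothetical),
raw files sitting in HDFS can be exposed to Hive and transformed into a warehouse table
in two statements:

    # Expose the raw files in HDFS to Hive, then transform them into a table.
    hive -e "
      CREATE EXTERNAL TABLE raw_events (user_id STRING, event_time STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/raw_events';

      -- Transform while loading: convert UTC strings to a local time zone.
      CREATE TABLE processed_events STORED AS ORC AS
      SELECT user_id,
             from_utc_timestamp(event_time, 'America/Chicago') AS local_time
      FROM raw_events;
    "

The processed table is managed by Hive and lives in its warehouse directory on HDFS,
whereas raw_events merely points at the existing files.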
The decision that generally has to be made is whether to keep the unprocessed (raw) data
or to discard it. Not all data can be stored indefinitely, so the decision is governed by
three factors: the future relevance of the raw data, its size, and the time required to
archive it. We usually opt for compression and archival whenever there is even the
slightest chance that the data will be required in the future.
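One way to implement that policy is a Hadoop archive, sketched below with hypothetical
paths. A HAR file packs many small files into one (which relieves pressure on the
NameNode) but does not compress them, so raw text files are typically gzipped before or
while being written to HDFS:

    # Pack the raw directory into a single archive file under /backup.
    hadoop archive -archiveName raw_2014.har -p /data raw_events /backup

    # Verify that the archive is readable before trusting it as the only copy.
    hdfs dfs -ls har:///backup/raw_2014.har/raw_events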
Suppose that during an ongoing task you had to create a few intermediate external Hive
tables, which were then joined (per the task's criteria) to create another external Hive
table loaded with the processed data. An external table stores just the schema and the
location of the data; if the table is dropped, the underlying data stays intact and is
not removed. Now suppose another team member used the same shared account and
accidentally deleted the entire HDFS directory path that those tables were reading as
their input source. Having no knowledge of the Hive tables, and mistakenly assuming the
data could be discarded, the team member even cleared the HDFS trash directory.
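The sketch below (again with hypothetical names and paths) recreates the setup and shows
why the deletion was fatal: dropping an external table removes only its metastore entry,
while deleting its HDFS location removes the data itself:

    hive -e "
      -- Intermediate external tables over existing HDFS input directories.
      CREATE EXTERNAL TABLE staging_a (id STRING, val DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/staging/a';
      CREATE EXTERNAL TABLE staging_b (id STRING, ts STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/staging/b';

      -- Join the intermediate tables into the final table of processed data.
      CREATE EXTERNAL TABLE processed (id STRING, val DOUBLE, ts STRING)
      LOCATION '/data/processed';
      INSERT OVERWRITE TABLE processed
      SELECT a.id, a.val, b.ts
      FROM staging_a a JOIN staging_b b ON (a.id = b.id);
    "

    # DROP TABLE staging_a; would leave /data/staging/a intact, but this
    # destroys the input data; -skipTrash makes it unrecoverable, the
    # command-line equivalent of emptying the trash directory:
    hdfs dfs -rm -r -skipTrash /data/staging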
Luckily, the raw data had been archived and stored under another HDFS directory path. In
addition, the documentation was already in place from when the tedious task (which
required mathematical operations and time zone conversions in long, complex queries) was
first performed. What followed was just the exercise of automating the reprocessing of
the raw data and putting it back into the missing HDFS directory path.
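A recovery job along the following lines (paths and the script name are hypothetical)
restores the archived raw data to the deleted input path and then replays the documented
transformation:

    #!/bin/bash
    set -e
    # Copy the raw files out of the archive back to their original location.
    hdfs dfs -cp har:///backup/raw_2014.har/raw_events /data/raw_events
    # Replay the documented transformation, i.e., the long queries with the
    # mathematical operations and time zone conversions, from a saved script.
    hive -f rebuild_processed_tables.hql

Scheduling such a script (from cron or an Oozie workflow, for example) turns a one-off
recovery into a repeatable task, which is exactly where up-front documentation pays off.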