The Importance of Cloud Backup and Task Documentation
On a development team, a common user account in the cloud is often shared by multiple
developers working on the same project. Apart from the obvious advantage that everyone
can execute and monitor jobs in the cloud, sharing can have unwelcome consequences,
even with expert users and careful habits.
Accidentally deleting important processed data is one such consequence of a shared
account. Yet it is not always feasible to copy processed data somewhere else as a backup,
given the sheer size of some datasets and the network time required to complete the copy.
You can face exactly this situation when transforming data and loading it into a Hive
table. Hive is a distributed data warehouse technology from Apache; it is based on the
MapReduce framework and uses the Hadoop Distributed File System (HDFS) as its
underlying file system.
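As a minimal sketch of that workflow (all table names and paths here are hypothetical),
raw files sitting in HDFS can be exposed to Hive and transformed into a warehouse table
in two statements:

    # Expose the raw files in HDFS to Hive, then transform them into a table.
    hive -e "
      CREATE EXTERNAL TABLE raw_events (user_id STRING, event_time STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/raw_events';

      -- Transform while loading: convert UTC strings to a local time zone.
      CREATE TABLE processed_events STORED AS ORC AS
      SELECT user_id,
             from_utc_timestamp(event_time, 'America/Chicago') AS local_time
      FROM raw_events;
    "

The processed table is managed by Hive and lives in its warehouse directory on HDFS,
whereas raw_events merely points at the existing files.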
The decision that generally has to be made is whether to keep the unprocessed (raw) data
or to discard it. Not all data can be stored indefinitely, so the decision is governed by
three factors: the future relevance of the raw data, its size, and the time required to
archive it. We usually opt for compression and archival whenever there is even the
slightest chance that the data will be required in the future.
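One way to implement that policy is a Hadoop archive, sketched below with hypothetical
paths. A HAR file packs many small files into one (which relieves pressure on the
NameNode) but does not compress them, so raw text files are typically gzipped before or
while being written to HDFS:

    # Pack the raw directory into a single archive file under /backup.
    hadoop archive -archiveName raw_2014.har -p /data raw_events /backup

    # Verify that the archive is readable before trusting it as the only copy.
    hdfs dfs -ls har:///backup/raw_2014.har/raw_events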
Suppose that during an ongoing task you had to create a few intermediate external Hive
tables, which were then joined (per the task's criteria) to create another external Hive
table loaded with the processed data. An external table stores just the schema and the
location of the data; if the table is dropped, the underlying data stays intact and is
not removed. Now suppose another team member used the same shared account and
accidentally deleted the entire HDFS directory path that those tables were reading as
their input source. Having no knowledge of the Hive tables, and mistakenly assuming the
data could be discarded, the team member even cleared the HDFS trash directory.
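The sketch below (again with hypothetical names and paths) recreates the setup and shows
why the deletion was fatal: dropping an external table removes only its metastore entry,
while deleting its HDFS location removes the data itself:

    hive -e "
      -- Intermediate external tables over existing HDFS input directories.
      CREATE EXTERNAL TABLE staging_a (id STRING, val DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/staging/a';
      CREATE EXTERNAL TABLE staging_b (id STRING, ts STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/staging/b';

      -- Join the intermediate tables into the final table of processed data.
      CREATE EXTERNAL TABLE processed (id STRING, val DOUBLE, ts STRING)
      LOCATION '/data/processed';
      INSERT OVERWRITE TABLE processed
      SELECT a.id, a.val, b.ts
      FROM staging_a a JOIN staging_b b ON (a.id = b.id);
    "

    # DROP TABLE staging_a; would leave /data/staging/a intact, but this
    # destroys the input data; -skipTrash makes it unrecoverable, the
    # command-line equivalent of emptying the trash directory:
    hdfs dfs -rm -r -skipTrash /data/staging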
Luckily, the raw data had been archived and stored under another HDFS directory path. In
addition, the documentation was already in place from when the tedious task (which
required mathematical operations and time zone conversions in long, complex queries) was
first performed. What followed was just the exercise of automating the reprocessing of
the raw data and putting it back into the missing HDFS directory path.
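A recovery job along the following lines (paths and the script name are hypothetical)
restores the archived raw data to the deleted input path and then replays the documented
transformation:

    #!/bin/bash
    set -e
    # Copy the raw files out of the archive back to their original location.
    hdfs dfs -cp har:///backup/raw_2014.har/raw_events /data/raw_events
    # Replay the documented transformation, i.e., the long queries with the
    # mathematical operations and time zone conversions, from a saved script.
    hive -f rebuild_processed_tables.hql

Scheduling such a script (from cron or an Oozie workflow, for example) turns a one-off
recovery into a repeatable task, which is exactly where up-front documentation pays off.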