Big Data acquisition and processing have been discussed in prior chapters. While the technology
for acquiring the data is inexpensive, there are several complexities in managing and processing
the data within the new technology environments and in further integrating it with the RDBMS
or DBMS. A governance process and methodology is required to manage all of these processes.
Data in the Big Data world is largely created and stored as files. All the processes associated with
managing and processing data are file-based, whether in Hadoop or NoSQL environments. Although
the new technologies are specialized file systems, simply deleting files is not the right solution;
we need to create a robust and well-defined data retention and archiving strategy.
Processing data in the Big Data world uses enterprise metadata and master data. Given the volume
and the variety of data types to be processed, we need to implement policies to manage this
environment very closely to prevent any unwanted alterations.
Example: information life-cycle management and social media data
Social media data is one of the most popular data assets that organizations like to tap into to
get a clearer view of their customers and of the likes and dislikes those customers express about
their products, services, and competition, as well as the impact of sharing these sentiments.
The data for this exercise is extracted from social media channels and websites, Internet
forums and user communities, and consumer websites and reviews.
There is a lot of hidden value within the data from social media and there are several insights that
provide critical clues to the success or failure of a particular brand related to a campaign, product,
service, and more. The bigger questions are: What is the value of the data once the initial discoveries
have been completed? Is there any requirement to keep the data to actually monitor the trend? Or is a
statistical summary enough to accomplish the same result?
In the case of social media data, the lifetime value of the data is very short: from the time of
acquisition to insights, the entire process may take anywhere from a few hours to about a week. The
information life-cycle management policy for this data will be storage for two to four weeks for
the raw data, six to eight weeks for processed data sets after the processing of the raw data, and
then one to three years for summary data aggregated from processed data sets to be used in analytics.
The reason for this policy suggestion is that the raw data does not have value once it has been
processed with the different types of business rules, and the processed data sets do not carry value
beyond the additional eight weeks. Once its retention period ends, data is typically discarded from
the storage layer and is not required to be stored offline or offsite for any further reuse or reloading.
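
The retention windows described above can be sketched as a small policy table plus an expiry check.
This is only an illustrative sketch, not a prescribed implementation: the tier names, the function
names, and the choice to use the upper end of each suggested range (four weeks, eight weeks, three
years) are assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical policy table. The text gives ranges (2-4 weeks, 6-8 weeks,
# 1-3 years); this sketch assumes the upper end of each range.
RETENTION = {
    "raw": timedelta(weeks=4),            # raw social media extracts
    "processed": timedelta(weeks=8),      # counted from the processing date
    "summary": timedelta(days=3 * 365),   # aggregates kept for analytics
}


def is_expired(tier: str, created: date, today: date) -> bool:
    """Return True when a data set has outlived the window for its tier."""
    return today - created > RETENTION[tier]


def select_for_purge(datasets, today: date):
    """Return the names of data sets to discard.

    `datasets` is an iterable of (name, tier, created) tuples, where
    `created` is the acquisition date for raw data and the processing or
    aggregation date for the other tiers.
    """
    return [
        name
        for name, tier, created in datasets
        if is_expired(tier, created, today)
    ]
```

For instance, with a raw extract loaded two months ago, a processed set built one month ago, and a
summary built two and a half years ago, only the raw extract would be selected for purging. Because
the text says discarded data need not be kept offline or offsite, the sketch only selects data sets
for deletion and has no archiving step.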
Without creating a governance policy on information life-cycle management for data, we will end
up collecting a lot of data that has no business impact or value, and we will waste processing
and computing cycles. A strong information life-cycle management policy is necessary for ensuring
the success of Big Data.
From this example we can see that managing the information life cycle for Big Data is similar to
managing it for any other data, but some areas need special attention and can negatively impact
your Big Data program if they are not implemented. Let us look at each aspect in the following section.
Governance
From a Big Data perspective you need both data and program governance to be implemented.
Data governance
Data retention:
What data should be retained?