Information Management and Life Cycle for Big Data - Data Warehousing in the Age of Big Data


Big Data acquisition and processing have been discussed in prior chapters. While the technology for acquiring the data is inexpensive, managing and processing the data within the new technology environments, and then integrating it with the RDBMS or DBMS, involves several complexities. A governance process and methodology are required to manage all of these processes.

Data in the Big Data world is largely created and stored as files. All the processes associated with managing and processing data are file-based, whether in Hadoop or NoSQL environments. Although the new technologies are specialized file systems, simply deleting files is not the right solution; we need a robust and well-defined data retention and archiving strategy.

Processing data in the Big Data world uses enterprise metadata and master data. Given the volume and the types of data to be processed, we need policies that manage this environment very closely to prevent any unwanted alterations.

Example: information life-cycle management and social media data

Social media data is one of the most popular data assets, and organizations like to tap into it to get a clearer view of their customers: the likes and dislikes those customers express about products, services, and the competition, and the impact of sharing these sentiments. The data for this exercise is extracted from social media channels and websites, Internet forums and user communities, and consumer websites and reviews.

There is a lot of hidden value within social media data, and several of the insights it yields provide critical clues to the success or failure of a particular brand in relation to a campaign, product, service, and more. The bigger questions are: What is the value of the data once the initial discoveries have been completed? Is there any requirement to keep the data to actually monitor the trend? Or is a statistical summary enough to accomplish the same result?

In the case of social media data, the lifetime value of the data is very short: from acquisition to insights, the entire process may take anywhere from a few hours to about a week. The information life-cycle management policy for this data would be to store the raw data for two to four weeks, the processed data sets for six to eight weeks after the raw data has been processed, and the summary data aggregated from the processed data sets for one to three years, for use in analytics. The reasoning behind this policy is that the raw data has no value once the business rules have been applied to it, and the processed data sets carry no value beyond those additional eight weeks. Expired data is typically discarded from the storage layer and does not need to be stored offline or offsite for further reuse or reloading.
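The retention windows described above can be expressed as a small policy table. The following Python sketch is illustrative only: the tier names, the `is_expired` and `select_for_discard` helpers, the sample catalog, and the choice to use the upper bound of each window (four weeks, eight weeks, three years counted as 3 × 365 days) are all assumptions, not part of the original policy. It also assumes each data set's age is measured from its own creation date.

```python
from datetime import date, timedelta
from enum import Enum


class Tier(Enum):
    RAW = "raw"              # data as extracted from social media channels
    PROCESSED = "processed"  # data sets produced by applying business rules
    SUMMARY = "summary"      # aggregates kept for analytics


# Assumption: use the upper bound of each retention window from the text.
RETENTION = {
    Tier.RAW: timedelta(weeks=4),
    Tier.PROCESSED: timedelta(weeks=8),
    Tier.SUMMARY: timedelta(days=3 * 365),
}


def is_expired(tier: Tier, created: date, today: date) -> bool:
    """Return True once a data set has outlived its tier's retention window."""
    return today - created > RETENTION[tier]


def select_for_discard(datasets, today: date):
    """Return the names of data sets whose retention window has passed.

    `datasets` is an iterable of (name, tier, created) tuples.
    """
    return [name for name, tier, created in datasets
            if is_expired(tier, created, today)]


# Hypothetical catalog entries, for illustration only.
today = date(2024, 6, 1)
catalog = [
    ("posts_raw_2024_04_15", Tier.RAW, date(2024, 4, 15)),
    ("posts_raw_2024_05_20", Tier.RAW, date(2024, 5, 20)),
    ("sentiment_scored_2024_04_15", Tier.PROCESSED, date(2024, 4, 15)),
    ("brand_summary_2021_01", Tier.SUMMARY, date(2021, 1, 31)),
]
to_discard = select_for_discard(catalog, today)
print(to_discard)
```

Keeping the windows in a single table means a governance decision, such as shortening raw-data retention to two weeks, is a one-line change rather than logic scattered across deletion jobs.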

Without a governance policy on information life-cycle management, we will end up collecting a lot of data that has no business impact or value, and wasting processing and computing cycles on it. A strong information life-cycle management policy is necessary for ensuring the success of Big Data.

From this example we can see that managing the information life cycle for Big Data is similar to managing it for any other data, but some areas need special attention and can negatively impact your Big Data program if they are not implemented. Let us look at each aspect in the following section.

Governance

From a Big Data perspective, both data governance and program governance need to be implemented.

Data governance

● Data retention:

    ● What data should be retained?
