environment. There are several options for deploying the physical architecture, each with its own pros and cons.
The primary challenges confronting the physical architecture of the next-generation data warehouse platform include data loading, availability, data volume, storage performance, scalability, diverse and changing query demands against the data, and the operational cost of maintaining the environment. These key challenges are outlined here and will be discussed with each architecture option.
Data loading
With no definitive format, metadata, or schema, the loading process for Big Data amounts to acquiring the data and storing it as files. This task can become overwhelming when real-time feeds must be ingested into the system while the data is processed in large-batch or micro-batch windows. An appliance can be configured and tuned to address these rigors during setup, as opposed to a pure-play implementation. The downside is that a custom architecture configuration may be required, but this can be managed.
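As a concrete illustration of this acquire-and-store loading style, the sketch below groups an incoming feed into micro-batch windows and lands each window as a raw file in a date-partitioned directory. It is a minimal Python sketch, not tied to any appliance or product; the names LANDING_ROOT, flush_batch, and load_feed are assumptions for illustration only.

```python
import json
import time
from pathlib import Path

# Minimal micro-batch loading sketch. LANDING_ROOT, flush_batch, and
# load_feed are illustrative names, not part of any specific product.
LANDING_ROOT = Path("landing/raw")

def flush_batch(records, batch_id, root=LANDING_ROOT):
    """Land one micro-batch as a raw file; no schema is imposed on load."""
    target = root / time.strftime("%Y/%m/%d")
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"batch-{batch_id:06d}.json"
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path

def load_feed(feed, batch_size=1000, root=LANDING_ROOT):
    """Group an incoming feed into micro-batch windows and land each one."""
    batch, batch_id, written = [], 0, []
    for record in feed:
        batch.append(record)
        if len(batch) >= batch_size:
            written.append(flush_batch(batch, batch_id, root))
            batch, batch_id = [], batch_id + 1
    if batch:  # flush the final, partially filled window
        written.append(flush_batch(batch, batch_id, root))
    return written
```

Because nothing is transformed on the way in, the only decisions here are the batch-window size and the partitioning of the landing directory; everything else is deferred to read time.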
Continuous processing of data in the platform can create contention for resources over time. This is especially true for large documents, videos, or images. If this requirement is a key architecture driver, an appliance can be well suited to this specific workload, as the guessing game can be avoided in the configuration and setup process.
MapReduce configuration and optimization can be daunting in large environments, and the appliance architecture provides reference architecture setups to avoid this pitfall.
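The map and reduce phases that such configurations tune can be illustrated in miniature. The following is a toy, single-process word count in Python; a real Hadoop job distributes the same three steps (map, shuffle, reduce) across a cluster, which is where the configuration burden arises.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each raw document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big platform", "data platform"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "platform": 2}
```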
Data availability
Data availability has been a challenge for any system that processes and transforms data for use by end users, and Big Data is no exception. The benefit of Hadoop or NoSQL is that they mitigate this risk by making data available for analysis immediately upon acquisition. Since no pretransformation is required, the challenge shifts to loading the data quickly.
Data availability depends on the specificity of the metadata supplied to the SerDe or Avro layers. If data can be adequately cataloged on acquisition, it is available for analysis and discovery immediately.
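A minimal sketch of cataloging on acquisition, assuming the catalog is a plain Python dict standing in for a real metadata store (such as a Hive metastore backing a SerDe, or an Avro schema registry); the function and field names are illustrative:

```python
import hashlib
import time
from pathlib import Path

def catalog_file(path, source, fmt, catalog):
    """Record minimal metadata for a landed file so it is discoverable
    immediately. The catalog argument is a plain dict standing in for a
    real metadata store; the field names are illustrative assumptions."""
    data = Path(path).read_bytes()
    entry = {
        "source": source,              # originating feed or system
        "format": fmt,                 # hint for the SerDe/Avro layer
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "acquired_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    catalog[str(path)] = entry
    return entry
```

The point is that even this small amount of metadata, captured at acquisition time, is enough for downstream readers to find, interpret, and verify a file without any upfront transformation.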
Since data is never updated in place in the Big Data layers, reprocessing new data that contains updates creates duplicates, and this needs to be handled to minimize the impact on availability.
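One common way to handle such duplicates on the read side is a latest-version-wins collapse per business key. A minimal sketch, assuming each record carries an id and an ingested_at field (both names are illustrative assumptions):

```python
def latest_per_key(records, key="id", version="ingested_at"):
    """Collapse duplicates created by reprocessing: because the store is
    append-only, updates arrive as new records, and readers keep only
    the most recent version per business key. Field names are
    illustrative, not a fixed convention."""
    current = {}
    for record in records:
        k = record[key]
        if k not in current or record[version] > current[k][version]:
            current[k] = record
    return list(current.values())
```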
Data volumes
Big Data volumes can easily get out of control due to the intrinsic nature of the data. Care and attention need to be paid to the growth of data with each cycle of acquisition.
Retention requirements for the data can vary depending on its nature, its recency, and its relevance to the business:
Compliance requirements: Safe Harbor, SOX, HIPAA, GLBA, and PCI regulations can impact data security and storage. If you are planning to use these data types, plan accordingly.
Legal mandates: Several transactional data sets that were not stored online have been required by courts of law for discovery purposes in class-action lawsuits. The Big Data infrastructure can be used as the storage engine for this data type, but the data mandates certain compliance needs and additional security. This data volume can impact overall performance, and if such data sets are being processed on the Big Data platform, the appliance