environment. There are several options for deploying the physical architecture, each with its own pros and cons.
The primary challenges confronting the physical architecture of the next-generation data warehouse platform include data loading, availability, data volume, storage performance, scalability, diverse and changing query demands against the data, and the operational cost of maintaining the environment. These key challenges are outlined here and will be discussed with each architecture option.
Data loading
With no definitive format, metadata, or schema, the loading process for Big Data amounts to acquiring the data and storing it as files. This task can become overwhelming when real-time feeds must be ingested into the system while the data is processed in large-batch or micro-batch windows. An appliance can be configured and tuned to address these rigors during setup, as opposed to a pure-play implementation. The downside is that a custom architecture configuration may be required, but this can be managed.
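As a concrete illustration of this acquire-and-store loading style, the sketch below groups an incoming feed into micro-batch windows and lands each window as a raw file in a date-partitioned directory. It is a minimal Python sketch, not tied to any appliance or product; the names LANDING_ROOT, flush_batch, and load_feed are assumptions for illustration only.

```python
import json
import time
from pathlib import Path

# Minimal micro-batch loading sketch. LANDING_ROOT, flush_batch, and
# load_feed are illustrative names, not part of any specific product.
LANDING_ROOT = Path("landing/raw")

def flush_batch(records, batch_id, root=LANDING_ROOT):
    """Land one micro-batch as a raw file; no schema is imposed on load."""
    target = root / time.strftime("%Y/%m/%d")
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"batch-{batch_id:06d}.json"
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path

def load_feed(feed, batch_size=1000, root=LANDING_ROOT):
    """Group an incoming feed into micro-batch windows and land each one."""
    batch, batch_id, written = [], 0, []
    for record in feed:
        batch.append(record)
        if len(batch) >= batch_size:
            written.append(flush_batch(batch, batch_id, root))
            batch, batch_id = [], batch_id + 1
    if batch:  # flush the final, partially filled window
        written.append(flush_batch(batch, batch_id, root))
    return written
```

Because nothing is transformed on the way in, the only decisions here are the batch-window size and the partitioning of the landing directory; everything else is deferred to read time.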
Continuous processing of data in the platform can create contention for resources over time. This is especially true for large documents, videos, or images. If this requirement is a key architecture driver, an appliance can be well suited to this specific workload, as the guessing game can be avoided in the configuration and setup process.
MapReduce configuration and optimization can be daunting in large environments, and the appliance architecture provides reference architecture setups to avoid this pitfall.
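The map and reduce phases that such configurations tune can be illustrated in miniature. The following is a toy, single-process word count in Python; a real Hadoop job distributes the same three steps (map, shuffle, reduce) across a cluster, which is where the configuration burden arises.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each raw document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big platform", "data platform"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "platform": 2}
```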
Data availability
Data availability has been a challenge for any system that processes and transforms data for use by end users, and Big Data is no exception. The benefit of Hadoop or NoSQL is that they mitigate this risk by making data available for analysis immediately upon acquisition. Since no pretransformation is required, the challenge shifts to loading the data quickly.
Data availability depends on the specificity of the metadata supplied to the SerDe or Avro layers. If data can be adequately cataloged on acquisition, it is available for analysis and discovery immediately.
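A minimal sketch of cataloging on acquisition, assuming the catalog is a plain Python dict standing in for a real metadata store (such as a Hive metastore backing a SerDe, or an Avro schema registry); the function and field names are illustrative:

```python
import hashlib
import time
from pathlib import Path

def catalog_file(path, source, fmt, catalog):
    """Record minimal metadata for a landed file so it is discoverable
    immediately. The catalog argument is a plain dict standing in for a
    real metadata store; the field names are illustrative assumptions."""
    data = Path(path).read_bytes()
    entry = {
        "source": source,              # originating feed or system
        "format": fmt,                 # hint for the SerDe/Avro layer
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "acquired_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    catalog[str(path)] = entry
    return entry
```

The point is that even this small amount of metadata, captured at acquisition time, is enough for downstream readers to find, interpret, and verify a file without any upfront transformation.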
Since data is never updated in place in the Big Data layers, reprocessing new data that contains updates creates duplicates, and this needs to be handled to minimize the impact on availability.
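One common way to handle such duplicates on the read side is a latest-version-wins collapse per business key. A minimal sketch, assuming each record carries an id and an ingested_at field (both names are illustrative assumptions):

```python
def latest_per_key(records, key="id", version="ingested_at"):
    """Collapse duplicates created by reprocessing: because the store is
    append-only, updates arrive as new records, and readers keep only
    the most recent version per business key. Field names are
    illustrative, not a fixed convention."""
    current = {}
    for record in records:
        k = record[key]
        if k not in current or record[version] > current[k][version]:
            current[k] = record
    return list(current.values())
```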
Data volumes
Big Data volumes can easily get out of control due to the intrinsic nature of the data. Care and attention need to be paid to the growth of data with each cycle of acquisition.
Retention requirements for the data can vary depending on its nature, its recency, and its relevance to the business:
Compliance requirements: Safe Harbor, SOX, HIPAA, GLBA, and PCI regulations can impact data security and storage. If you are planning to use these data types, plan accordingly.
Legal mandates: Several transactional data sets that were not stored online have been required by courts of law for discovery purposes in class-action lawsuits. The Big Data infrastructure can be used as the storage engine for this data type, but the data mandates certain compliance needs and additional security. This data volume can impact overall performance, and if such data sets are being processed on the Big Data platform, the appliance