Big Data Processing Architectures - Data Warehousing in the Age of Big Data

● Databases can be abstracted from the physical layer for tuning the architecture.

● Databases cannot handle the processing of documents or semi-structured types of data.

● Procedural-language or other programming-language interfaces on the database add processing overhead and often end up processing data outside the database. This requires cycles of moving vast amounts of data, and the problem will be magnified by unstructured and other new data types.

To provide a robust processing approach for the additional data, the IT team made the following infrastructure and processing recommendations.

Infrastructure

To process data other than structured data, along with volumes beyond the current data, a combination of heterogeneous technologies is recommended. The solution architecture will include the following types of technologies:

● Hadoop, NoSQL, or similar data processing platforms built on nonrelational and file system-based architectures.

● The MapReduce programming model will be implemented for managing data processing and transformation.

● Data discovery and analysis will be implemented using Tableau or Datameer software, which abstracts the complexities of MapReduce and works directly on Hadoop for data integration and management.

● Analytics on Hadoop will be implemented using R, Predixion, and other competing technologies capable of MapReduce integration and management.

● In-memory data processing solutions like QlikView need to be tested further for advanced reporting requirements, depending on the success and adoption of the new stack of technologies.

● The hardware infrastructure will run on a commodity platform based on multicore processors with up to 96 GB of RAM.

● The disk architecture for the new infrastructure will be based not on a storage area network (SAN) but on direct-attached storage (DAS).

● A redundant configuration will be set up for failover.

● A landing zone will be available on the existing server with unlimited storage. The storage will be designed for high capacity rather than high performance.

● Security for the raw data will be implemented using the current disk storage access policies.

● Security rules for nonrelational data postprocessing will follow the existing rules for EDW data in the LDAP repository (the integrated single sign-on security process).
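
The MapReduce programming model named in the list above can be illustrated with a minimal in-process sketch. It is written in Python for brevity and does not use Hadoop. The sample log lines and the function names (map_line, shuffle, reduce_counts, run_job) are assumptions made for illustration, not part of the recommended stack.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample input: raw log lines as they might sit in a landing zone.
LOG_LINES = [
    "2013-01-05 ERROR payment timeout",
    "2013-01-05 INFO login ok",
    "2013-01-06 ERROR disk full",
]


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (severity, 1) for each line that has a severity field."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for record in records for pair in mapper(record))
    grouped = shuffle(pairs)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


result = run_job(LOG_LINES, map_line, reduce_counts)
print(result)  # {'ERROR': 2, 'INFO': 1}
```

On a cluster, the framework performs the shuffle and distributes the map and reduce calls across nodes, so only the two functions that carry the processing logic need to be written.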

Data processing

● Processing of different types of data will be assigned to different clusters of systems.

● Documents and text data will be processed using discovery rules. The result set will be a structured output of tags and keywords, occurrences, counts, and processing dates.

● Audit, balance, and control will be implemented for tracing data processing across layers.

● Business rules will be implemented programmatically with MapReduce and with programming languages that scale and perform well, such as Java or Ruby.
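
The discovery-rule processing of documents and text described in the list above can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the rule set (DISCOVERY_RULES), the sample documents, and the output field names are hypothetical, and a production job would run the same logic as map and reduce steps on the cluster.

```python
import re
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical discovery rules: each tag maps to the keywords that signal it.
DISCOVERY_RULES: Dict[str, List[str]] = {
    "billing": ["invoice", "refund"],
    "support": ["outage", "complaint"],
}


def discover(documents: List[str], processing_date: date) -> List[Dict[str, object]]:
    """Turn free text into structured rows of tag, keyword, and counts."""
    occurrences: Counter = Counter()  # total keyword hits across all documents
    doc_counts: Counter = Counter()   # number of documents containing the keyword
    for text in documents:
        words = Counter(re.findall(r"[a-z]+", text.lower()))
        for tag, keywords in DISCOVERY_RULES.items():
            for keyword in keywords:
                hits = words[keyword]
                if hits:
                    key: Tuple[str, str] = (tag, keyword)
                    occurrences[key] += hits
                    doc_counts[key] += 1
    return [
        {
            "tag": tag,
            "keyword": keyword,
            "occurrences": total,
            "document_count": doc_counts[(tag, keyword)],
            "processing_date": processing_date.isoformat(),
        }
        for (tag, keyword), total in sorted(occurrences.items())
    ]


# Hypothetical sample documents.
DOCS = [
    "Customer requested a refund; the invoice was wrong. Refund issued.",
    "Outage reported, complaint filed about the outage.",
]

rows = discover(DOCS, date(2013, 1, 5))
for row in rows:
    print(row)
```

Each output row is a flat record, so the result set can be loaded into a relational table alongside the existing EDW data.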
