21.2.4 In-Memory Computing
The idea of running databases in memory was used early on by the business intelligence (BI) product QlikView. In-memory computing allows massive quantities of data to be processed in main memory, providing immediate results from analysis and transactions. The data to be processed is ideally real-time data, or as close to real time as is technically possible. Data in main memory (RAM) can be accessed on the order of 100,000 times faster than data on a hard disk; this can dramatically reduce the time needed to retrieve data and make it available for reporting, analytics solutions, or other applications.
The medium used by a database to store data, in this case RAM, is divided into pages. In-memory databases save changed pages in savepoints, which are asynchronously written to persistent storage at regular intervals. Each committed transaction generates a log entry that is written to nonvolatile storage; this log is written synchronously. In other words, a transaction does not return before the corresponding log entry has been written to persistent storage, in order to meet the durability requirement that was described earlier, thus ensuring that in-memory databases meet (and pass) the ACID test (see Section 5.7, "Transaction Processing Monitors," for a note on ACID). After a power failure, the database pages are restored from the savepoints, and the database logs are applied to restore the changes that were not captured in the savepoints. This ensures that the database can be restored in memory to exactly the same state as before the power failure.
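
To make the savepoint-plus-log mechanism concrete, the following is a minimal Python sketch of the pattern described above: changes are applied in memory, each commit is synchronously appended to a log, savepoints snapshot the in-memory pages to persistent storage, and recovery loads the last savepoint and replays the log. The class name, file names, and JSON record format are illustrative assumptions, not the internals of any particular in-memory database product.

import json
import os

class MiniInMemoryStore:
    """Toy key-value store illustrating savepoints plus a synchronous commit log."""

    def __init__(self, savepoint_path="savepoint.json", log_path="commit.log"):
        self.savepoint_path = savepoint_path
        self.log_path = log_path
        self.pages = {}                      # all data lives in RAM
        self.log = open(log_path, "a", encoding="utf-8")

    def commit(self, key, value):
        """Apply a change in memory and synchronously append it to the log."""
        self.pages[key] = value
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())          # the transaction returns only after
                                             # the log entry is on persistent storage

    def savepoint(self):
        """Write the current in-memory pages to persistent storage.

        A real system does this asynchronously at regular intervals; here it is a
        plain method call, and the log is truncated once the snapshot is safe.
        """
        tmp = self.savepoint_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.pages, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.savepoint_path)
        self.log.truncate(0)

    def recover(self):
        """Rebuild memory: load the last savepoint, then replay the log."""
        self.pages = {}
        if os.path.exists(self.savepoint_path):
            with open(self.savepoint_path, encoding="utf-8") as f:
                self.pages = json.load(f)
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.pages[record["key"]] = record["value"]

The key design point mirrored here is that durability comes from the synchronous log write at commit time, while the savepoint is only an optimization that bounds how much of the log must be replayed after a failure.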
21.2.5 Developing Big Data Applications
For most big data applications, the ability to scale to accommodate growing data volumes is predicated on multiprocessing: distributing the computation across a collection of computing nodes in ways that are aligned with the distribution of data across the storage nodes. One of the key objectives of using a multiprocessing node environment is to speed application execution by breaking large chunks of work into much smaller ones that can be farmed out to a pool of available processing nodes. In the best of all possible worlds, the data sets to be consumed and analyzed are also distributed across a pool of storage nodes. As long as there are no dependencies forcing any one specific task to wait to begin until another specific one ends, these smaller tasks can be executed at the same time; this is task parallelism, illustrated in the sketch below. More than just scalability, it is the concept of automated scalability that has generated the present surge of interest in big data analytics (with a corresponding optimization of costs).
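
The following is a minimal Python sketch of task parallelism: a data set is partitioned into independent chunks, the chunks are farmed out to a pool of worker processes, and the partial results are merged at the end. The word-counting task, chunk size, and pool size are illustrative assumptions; a real big data platform would distribute such tasks across many computing and storage nodes rather than local processes on one machine.

from multiprocessing import Pool

def count_words(chunk):
    """Independent task: count word occurrences in one partition of the data."""
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    """Combine the partial results produced by the parallel tasks."""
    total = {}
    for counts in partials:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    # Stand-in for a large data set spread across storage.
    lines = ["big data needs big storage", "data in memory is fast",
             "tasks run in parallel"] * 1000
    chunk_size = 500
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    # Because no chunk depends on another, all chunks can be processed at the
    # same time by the pool of worker processes.
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, chunks)

    print(merge(partials))

Because the tasks share no state and communicate only through their inputs and outputs, adding more workers (or more nodes, in a distributed setting) increases throughput without changing the application logic, which is the essence of the automated scalability discussed above.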
A good development framework will simplify the process of developing,
executing, testing, and debugging new application code, and this framework
should include