In the earlier chapters we discussed how newer infrastructures and technologies
like Hadoop, NoSQL, and parallel processing platforms are solving the challenges
of processing massive amounts of data in a shorter time and with lower cost. Now,
Hadoop has almost become the de facto standard for many batch-processing
analytics applications. While Hadoop and MapReduce in general do a good job
of processing massive amounts of data through parallel batch processing, they were not
designed to serve the real-time part of the business.
Before we dive deeper into architectural constructs and discuss solutions, let's
review a few key concepts.
In-Memory Database Grids. In-memory data grids were originally designed to
complement traditional databases by allowing critical pieces of fast-changing data and
application logic to operate at the memory layer with much higher throughput and
lower latency. An in-memory database grid stores data as objects in memory, avoiding
expensive disk round trips. The data model is usually object-oriented (serialized) and
non-relational, organized as collections of logically related objects that can be rapidly
created, updated, read, and removed. A common implementation scenario for an in-memory
data grid is as a “distributed cache” in front of one or more databases. Most in-memory data
grids are built on Java, allowing the grid to run embedded inside the application server cluster,
eliminating much of the traffic to the database servers.
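As a concrete illustration, the following sketch embeds a grid node inside the application JVM and uses a distributed map as a cache in front of the database. It uses Hazelcast, one of the products listed below, purely as an example; the map name and keys are illustrative, and the imports assume Hazelcast 4.x or later.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

// Minimal sketch: a grid node embedded in the application JVM, with a
// partitioned map used as a distributed cache. Names are illustrative only.
public class GridCacheExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(); // joins or forms the cluster
        IMap<String, String> customers = hz.getMap("customers"); // data partitioned across members

        customers.put("cust-1001", "Alice");        // object stays in memory, no disk round trip
        String name = customers.get("cust-1001");   // served from the grid, not the database
        System.out.println(name);

        hz.shutdown();
    }
}

Because the node runs embedded in the application server, reads and writes like these stay inside the cluster's memory rather than going to the database on every request.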
Using main memory as a storage area instead of disk is not a new idea. There
are numerous examples of main memory databases being used effectively, as they perform
much faster than disk-based databases. When you send an SMS or call someone, most mobile
service providers use a main memory database to look up your contact information
as quickly as possible. The software on your cell phone also uses a main memory database
to show caller details, including the picture.
There are many in-memory data grid products, both commercial and open source.
Some of the most commonly used products are Oracle Coherence, IBM WebSphere
eXtreme Scale, Hazelcast, JBoss Infinispan, GridGain, VMware GemFire,
GigaSpaces XAP, and Terracotta Ehcache and BigMemory.
Distributed caching products like Memcached provide a simple, high-performance,
in-memory key-value store. Scalability is addressed by making servers completely
independent of each other; the client, configured with a list of all servers, ties the data on
the servers together. A hash function on each client maps keys to servers, which keeps data
placement consistent only as long as all clients have identical server lists. Data
consistency becomes a concern when different clients have different server lists or different
hash functions. Distributed caches have no built-in support for replication and no
native support for high availability, so any network partition or server crash leads to a loss of
availability.
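The client-side partitioning just described can be sketched in a few lines. The KeyRouter class below is a hypothetical helper, not the Memcached client API: it maps each key to a server with a shared hash function, which only yields a consistent view if every client uses the same server list and the same hash function.

import java.util.List;

// Hypothetical sketch of memcached-style client-side partitioning: every client
// holds the same server list and applies the same hash function to pick the
// server for a key. Clients agree on placement only as long as both match.
public class KeyRouter {
    private final List<String> servers;   // e.g. "cache1:11211", "cache2:11211"

    public KeyRouter(List<String> servers) {
        this.servers = servers;
    }

    // Pick the server responsible for a given key.
    public String serverFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(bucket);
    }

    public static void main(String[] args) {
        KeyRouter router = new KeyRouter(
                List.of("cache1:11211", "cache2:11211", "cache3:11211"));
        System.out.println("user:42 -> " + router.serverFor("user:42"));
        // A client with a different server list would compute a different bucket,
        // which is exactly the consistency problem described above.
    }
}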
In contrast, the nodes of an in-memory data grid are fully clustered and always aware of each
other. They use a variety of algorithms to establish distributed consensus and provide
stronger consistency guarantees. In addition, in-memory data grids provide
support for distributed transactions, scatter-gather parallel query processing, tiered
caching, publish-subscribe event processing, a framework to integrate data with existing
databases, replication over wide area networks, and more.
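To make two of these features concrete, the sketch below again uses Hazelcast as an illustrative stand-in: a scatter-gather query whose predicate is evaluated in parallel on every member, and a publish-subscribe topic whose listeners receive published events cluster-wide. The map and topic names and the Order class are assumptions for the example, not part of any particular product's data model.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.query.Predicates;
import com.hazelcast.topic.ITopic;

import java.io.Serializable;
import java.util.Collection;

// Sketch of scatter-gather querying and publish-subscribe eventing on a grid,
// using Hazelcast 4.x+ APIs as an example.
public class GridFeaturesExample {

    public static class Order implements Serializable {
        private final String status;
        public Order(String status) { this.status = status; }
        public String getStatus() { return status; }   // queried by attribute name below
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Scatter-gather: the predicate is evaluated on each member's partitions
        // in parallel and the matching values are gathered back on the caller.
        IMap<String, Order> orders = hz.getMap("orders");
        orders.put("o-1", new Order("OPEN"));
        Collection<Order> open = orders.values(Predicates.equal("status", "OPEN"));
        System.out.println("open orders: " + open.size());

        // Publish-subscribe: every listener registered on the topic, anywhere in
        // the cluster, receives the published event.
        ITopic<String> alerts = hz.getTopic("alerts");
        alerts.addMessageListener(msg -> System.out.println("event: " + msg.getMessageObject()));
        alerts.publish("cache refreshed");

        hz.shutdown();
    }
}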
In-memory data grids also enable new computing paradigms for cloud, complex
event processing and data analysis. Cloud deployments promise dynamic scalability
irrespective of the spikes in capacity. When spikes occur, the automatic detection and
 