In the earlier chapters we discussed how newer infrastructures and technologies
like Hadoop, NoSQL, and parallel processing platforms are solving the challenges
of processing massive amounts of data in a shorter time and with lower cost. Now,
Hadoop has almost become the de facto standard for many batch-processing
analytics applications. While Hadoop and MapReduce in general do a good job
of processing massive amounts of data through parallel batch processing, they were not
designed to serve the real-time part of the business.
Before we dive deeper into architectural constructs and discuss solutions, let's
review a few key concepts.
In-Memory Database Grids. In-memory data grids were originally designed to
complement traditional databases by allowing critical pieces of fast-changing data and
application logic to operate at the memory layer with much higher throughput and
lower latency. An in-memory database grid stores data as objects in memory, avoiding
expensive disk round trips. The data model is usually object-oriented (serialized) and
non-relational, organized as collections of logically related objects that can be rapidly
created, updated, read, and removed. A common implementation scenario for an in-memory
data grid is as a “distributed cache” in front of one or more databases. Most in-memory data
grids are built on Java, allowing the grid to run embedded inside the application server cluster,
eliminating much of the traffic to the database servers.
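As a concrete illustration, the following sketch embeds a grid node inside the application JVM and uses a distributed map as a cache in front of the database. It uses Hazelcast, one of the products listed below, purely as an example; the map name and keys are illustrative, and the imports assume Hazelcast 4.x or later.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

// Minimal sketch: a grid node embedded in the application JVM, with a
// partitioned map used as a distributed cache. Names are illustrative only.
public class GridCacheExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(); // joins or forms the cluster
        IMap<String, String> customers = hz.getMap("customers"); // data partitioned across members

        customers.put("cust-1001", "Alice");        // object stays in memory, no disk round trip
        String name = customers.get("cust-1001");   // served from the grid, not the database
        System.out.println(name);

        hz.shutdown();
    }
}

Because the node runs embedded in the application server, reads and writes like these stay inside the cluster's memory rather than going to the database on every request.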
Using main memory as a storage area instead of disk is not a new idea. There
are numerous examples of main memory databases being used effectively, as they perform
much faster than disk-based databases. When you send an SMS or call someone, most mobile
service providers use a main memory database to look up your contact information
as quickly as possible. The software on your cell phone also uses a main memory database
to show caller details, including the picture.
There are many in-memory data grid products, both commercial and open source.
Some of the most commonly used products are Oracle Coherence, IBM WebSphere
eXtreme Scale, Hazelcast, JBoss Infinispan, GridGain, VMware GemFire,
GigaSpaces XAP, and Terracotta Ehcache and BigMemory.
Distributed caching products like Memcached provide a simple, high-performance,
in-memory key-value store. Scalability is addressed by making servers completely
independent of each other; the client, configured with a list of all servers, ties the data on
the servers together. A hash function on each client maps keys to servers, which keeps data
placement consistent only as long as all clients have identical server lists. Data
consistency becomes a concern when different clients have different server lists or different
hash functions. Distributed caches have no built-in support for replication and no
native support for high availability, so any network partition or server crash leads to a loss of
availability.
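The client-side partitioning just described can be sketched in a few lines. The KeyRouter class below is a hypothetical helper, not the Memcached client API: it maps each key to a server with a shared hash function, which only yields a consistent view if every client uses the same server list and the same hash function.

import java.util.List;

// Hypothetical sketch of memcached-style client-side partitioning: every client
// holds the same server list and applies the same hash function to pick the
// server for a key. Clients agree on placement only as long as both match.
public class KeyRouter {
    private final List<String> servers;   // e.g. "cache1:11211", "cache2:11211"

    public KeyRouter(List<String> servers) {
        this.servers = servers;
    }

    // Pick the server responsible for a given key.
    public String serverFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(bucket);
    }

    public static void main(String[] args) {
        KeyRouter router = new KeyRouter(
                List.of("cache1:11211", "cache2:11211", "cache3:11211"));
        System.out.println("user:42 -> " + router.serverFor("user:42"));
        // A client with a different server list would compute a different bucket,
        // which is exactly the consistency problem described above.
    }
}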
In contrast, the nodes of an in-memory data grid are fully clustered and always aware of each
other. They use a variety of algorithms to establish distributed consensus and provide
stronger consistency guarantees. In addition, in-memory data grids provide
support for distributed transactions, scatter-gather parallel query processing, tiered
caching, publish-subscribe event processing, a framework to integrate data with existing
databases, replication over wide area networks, and more.
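To make two of these features concrete, the sketch below again uses Hazelcast as an illustrative stand-in: a scatter-gather query whose predicate is evaluated in parallel on every member, and a publish-subscribe topic whose listeners receive published events cluster-wide. The map and topic names and the Order class are assumptions for the example, not part of any particular product's data model.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.query.Predicates;
import com.hazelcast.topic.ITopic;

import java.io.Serializable;
import java.util.Collection;

// Sketch of scatter-gather querying and publish-subscribe eventing on a grid,
// using Hazelcast 4.x+ APIs as an example.
public class GridFeaturesExample {

    public static class Order implements Serializable {
        private final String status;
        public Order(String status) { this.status = status; }
        public String getStatus() { return status; }   // queried by attribute name below
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Scatter-gather: the predicate is evaluated on each member's partitions
        // in parallel and the matching values are gathered back on the caller.
        IMap<String, Order> orders = hz.getMap("orders");
        orders.put("o-1", new Order("OPEN"));
        Collection<Order> open = orders.values(Predicates.equal("status", "OPEN"));
        System.out.println("open orders: " + open.size());

        // Publish-subscribe: every listener registered on the topic, anywhere in
        // the cluster, receives the published event.
        ITopic<String> alerts = hz.getTopic("alerts");
        alerts.addMessageListener(msg -> System.out.println("event: " + msg.getMessageObject()));
        alerts.publish("cache refreshed");

        hz.shutdown();
    }
}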
In-memory data grids also enable new computing paradigms for cloud, complex
event processing and data analysis. Cloud deployments promise dynamic scalability
irrespective of the spikes in capacity. When spikes occur, the automatic detection and
 