limitations that exist in traditional RDBMS platforms. In the new architecture the RDBMS,
DBMS, and NoSQL technologies each have a role, and can be deployed as needed to solve a
given requirement.
Distributed file-based storage—data is stored in files, which is cheaper than storing it in a
database. Additionally, data is distributed across systems, providing built-in redundancy.
Linearly scalable infrastructure—each piece of infrastructure added contributes its full capacity
of CPU, memory, and storage, so the platform scales linearly with the hardware.
Programmable APIs—all modules of data processing are driven by programmable application
programming interfaces (APIs), which allow for parallel processing without the limitations
imposed by concurrency. The same data can be processed across systems for different
purposes, or the same logic can be processed across different systems. Several case studies
document these techniques.
High-speed replication—data can be replicated at high speeds across the network.
High availability—data and the infrastructure are always available and accessible to users.
Localized processing of data and storage of results—the ability to process data and store results
locally, meaning compute and storage occur on the same disks within the storage architecture.
This requires replicated copies of the data to be kept across disks, so that processing can run
wherever a copy resides (the sketch after this list illustrates this idea).
Fault tolerance—with extensive replication and distributed processing, the work of failed
systems can be rebalanced across surviving nodes with relative ease, as web users and
applications demand.
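To make the locality, replication, and fault-tolerance ideas concrete, here is a minimal Python sketch of block placement and locality-aware scheduling. It is an illustrative toy model, not the API of Hadoop or any other platform discussed in this chapter; the names (Node, place_block, run_local) and the replication factor are assumptions made for the example.

import random

# Toy model of block placement, locality-aware scheduling, and failover.
# All names here are illustrative inventions for this sketch, not the API
# of Hadoop or any other real platform.

REPLICATION_FACTOR = 3  # assumed here; HDFS also defaults to 3 replicas

class Node:
    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block_id -> data held on this node's local disks
        self.alive = True

    def process(self, block_id, fn):
        # Localized processing: the computation runs on the node that already
        # stores the block, and the result is stored locally next to the data.
        result = fn(self.blocks[block_id])
        self.blocks[block_id + ":result"] = result
        return result

def place_block(nodes, block_id, data):
    # Replicate each block onto REPLICATION_FACTOR distinct nodes.
    for node in random.sample(nodes, REPLICATION_FACTOR):
        node.blocks[block_id] = data

def run_local(nodes, block_id, fn):
    # Prefer any live node that already holds the block (data locality).
    # A failed replica is simply skipped: fault tolerance through replication.
    for node in nodes:
        if node.alive and block_id in node.blocks:
            return node.process(block_id, fn)
    raise RuntimeError("no live replica of " + block_id)

if __name__ == "__main__":
    cluster = [Node("node%d" % i) for i in range(5)]
    place_block(cluster, "blk-1", "to be or not to be")
    cluster[0].alive = False  # simulate a node failure
    print(run_local(cluster, "blk-1", lambda text: len(text.split())))

Because every block has several replicas, losing a node only redirects the work to another copy; at scale, this is the rebalancing behavior described in the list above.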
With the features and capabilities discussed here, the limitations of distributed data processing with
relational databases are no longer a real barrier. The new-generation architecture has created a scalable
and extensible data processing environment for web applications and has been widely adopted by
companies that use web platforms. Over the last decade many of these technologies have been committed
back to the open-source community for further development by innovators across the world (see the
Apache Software Foundation website, apache.org, for committers across projects). The new-generation
data processing platforms, including Hadoop, Hive, HBase, Cassandra, MongoDB, CouchDB, Redis,
Neo4j, DynamoDB, and more, are all products of these architectural pursuits, and are discussed in this chapter.
There is a continuum of technology development in this direction; by the time you finish this chapter,
there will be newer developments, which can be found on the companion website for this book
( http://booksite.elsevier.com/9780124058910 ).
Big Data processing requirements
What is unique about Big Data processing? What makes it different, or what mandates new thinking? To
understand this better, let us look at the underlying requirements. We can classify the requirements
according to the five main characteristics of Big Data:
Volume:
Size of the data to be processed is large—it needs to be broken into manageable chunks.
Data needs to be processed in parallel across multiple systems.
Data needs to be processed across several program modules simultaneously (a short sketch of this
chunk-and-process pattern follows the list).
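To make the volume requirement concrete, below is a minimal sketch of the chunk-and-process pattern in Python. The chunk size, function names, and sample data are assumptions chosen for the example; they stand in for the block sizes and program modules of a real platform.

from multiprocessing import Pool

CHUNK_SIZE = 4  # lines per chunk, assumed for the example; real platforms
                # split data into blocks of tens to hundreds of megabytes

def chunks(lines, size):
    # Break a large dataset into manageable, independently processable pieces.
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

def count_words(chunk):
    # One program module applied to one chunk; each call runs in its own
    # worker process, so chunks are processed in parallel.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    data = ["record %d with a few words" % i for i in range(20)]
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, list(chunks(data, CHUNK_SIZE)))
    print(sum(partials))  # combine the partial results from all workers

Because each chunk is independent, adding worker processes (or machines) increases throughput without changing the processing logic, which is the linear-scalability property described earlier.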