Emerging Database Landscape - Big Data Imperatives

Sharding: This is all about selectively organizing a particular set of data on different

nodes. Once you have data in your data store, different applications and data analysts access

different parts of the data set. In such situations, you can introduce horizontal scalability

by selectively putting different parts of the data set onto different servers. When the user

accesses specific data elements, their queries hit only the designated server. As a result,

they get rapid responses!

However, there is one drawback to this approach. If a query needs data sets that are

distributed over several nodes, how do you aggregate these different data sets? This is a

design consideration you need to acknowledge while distributing data over several nodes.

You need to understand the query patterns first and then design the data distribution

in such a manner that data that is commonly accessed together is kept on a single node.

This helps in improving query performance.
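
The difference between a query that stays on one node and a query that spans nodes can be sketched as follows. This is a minimal illustration, not a real data store: the `SHARDS` dictionary, the node names, and the order totals are all made-up stand-ins for shard servers and their data.

```python
# Hypothetical in-memory stand-ins for three shard nodes; each maps a
# customer surname to an order total. Names and figures are invented.
SHARDS = {
    "node-1": {"Adams": 120, "Evans": 80},
    "node-2": {"Fisher": 200, "Jones": 50},
    "node-3": {"Young": 75},
}


def single_shard_total(node: str, surname: str) -> int:
    """A query that touches one node: no aggregation step is needed."""
    return SHARDS[node].get(surname, 0)


def scatter_gather_total() -> int:
    """A query that spans every node: collect partial sums, then combine them."""
    # Scatter: each shard computes a partial result over its own data.
    partial_sums = [sum(rows.values()) for rows in SHARDS.values()]
    # Gather: the caller has to merge the partial results itself.
    return sum(partial_sums)
```

The second function is the extra work the drawback refers to; keeping data that is queried together on one node avoids it for the common queries.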

For example, if you know that most accesses of certain data sets are based on a physical

location, you can place that data close to the location where it's being accessed. Or if you

see that most of the query patterns are around customers' surnames, then you might put all

customers with surnames starting from A to E on one node, F to J on another node, and so on.
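
A surname-range scheme like this can be sketched as a small routing function. The A to E and F to J ranges come from the text; the remaining ranges, the node names, and the function name `shard_for_surname` are assumptions made for illustration only.

```python
import bisect

# Hypothetical shard layout: (last initial covered, node name).
# Only the first two ranges come from the text; the rest are assumed.
SHARD_RANGES = [
    ("E", "node-1"),  # A to E
    ("J", "node-2"),  # F to J
    ("O", "node-3"),  # K to O
    ("T", "node-4"),  # P to T
    ("Z", "node-5"),  # U to Z
]
UPPER_BOUNDS = [upper for upper, _ in SHARD_RANGES]


def shard_for_surname(surname: str) -> str:
    """Return the node that holds customers with this surname."""
    initial = surname.strip()[:1].upper()
    if not ("A" <= initial <= "Z"):
        raise ValueError(f"cannot route surname {surname!r}")
    # bisect_left finds the first range whose upper bound is >= the initial.
    index = bisect.bisect_left(UPPER_BOUNDS, initial)
    return SHARD_RANGES[index][1]
```

With this layout, a lookup for "Adams" or "Evans" is routed to the first node and a lookup for "Fisher" to the second, so each query hits only the designated server.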

Sharding greatly improves read and write performance; however, it does little

to improve resilience when used alone. Although the data is spread over different nodes,

a node failure still makes that part of the data unavailable. Only the users of the data on

that shard will have issues; the rest of the users are not impacted.

Combining Sharding with Replication: Replication and sharding are two orthogonal

techniques for data distribution, which means that in your data design you can

use either approach or both. If you use both, you are essentially

taking the sharding approach, but for each shard you appoint a master node

(thus ensuring write consistency); the rest are all slaves with copies of the data items

(thus ensuring scalable read operations).
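
The combination can be sketched with in-memory dictionaries standing in for nodes. Everything here is an illustrative assumption: the class names `ReplicatedShard` and `ShardedStore`, the two-way split by surname initial, the replica count, and the synchronous copy to the slaves (real systems often replicate asynchronously).

```python
import itertools


class ReplicatedShard:
    """One shard: a single master plus slave copies, all held in memory."""

    def __init__(self, replica_count: int = 2):
        self.master = {}
        self.slaves = [{} for _ in range(replica_count)]
        self._next_slave = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Every write goes through the master, which keeps writes consistent.
        self.master[key] = value
        # Simplification: copy to all slaves immediately.
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        # Reads rotate across the slaves, spreading the read load.
        return self.slaves[next(self._next_slave)].get(key)


class ShardedStore:
    """Routes each surname to one shard; each shard is itself replicated."""

    def __init__(self):
        # Hypothetical split point between the two shards.
        self.shards = {"A-M": ReplicatedShard(), "N-Z": ReplicatedShard()}

    def _shard(self, surname: str) -> ReplicatedShard:
        return self.shards["A-M" if surname[:1].upper() <= "M" else "N-Z"]

    def write(self, surname, record):
        self._shard(surname).write(surname, record)

    def read(self, surname):
        return self._shard(surname).read(surname)
```

Sharding decides which group of nodes owns a record, and replication decides how many copies exist inside that group, which is why the two techniques can be chosen independently.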

The Relational Database and the Non-Relational Database

On a broad level, we can assume that there are two specific kinds of databases: the

relational database and the “non-relational” database. There are several definitions and

interpretations of what the characteristics of these two types of databases are.

Let's first define what structured data is and what unstructured data is. These definitions

heavily weigh into the characteristics of RDBMS and non-RDBMS systems.

Structured Data: Structured data contains an explicit structure of the data elements.

In other words, metadata exists for every data element, and how each element is stored

and accessed through SQL-based commands or other programming constructs is

clearly defined.

Unstructured Data: Unstructured data is all other data that falls outside the

definition of structured data. Its structure is not explicitly declared in a schema. In some

cases, as with natural language, the structure may need to be discovered.
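
The contrast can be illustrated with a small sketch that uses Python's built-in `sqlite3` and `re` modules. The `customer` table, its columns, the sample note, and the regular expression are all invented for this example; the pattern only handles this one phrasing and is not a general way to extract meaning from text.

```python
import re
import sqlite3

# Structured: the schema declares each element's name and type up front,
# so the data can be stored and queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, surname TEXT, city TEXT)"
)
conn.execute(
    "INSERT INTO customer (surname, city) VALUES (?, ?)", ("Adams", "Boston")
)
structured_city = conn.execute(
    "SELECT city FROM customer WHERE surname = ?", ("Adams",)
).fetchone()[0]

# Unstructured: no schema exists, so the structure has to be discovered
# from the text itself (here with a deliberately naive pattern).
note = "Spoke with Ms. Adams today; she has moved to Boston."
match = re.search(r"moved to (\w+)", note)
discovered_city = match.group(1) if match else None
```

Both paths recover the same city, but only the first one relies on structure that was declared in advance.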

The Relational Database (RDBMS): A relational database stores data in tables and

predominantly uses SQL-based commands to access the data. Mostly, the data structures

and resulting data models take the third normal form (3NF) structure. In practice, the

data model is a set of tables and relationships between them, which are expressed in terms
