Sharding: This is all about selectively organizing a particular set of data on different
nodes. Once you have data in your data store, different applications and data analysts access
different parts of the data set. In such situations, you can introduce horizontal scalability
by selectively putting different parts of the data set onto different servers. When the user
accesses specific data elements, their queries hit only the designated server. As a result,
they get rapid responses!
However, there is one drawback to this approach. If your query consists of data sets
distributed over several nodes, how do you aggregate these different data sets? This is a
design consideration you need to acknowledge while distributing data over several nodes.
You need to understand the query patterns first and then design the data distribution
in such a manner that data that is commonly accessed together is kept on a single node.
This helps in improving query performance.
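To make the aggregation cost concrete, here is a minimal sketch of a query whose data is spread over several nodes. The node names, the `SHARDS` layout, and the `scatter_gather_total` helper are hypothetical, invented only for illustration; plain dictionaries stand in for real servers.

```python
# Hypothetical layout: orders are spread over nodes without regard to customer,
# so one customer's rows can live on more than one node.
SHARDS = {
    "node_a": [{"customer": "Adams", "total": 40}, {"customer": "Brown", "total": 25}],
    "node_b": [{"customer": "Garcia", "total": 10}],
    "node_c": [{"customer": "Smith", "total": 30}, {"customer": "Adams", "total": 5}],
}


def scatter_gather_total(shards, customer):
    """Ask every node for a partial sum, then combine the partial results."""
    partials = {}
    for node, rows in shards.items():
        # "Scatter": each node computes its own partial answer.
        partials[node] = sum(r["total"] for r in rows if r["customer"] == customer)
    # "Gather": the caller merges the partial answers into one result.
    return sum(partials.values()), len(partials)


total, nodes_touched = scatter_gather_total(SHARDS, "Adams")
print(total, nodes_touched)
```

Every node has to be contacted even though only two of them hold rows for this customer. If all of a customer's orders were kept on one node, the same query would touch a single node and need no merge step.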
For example, if you know that most accesses of certain data sets are based on a physical
location, you can place that data close to the location where it's being accessed. Or if you
see that most of the query patterns are around customers' surnames, then you might put all
customers with surnames starting from A to E on one node, F to J on another node, and so on.
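The surname example can be sketched as a small routing function. Only the A to E and F to J ranges come from the text; the remaining boundaries, the node names, and the `shard_for_surname` helper are assumptions made for illustration.

```python
import bisect

# Last letter of each range. A-E and F-J follow the text; "O" and the
# catch-all fourth node are assumed here to complete the alphabet.
BOUNDARIES = ["E", "J", "O"]
NODES = ["node_1", "node_2", "node_3", "node_4"]


def shard_for_surname(surname: str) -> str:
    """Return the (hypothetical) node responsible for a surname."""
    initial = surname.strip()[:1].upper()
    if not initial.isalpha():
        raise ValueError(f"cannot route surname {surname!r}")
    # bisect_left finds the first range whose last letter is >= the initial.
    return NODES[bisect.bisect_left(BOUNDARIES, initial)]


for name in ("Adams", "Evans", "Fisher", "Jones", "Zhang"):
    print(name, "->", shard_for_surname(name))
```

A range scheme like this keeps alphabetically adjacent customers together, which suits range queries, but the ranges may need rebalancing if some initials are far more common than others.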
Sharding greatly improves read and write performance; however, it does little
to improve resilience when used alone. Because the data is on different nodes,
a node failure makes only that part of the data unavailable; only the users of the data on
that shard will have issues, and the rest of the users are not impacted.
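The failure behavior of sharding without replication can be illustrated with a toy model. The two surname ranges, the `up` flag, and the `lookup` helper are hypothetical; no real data store is involved.

```python
class ShardUnavailable(Exception):
    """Raised when the node holding the requested shard is down."""


# Hypothetical two-shard layout with no replicas.
shards = {
    "A-M": {"up": True, "data": {"Adams": "adams@example.com"}},
    "N-Z": {"up": True, "data": {"Smith": "smith@example.com"}},
}


def lookup(surname):
    name = "A-M" if surname[:1].upper() <= "M" else "N-Z"
    shard = shards[name]
    if not shard["up"]:
        # There is no other copy to fall back on.
        raise ShardUnavailable(f"shard {name} is down")
    return shard["data"].get(surname)


shards["N-Z"]["up"] = False  # simulate a node failure

print(lookup("Adams"))  # the A-M shard is unaffected
try:
    lookup("Smith")
except ShardUnavailable as exc:
    print("failed:", exc)
```

The failure is contained to one shard, but the data on that shard stays unreachable until the node returns, because no other node holds a copy.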
Combining Sharding with Replication: Replication and sharding are two orthogonal
techniques for data distribution, which means that in your data design you can
use either approach or both. If you use both, you are essentially
taking the sharding approach, but for each shard you appoint a master node
(thus ensuring write consistency); the rest are all slaves with copies of the data items
(thus ensuring scalable read operations).
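A minimal in-memory sketch of the combined approach follows. The `Shard` and `ShardedStore` classes, the routing rule, and the synchronous copy to the slaves are all assumptions chosen to keep the example short; real systems often replicate asynchronously and route differently.

```python
import itertools


class Shard:
    """One shard: a master that takes writes plus slave copies that serve reads."""

    def __init__(self, slave_count: int = 2):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self._next_slave = itertools.cycle(range(slave_count))

    def write(self, key, value):
        # Every write goes through the single master, which fixes the order of
        # writes; the value is then copied to each slave (synchronously here,
        # as a simplification).
        self.master[key] = value
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        # Reads rotate across the slaves so that read load is spread out.
        return self.slaves[next(self._next_slave)].get(key)


class ShardedStore:
    """Routes each key to one shard; each shard is replicated internally."""

    def __init__(self, shard_count: int = 2):
        self.shards = [Shard() for _ in range(shard_count)]

    def _shard_for(self, key: str) -> Shard:
        # Hypothetical routing rule: sum of the key's bytes modulo shard count.
        return self.shards[sum(key.encode("utf-8")) % len(self.shards)]

    def write(self, key, value):
        self._shard_for(key).write(key, value)

    def read(self, key):
        return self._shard_for(key).read(key)


store = ShardedStore()
store.write("adams", {"city": "Leeds"})
print(store.read("adams"))
```

Sharding decides which group of nodes owns a key, and replication decides how many copies exist inside that group, which is why the two techniques can be combined freely.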
The Relational Database and the Non-Relational Database
On a broad level, we can assume that there are two specific kinds of databases: the
relational database and the “non-relational” database. There are several definitions and
interpretations of what the characteristics of these two types of databases are.
Let's first define what structured data is and what unstructured data is. These definitions
weigh heavily on the characteristics of RDBMS and non-RDBMS systems.
Structured Data: Structured data has an explicit structure for its data elements.
In other words, metadata exists for every data element, and how it will be stored
and accessed through SQL-based commands or other programming constructs is
clearly defined.
Unstructured Data: Unstructured data comprises all other data that falls outside the
definition of structured data. Its structure is not explicitly declared in a schema. In some
cases, as with natural language, the structure may need to be discovered.
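The contrast can be sketched with Python's built-in `sqlite3` module. The `customer` table, its columns, and the sample note are hypothetical; the point is only that the structured side carries metadata that can be queried, while the free text does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structured: the schema is declared up front and stored as metadata.
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, surname TEXT NOT NULL, city TEXT)"
)
conn.execute("INSERT INTO customer (surname, city) VALUES (?, ?)", ("Adams", "Leeds"))

# The metadata itself is queryable: column names and declared types.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]
print(columns)

# Because the structure is known, access is a plain SQL query.
print(conn.execute("SELECT surname FROM customer WHERE city = ?", ("Leeds",)).fetchone())

# Unstructured: free text has no declared schema. Which words are a surname
# or a city has to be discovered, for example by language processing.
note = "Spoke with Ms. Adams from Leeds about her delayed order."
print(note.split())
```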
The Relational Database (RDBMS): A relational database stores data in tables and
predominantly uses SQL-based commands to access the data. Mostly, the data structures
and resulting data models take the third normal form (3NF) structure. In practice, the
data model is a set of tables and relationships between them, which are expressed in terms
 