Building a NoSQL-Based Web App to Collect Crowd-Sourced Data - Data Just Right: Introduction to Large-Scale Data and Analytics


relational database. The relational database concept owes everything to the work of Edgar F. Codd, a former Royal Air Force pilot and World War II veteran.

Codd has a unique place in computer science history. As digital computing was being developed in the 1960s, early databases were structured in a hierarchical manner, meaning that data was organized as a collection of parent-child relationships. This type of database was sometimes convenient for the application storing the data, especially when the data itself was inherently hierarchical (for example, a classification of plant species). However, when the underlying data expressed complex relationships or had no inherent hierarchy, modeling it with a hierarchical database was very clumsy. The biggest problem was the lack of a feature that we take for granted today: free-form search capabilities. To traverse data stored in this manner, a user had to know the hierarchical structure. It became clear that a more flexible and generalizable model was necessary to make sense of ever-growing datasets.
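
The traversal problem can be illustrated with a minimal sketch. It assumes a nested Python dictionary as a stand-in for a hierarchical store; the simplified plant taxonomy and the helper names `lookup` and `find_anywhere` are hypothetical and chosen only for illustration.

```python
# Hypothetical hierarchical store: every record sits under a fixed parent-child path.
# The taxonomy is deliberately simplified (kingdom -> order -> family -> species).
taxonomy = {
    "Plantae": {
        "Rosales": {
            "Rosaceae": ["Rosa canina", "Malus domestica"],
        },
        "Fagales": {
            "Fagaceae": ["Quercus robur"],
        },
    },
}


def lookup(tree, path):
    """Follow a known parent-child path from the root down to a node."""
    node = tree
    for key in path:
        node = node[key]
    return node


def find_anywhere(tree, target, trail=()):
    """Search without knowing the path: branches are walked until a match is found."""
    if isinstance(tree, list):
        return trail if target in tree else None
    for key, child in tree.items():
        found = find_anywhere(child, target, trail + (key,))
        if found is not None:
            return found
    return None


# Easy when the caller already knows the structure.
rosaceae = lookup(taxonomy, ["Plantae", "Rosales", "Rosaceae"])

# A free-form question ("where is this species?") requires scanning the tree.
oak_path = find_anywhere(taxonomy, "Quercus robur")
```

A lookup by path is cheap but demands knowledge of the hierarchy, while a question that cuts across the hierarchy has to be answered by walking it.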

Codd's relational model is currently so ubiquitous that the concept is well known even to the casual database user, but let's revisit the basic characteristics. Codd proposed that each record of data be described using tuples, which are discrete sets of values that can be individually referenced by a unique identifier. In many applications, tuples are simply ordered lists of values, and each value can be retrieved by referencing a position in the list. In most programming languages, tuples are zero-based, meaning that the first element is referenced by “0,” the second by “1,” and so on. In Codd's relational database model, each element in the database record is accessed not by number, but by a name known as an attribute. For example, if I were to store a data record of Edgar Codd's name, I could define an attribute as first_name to reference “Edgar” and another attribute as last_name to store “Codd.” Tuples of the same type can be organized into tables, which can then be cross-referenced to each other based on an existing relationship.
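
The difference between positional and named access can be sketched in a few lines of Python. The attribute names first_name and last_name come from the example above; the type name `Person` is an assumption made for this sketch.

```python
from collections import namedtuple

# Positional access: the reader must know that index 0 holds the first name.
record = ("Edgar", "Codd")
first_by_position = record[0]

# Named access: each value is reached through an attribute name instead.
# "Person" is a hypothetical type name, not something defined in the text.
Person = namedtuple("Person", ["first_name", "last_name"])
codd = Person(first_name="Edgar", last_name="Codd")
first_by_name = codd.first_name
```

Both forms hold the same values; the named form simply removes the need to remember which position means what.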

A key component of the success of the relational database model is the idea of normalization: In Codd's view, each unit of data should exist only once in a single table. This cuts down on redundancy and storage costs. More importantly, normalization makes it possible to keep data consistent by having to change values only in a single location. In Codd's system, a column in each table can be designated as a primary key, which is an attribute value that is used to retrieve a single record unambiguously. Primary keys could then be used to link related records across tables by means of some type of syntactic query. The Structured Query Language (SQL) was later created by other IBM researchers to express relational queries. Codd's concept provides the ability to ask a variety of questions about the data in various tables so long as the data can be related in some way.
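
Normalization and primary keys can be sketched with Python's built-in sqlite3 module and an in-memory database. This is an illustrative sketch under stated assumptions, not a listing from the book: the table layout, the `id` and `iso_code` columns, and the sample rows are all hypothetical, and the country code is entered incorrectly on purpose so that it can be corrected in one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: each country is stored once and people refer to it by
# its primary key. SQLite only enforces REFERENCES when foreign keys are
# switched on, which this sketch does not need.
conn.executescript("""
    CREATE TABLE countries (
        name     TEXT PRIMARY KEY,
        iso_code TEXT NOT NULL
    );
    CREATE TABLE people (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        country    TEXT NOT NULL REFERENCES countries(name)
    );
""")

conn.executemany(
    "INSERT INTO countries (name, iso_code) VALUES (?, ?)",
    [("United Kingdom", "UK"), ("United States", "US")],  # "UK" is a deliberate mistake
)
conn.executemany(
    "INSERT INTO people (first_name, last_name, country) VALUES (?, ?, ?)",
    [
        ("Edgar", "Codd", "United Kingdom"),
        ("Alan", "Turing", "United Kingdom"),
        ("Donald", "Chamberlin", "United States"),
    ],
)

# The code lives in exactly one row, so a single UPDATE corrects it everywhere.
conn.execute("UPDATE countries SET iso_code = 'GB' WHERE name = 'United Kingdom'")

# A primary key retrieves one record unambiguously.
first_person = conn.execute(
    "SELECT first_name, last_name FROM people WHERE id = ?", (1,)
).fetchone()

# A join relates the two tables through the country's primary key.
rows = conn.execute("""
    SELECT p.first_name, p.last_name, c.iso_code
    FROM people AS p
    JOIN countries AS c ON c.name = p.country
    ORDER BY p.id
""").fetchall()
```

Because the code is stored once in the hypothetical countries table, every person linked to that country sees the corrected value through the join, with no need to touch the people table.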

Let's take a look at a simple example of using a relational table in Listing 3.1. Our example database holds two types of values: identities of computer scientists and information about countries. Unless some very sweeping changes happen at the United Nations, we can assume that each record in the “countries” table is unique, so we can treat the country name as a primary key. On the other hand, it's possible for two humans to have exactly the same name. Therefore, in our “people” table, we need to
