Big Data, Data Warehouses, and Business Intelligence Systems - Database Processing: Fundamentals, Design, and Implementation


An SQL join statement can be written to create a view showing products that have appeared together in a transaction. That view can then be processed to compute support, and the support view can then be processed to compute confidence and lift.
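
The three measures can be illustrated with a minimal sketch. It uses plain Python rather than the SQL views described above, and the transaction data and the function name basket_metrics are hypothetical, chosen only for illustration. Support is taken as the fraction of transactions containing both products, confidence as support divided by the fraction containing the first product, and lift as confidence divided by the fraction containing the second product.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the products bought together.
transactions = [
    {"mask", "fins"},
    {"mask", "fins", "snorkel"},
    {"mask"},
    {"fins", "snorkel"},
]


def basket_metrics(transactions):
    """Return {(a, b): (support, confidence, lift)} for each product pair."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        item_counts.update(basket)
        # Sorting gives each pair one canonical (a, b) ordering.
        pair_counts.update(combinations(sorted(basket), 2))

    metrics = {}
    for (a, b), count in pair_counts.items():
        support = count / n                          # P(a and b)
        confidence = support / (item_counts[a] / n)  # P(b given a)
        lift = confidence / (item_counts[b] / n)     # P(b given a) / P(b)
        metrics[(a, b)] = (support, confidence, lift)
    return metrics


metrics = basket_metrics(transactions)
```

With this sample data, the pair ("fins", "mask") appears in two of four transactions, so its support is 0.5; its confidence is about 0.67 and its lift about 0.89. A lift below 1 suggests the two products appear together less often than chance alone would predict.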

A distributed database is a database that is stored and processed on more than one computer. A replicated database is one in which multiple copies of some or all of the database are stored on different computers. A partitioned database is one in which different pieces of the database are stored on different computers. A distributed database can be both replicated and partitioned.
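
As a rough illustration of the difference, the sketch below assigns each row key to one computer (partitioning) and optionally to additional computers (replication). The node names, the hash-based assignment, and the function names are assumptions made for this example; they are not taken from the text, and real products use more elaborate placement schemes.

```python
import zlib

# Hypothetical computers that together hold the distributed database.
NODES = ["node0", "node1", "node2"]


def partition_for(key, nodes=NODES):
    """Pick the index of the node that owns this key's partition."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % len(nodes)


def placement(key, replicas=1, nodes=NODES):
    """Return the nodes that store the key.

    replicas=1 is a purely partitioned layout; replicas=len(nodes) is a
    fully replicated one; values in between are partitioned and replicated.
    """
    first = partition_for(key, nodes)
    return [nodes[(first + i) % len(nodes)] for i in range(replicas)]


partitioned_only = placement("customer:42")
partitioned_and_replicated = placement("customer:42", replicas=2)
```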

Distributed databases pose processing challenges. If a database is updated on a single computer, then the challenge is simply to ensure that the copies of the database are logically consistent when they are distributed. However, if updates are to be made on more than one computer, the challenges become significant. If the database is partitioned and not replicated, then challenges occur if transactions span data on more than one computer. If the database is replicated and if updates occur to the replicated portions, then a special locking algorithm called distributed two-phase locking is required. Implementing this algorithm can be difficult and expensive.
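
The core idea behind two-phase locking across replicas can be sketched in a single process: an update first acquires a lock on every copy (the growing phase), applies the change, and only then releases the locks (the shrinking phase). This is a simplified illustration, not the full algorithm. The Replica class and replicated_update function are hypothetical, and the sketch ignores network messages, node failures, and commit coordination, which account for much of the difficulty in practice.

```python
import threading


class Replica:
    """Hypothetical stand-in for one computer's copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.lock = threading.Lock()


def replicated_update(replicas, key, value):
    """Write key=value to every replica under two-phase locking."""
    # Always lock in the same order so concurrent updates cannot deadlock.
    ordered = sorted(replicas, key=lambda r: r.name)
    acquired = []
    try:
        # Growing phase: obtain every lock before changing anything.
        for replica in ordered:
            replica.lock.acquire()
            acquired.append(replica)
        # All copies are locked, so they change together.
        for replica in ordered:
            replica.data[key] = value
    finally:
        # Shrinking phase: release locks; none are acquired after this point.
        for replica in reversed(acquired):
            replica.lock.release()


replicas = [Replica("b"), Replica("a"), Replica("c")]
replicated_update(replicas, "balance", 100)
```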

Objects consist of methods and properties (data values). All objects of a given class have the same methods, but they have different property values. Object persistence is the process of storing object property values. Relational databases are difficult to use for object persistence. Some specialized products called object-oriented DBMSs were developed in the 1990s but never gained commercial acceptance. Oracle and others have extended the capabilities of their relational DBMS products to provide support for object persistence. Such databases are referred to as object-relational databases.
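
To make object persistence concrete, the following sketch stores the property values of a simple object in a relational table and rebuilds the object from the stored row. It uses Python's built-in sqlite3 module with an in-memory database; the Customer class, its properties, and the table layout are hypothetical. Even in this small case the mapping between object and row has to be written by hand, which hints at why the approach becomes awkward for objects with nested or collection-valued properties.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class Customer:
    """Hypothetical class; its property values are what gets persisted."""
    customer_id: int
    name: str
    email: str


def save(conn, customer):
    """Store the object's property values as one row."""
    conn.execute(
        "INSERT OR REPLACE INTO customer (customer_id, name, email) "
        "VALUES (?, ?, ?)",
        astuple(customer),
    )


def load(conn, customer_id):
    """Rebuild the object from its stored row, or return None if absent."""
    row = conn.execute(
        "SELECT customer_id, name, email FROM customer WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return Customer(*row) if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
save(conn, Customer(1, "Ada", "ada@example.com"))
restored = load(conn, 1)
```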

The NoSQL movement (now often read as “not only SQL”) is built upon the need to meet the Big Data storage needs of companies such as Amazon.com, Google, and Facebook. The tools used to do this are nonrelational DBMSs known as structured storage. Early examples were Dynamo and Bigtable; a more recent popular example is Cassandra. These products use a non-normalized table structure built on columns, super columns, and column families tied together by rowkey values from a keyspace. Data processing of the very large data sets found in Big Data is done by the MapReduce process, which breaks a data processing task into many parallel tasks done by many computers in the cluster and then combines these results to produce a final result. An emerging product that is supported by Microsoft and Oracle Corporation is the Hadoop Distributed File System (HDFS), with its spinoffs HBase, a nonrelational storage component, and Pig, a query language.
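
The MapReduce pattern described above can be sketched on a single machine with a word count. A thread pool stands in for the computers of a cluster, and the function names and sample input are hypothetical. This is not the Hadoop API; it only illustrates the map, group, and reduce steps.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(item):
    """Reduce step: combine all values for one key into a single result."""
    key, values = item
    return key, sum(values)


def word_count(chunks):
    """Run map tasks in parallel, group by word, then reduce in parallel."""
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
        return dict(pool.map(reduce_group, shuffle(mapped).items()))


# Each inner list plays the role of the data held by one computer.
counts = word_count([["big data", "data"], ["Big cluster"]])
```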

Key Terms

Amazon Web Services (AWS)

Big Data

Bigtable

business intelligence (BI) system

Cassandra

click-stream data

cloud computing

conformed dimension

curse of dimensionality

data mart

data mining application

data warehouse

data warehouse metadata database

date dimension

dimension table

dimensional database

distributed database

distributed two-phase locking

dirty data

drill down

Dynamo

DynamoDB database service

EC2 service

enterprise data warehouse (EDW) architecture

Extract, Transform, and Load (ETL) System

F score

Hadoop Distributed File System (HDFS)

HBase

host machine

hypervisor

fact table

M score

measure

method

MapReduce

nonintegrated data

NoSQL

Not only SQL

object

object-oriented DBMS (OODBMS)

object-oriented programming (OOP)
