Emerging Database Landscape - Big Data Imperatives


■ Note: This advantage is not necessarily all one way. Each column you need to retrieve
must be accessed separately, whereas you can retrieve an entire row in a single read. So the
more information you need from a row, the smaller the performance advantage that a
column-based approach offers. To take a simplistic example, if you want to read a single
row then that is one read. If that row has 15 columns then that is, in theory, 15 reads in a
column-based store, so there is a trade-off between the number of rows you want to read and
the number of columns, together with the overhead of finding the rows or columns you need
to read in the first place.
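The trade-off described above can be made concrete with a deliberately simplistic cost model. The sketch below is an illustration only: the names `ReadCost`, `row_store_cost`, and `column_store_cost` are hypothetical, and the model assumes (as the text does, "in theory") that a row store issues one access per row while a column store issues one access per needed column of each row. Real engines batch and compress reads, so actual figures will differ.

```python
from dataclasses import dataclass


@dataclass
class ReadCost:
    accesses: int        # separate read operations issued
    values_touched: int  # individual column values fetched


def row_store_cost(rows: int, total_columns: int) -> ReadCost:
    # A row store fetches the whole row in one access,
    # whether or not every column is needed.
    return ReadCost(accesses=rows, values_touched=rows * total_columns)


def column_store_cost(rows: int, needed_columns: int) -> ReadCost:
    # Assumption: each needed column of each row is a separate access,
    # but columns that are not needed are never touched.
    return ReadCost(accesses=rows * needed_columns,
                    values_touched=rows * needed_columns)


# Reading one complete 15-column row favors the row store.
single_row_rowstore = row_store_cost(rows=1, total_columns=15)
single_row_colstore = column_store_cost(rows=1, needed_columns=15)

# Reading 2 of 15 columns across 1,000 rows touches far fewer values
# in the column store.
scan_rowstore = row_store_cost(rows=1000, total_columns=15)
scan_colstore = column_store_cost(rows=1000, needed_columns=2)
```

Under these assumptions the single-row lookup costs 1 access in the row store against 15 in the column store, while the narrow scan touches 15,000 values in the row store against 2,000 in the column store.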

A further consideration is that there is a class of query that can be answered directly
from an index. These are known as “count queries.” Let's take, for example, the question
posed previously: Count the married, employed customers who own a house. If you have
a row-based database, and you have appropriate indexes defined, then you can resolve
these queries without having to read the data at all. Of course, in the case of a column-based
database the data is the index (or vice versa), so you should always be able to answer
count queries in this way.
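As a rough illustration of answering a count query from indexes alone, the sketch below uses bitmap indexes, which is one common index structure and an assumption here (the text does not name a specific index type). The column data and the helper names `bitmap_index` and `count_matching` are hypothetical.

```python
def bitmap_index(column):
    """Build a bitmap (stored in a Python int) with bit i set when row i is True."""
    bitmap = 0
    for row_id, flag in enumerate(column):
        if flag:
            bitmap |= 1 << row_id
    return bitmap


def count_matching(*bitmaps):
    """Count rows whose bit is set in every bitmap, without reading any row data."""
    combined = bitmaps[0]
    for bitmap in bitmaps[1:]:
        combined &= bitmap
    return bin(combined).count("1")


# Hypothetical customer attributes, one entry per customer (row).
married_col = [True, True, True, False, True]
employed_col = [True, False, True, True, True]
owns_house_col = [True, True, False, True, True]

married_idx = bitmap_index(married_col)
employed_idx = bitmap_index(employed_col)
owns_house_idx = bitmap_index(owns_house_col)

# "Count the married, employed customers who own a house."
matching_customers = count_matching(married_idx, employed_idx, owns_house_idx)
```

Once the indexes exist, the count is computed with bitwise ANDs over the index structures only; the customer rows themselves are never consulted.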

■ Note: In a big data environment, count-type queries are common.

Time-based Queries: The issue here is not so much one of performance as of whether
relevant queries are possible at all. This is because you need not only extended SQL to
handle time-lapse queries but also the ability to store time-stamped transactions.
Neither capability is typically available in traditional RDBMS data stores. Conversely, there
are a number of column-based data stores that provide exactly such an approach.

Note that a number of use cases require capabilities of this kind, which go
beyond what conventional databases offer. For example, in telecommunications it is mandated
that companies must retain call detail records, against which relevant queries can be run,
often on a time-lapsed basis. Similarly, you will want to be able to run time-based queries
against log information (from databases, system logs, web logs, and so forth) as well as
e-mails and other corporate data that you may need for evidentiary reasons.
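To make the idea of a time-based query over time-stamped records more tangible, here is a minimal sketch, assuming call detail records are kept sorted by timestamp. The record layout, the sample data, and the function `records_between` are all hypothetical; a real data store would expose this through its own query language rather than application code.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical call detail records: (timestamp, caller, duration in seconds),
# kept sorted by timestamp.
call_records = [
    (datetime(2013, 1, 1, 9, 0), "A", 60),
    (datetime(2013, 1, 1, 9, 30), "B", 120),
    (datetime(2013, 1, 1, 10, 15), "A", 30),
    (datetime(2013, 1, 2, 8, 0), "C", 300),
]
timestamps = [record[0] for record in call_records]


def records_between(start, end):
    """Return records with start <= timestamp < end using binary search."""
    low = bisect_left(timestamps, start)
    high = bisect_left(timestamps, end)
    return call_records[low:high]


# All calls placed between 09:15 and 11:00 on 1 January.
window = records_between(datetime(2013, 1, 1, 9, 15),
                         datetime(2013, 1, 1, 11, 0))
total_duration = sum(duration for _, _, duration in window)
```

The point of the sketch is that the timestamp is stored with every record and is the primary access path, which is what makes window-style queries possible in the first place.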

Requirements for the Next Generation Data Warehouses

In order to provide the best possible performance to the largest number of users, data
warehouses are significantly pre-designed. While logically this may be a reflection of
the data model that underpins the data warehouse, in physical terms this means the
pre-building of indexes, careful partitioning of data, parallel disk striping, the development
of pre-aggregated tables, etc.
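Of the physical design techniques listed above, pre-aggregation is the easiest to show in a few lines. The following sketch is illustrative only: the fact rows, the `(region, month)` grouping, and the helpers `build_aggregate` and `total_for` are hypothetical stand-ins for a summary table that a warehouse would build at load time.

```python
from collections import defaultdict

# Hypothetical detail (fact) rows: (region, month, sale amount).
sales = [
    ("north", "2013-01", 100.0),
    ("north", "2013-01", 50.0),
    ("south", "2013-01", 75.0),
    ("north", "2013-02", 20.0),
]


def build_aggregate(rows):
    """Pre-compute total sales per (region, month), as a summary table would."""
    totals = defaultdict(float)
    for region, month, amount in rows:
        totals[(region, month)] += amount
    return dict(totals)


# Built once, ahead of query time.
monthly_totals = build_aggregate(sales)


def total_for(region, month):
    """Answer the query from the pre-aggregated table, not the detail rows."""
    return monthly_totals.get((region, month), 0.0)


north_january = total_for("north", "2013-01")
```

The design cost is the one the text implies: the aggregate must be decided on and maintained in advance, so it only helps queries that were anticipated when the warehouse was designed.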

However, from our discussions so far, we also understand that the scale of big data and
the type of workload play a significant role in database design considerations. On the basis
