Returning to the example discussed above, we see that a columnar database would eliminate not only 43 days of data but also 28 columns of data. By returning only the columns for toasters and units sold, the columnar database would fetch just 14 million data elements, or 93 percent less data. Because they return so much less data, columnar databases are much faster than row-based databases when analyzing large data sets.
Combination/Workload Challenges
The issue here is combining high performance for individual queries with similarly high performance across multiple queries and query types, some of which may be very short running, others long running, and anything in between. Operating at big data scale means the environment must handle exactly this combination of workloads. From our discussion of columnar databases, we can see that there is a clear architectural benefit to a column-based approach: you don't have to worry about the performance of individual queries and can focus your design efforts on ensuring high performance across the potentially (tens of) thousands of queries that may be running at any one time. This is not to say that it is impossible to achieve this query performance with a traditional row-based approach, but the challenge is much greater.
Unpredictable Queries: A column is equivalent to an index, but without any of the overhead incurred by having to define an index; it is as if you had a conventional database with an index on every column. Suffice it to say, therefore, that if you are undertaking exploratory analysis using unpredictable queries, these should run just as quickly as predictable ones when using a column-based approach. Moreover, all sorts of queries (with the exception of row-based look-up queries) will run faster than with a traditional approach, all other things being equal, precisely because of the reduced I/O overhead.
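To make this concrete, here is a minimal sketch in Python of a column store laid out as one list per column; the table, column names, and data are all invented for illustration. Because each column is stored separately, a predicate on any column, indexed or not, scans just that column, and only the selected columns are then fetched for the matching rows.

# Minimal columnar layout: one Python list per column, so a scan touches
# only the columns a query actually names. Table and data are invented.
sales = {
    "product":    ["toaster", "kettle", "toaster", "toaster", "kettle"],
    "region":     ["north", "south", "north", "west", "south"],
    "units_sold": [120, 85, 200, 45, 150],
    "revenue":    [2400.0, 1700.0, 4000.0, 900.0, 3000.0],
}

def query(table, predicate_col, predicate, select_cols):
    # Step 1: scan the single predicate column for matching row positions.
    matches = [i for i, v in enumerate(table[predicate_col]) if predicate(v)]
    # Step 2: materialize only the selected columns at those positions.
    return [{c: table[c][i] for c in select_cols} for i in matches]

# An "unpredictable" ad hoc query: no index was ever defined on units_sold,
# yet the filter reads nothing beyond the units_sold column itself.
print(query(sales, "units_sold", lambda v: v > 100, ["product", "revenue"]))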
Complex Queries: Complex queries tend to be slow or, in some cases, simply not achievable: not because of their complexity per se, but because they combine elements of unpredictable queries with time-based or quantitative/qualitative queries, and they frequently require whole-table scans. Column-based approaches make complex queries feasible precisely because they perform well in all of these areas.
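As an illustration of the same point (a hedged sketch in the same style as above, not any particular engine's implementation), a complex query combining a time-based predicate with an aggregation still reads only the columns it names, however many others the table holds.

from collections import defaultdict
from datetime import date

# Invented columnar table: orders stored column by column. Dozens of other
# columns could sit alongside these; the query below would never read them.
orders = {
    "order_date": [date(2023, 7, 3), date(2023, 6, 12), date(2023, 7, 21),
                   date(2023, 7, 30), date(2023, 8, 2)],
    "product":    ["toaster", "kettle", "toaster", "kettle", "toaster"],
    "units_sold": [10, 4, 7, 3, 9],
}

# Complex query: total units sold per product for orders placed in July.
# Step 1: the time-based predicate scans only the order_date column.
july_rows = [i for i, d in enumerate(orders["order_date"]) if d.month == 7]

# Step 2: the aggregation touches only the two columns it names.
totals = defaultdict(int)
for i in july_rows:
    totals[orders["product"][i]] += orders["units_sold"][i]

print(dict(totals))  # {'toaster': 17, 'kettle': 3}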
Large Table Scans: Queries are usually interested in only a limited subset of the data in each row. With a traditional approach, however, it is necessary to read each row in its entirety, which is wasteful in the extreme. Column-based approaches read only the relevant data from each column.
Take the following query as an example: list the full name and email address of every customer born in July. If each row consists of 3,200 bytes and there are ten million rows, then the total read requirement for a conventional relational database is 32,000,000,000 bytes. If we assume that the date-of-birth field is four bytes and that the full name and email address are 25 bytes each, then a column-based approach needs to read only 54 bytes from each row, for a total read requirement of 540,000,000 bytes. That is a 59.26-fold reduction, and this is before we take other factors into account, so it is hardly surprising that column-based approaches provide dramatically improved performance.
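The arithmetic is easy to check; this short Python sketch simply reproduces the figures from the example above.

rows = 10_000_000

# Row-based: every row is read in its entirety, 3,200 bytes at a time.
row_store_read = rows * 3_200              # 32,000,000,000 bytes

# Column-based: only date of birth (4 bytes), full name (25 bytes) and
# email address (25 bytes) are read, i.e. 54 bytes per row.
column_store_read = rows * (4 + 25 + 25)   # 540,000,000 bytes

print(f"{row_store_read:,} vs {column_store_read:,} bytes")
print(f"reduction: {row_store_read / column_store_read:.2f}x")  # 59.26x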
 