Returning to the example discussed above, we see that a columnar database would eliminate not only 43 days of data but also 28 columns of data. By returning only the columns for toasters and units sold, the columnar database would fetch just 14 million data elements, or 93 percent less data. Because they return so much less data, columnar databases are much faster than row-based databases when analyzing large data sets.
Combination/Workload Challenges
The issue here is combining high performance for individual queries with similarly high performance across multiple queries and query types, some of which may be very short running, others long running, and anything in between. Operating at big data scale means the environment must handle exactly this combination of workloads. From our discussion of columnar databases, we can see that there is a clear architectural benefit to a column-based approach: you don't have to worry about the performance of individual queries and can focus your design efforts on ensuring high performance across the potentially (tens of) thousands of queries that may be running at any one time. This is not to say that it is impossible to achieve this query performance with a traditional row-based approach, but the challenge is much greater.
Unpredictable Queries: A column is equivalent to an index, but without any of the overhead incurred by having to define an index; it is as if you had a conventional database with an index on every column. Suffice it to say, therefore, that if you are undertaking exploratory analysis using unpredictable queries, these should run just as quickly as predictable ones when using a column-based approach. Moreover, all sorts of queries (with the exception of row-based look-up queries) will run faster than with a traditional approach, all other things being equal, precisely because of the reduced I/O overhead.
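To make this concrete, here is a minimal sketch in Python of a column store laid out as one list per column; the table, column names, and data are all invented for illustration. Because each column is stored separately, a predicate on any column, indexed or not, scans just that column, and only the selected columns are then fetched for the matching rows.

# Minimal columnar layout: one Python list per column, so a scan touches
# only the columns a query actually names. Table and data are invented.
sales = {
    "product":    ["toaster", "kettle", "toaster", "toaster", "kettle"],
    "region":     ["north", "south", "north", "west", "south"],
    "units_sold": [120, 85, 200, 45, 150],
    "revenue":    [2400.0, 1700.0, 4000.0, 900.0, 3000.0],
}

def query(table, predicate_col, predicate, select_cols):
    # Step 1: scan the single predicate column for matching row positions.
    matches = [i for i, v in enumerate(table[predicate_col]) if predicate(v)]
    # Step 2: materialize only the selected columns at those positions.
    return [{c: table[c][i] for c in select_cols} for i in matches]

# An "unpredictable" ad hoc query: no index was ever defined on units_sold,
# yet the filter reads nothing beyond the units_sold column itself.
print(query(sales, "units_sold", lambda v: v > 100, ["product", "revenue"]))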
Complex Queries: Complex queries tend to be slow or, in some cases, simply not achievable: not because of their complexity per se, but because they combine elements of unpredictable queries with time-based or quantitative/qualitative queries, and they frequently require whole-table scans. Column-based approaches make complex queries feasible precisely because they perform well in all of these areas.
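As an illustration of the same point (a hedged sketch in the same style as above, not any particular engine's implementation), a complex query combining a time-based predicate with an aggregation still reads only the columns it names, however many others the table holds.

from collections import defaultdict
from datetime import date

# Invented columnar table: orders stored column by column. Dozens of other
# columns could sit alongside these; the query below would never read them.
orders = {
    "order_date": [date(2023, 7, 3), date(2023, 6, 12), date(2023, 7, 21),
                   date(2023, 7, 30), date(2023, 8, 2)],
    "product":    ["toaster", "kettle", "toaster", "kettle", "toaster"],
    "units_sold": [10, 4, 7, 3, 9],
}

# Complex query: total units sold per product for orders placed in July.
# Step 1: the time-based predicate scans only the order_date column.
july_rows = [i for i, d in enumerate(orders["order_date"]) if d.month == 7]

# Step 2: the aggregation touches only the two columns it names.
totals = defaultdict(int)
for i in july_rows:
    totals[orders["product"][i]] += orders["units_sold"][i]

print(dict(totals))  # {'toaster': 17, 'kettle': 3}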
Large Table Scans: Queries are usually interested in only a limited subset of the data in each row. With a traditional approach, however, it is necessary to read each row in its entirety, which is wasteful in the extreme. Column-based approaches read only the relevant data from each column.
Take the following query as an example: list the full name and email address of every customer born in July. If each row consists of 3,200 bytes and there are ten million rows, then the total read requirement for a conventional relational database is 32,000,000,000 bytes. If we assume that the date-of-birth field is four bytes and that the full name and email address are 25 bytes each, then a column-based approach needs to read only 54 bytes from each row, for a total read requirement of 540,000,000 bytes. That is a 59.26-fold reduction, and this is before we take other factors into account, so it is hardly surprising that column-based approaches provide dramatically improved performance.
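The arithmetic is easy to check; this short Python sketch simply reproduces the figures from the example above.

rows = 10_000_000

# Row-based: every row is read in its entirety, 3,200 bytes at a time.
row_store_read = rows * 3_200              # 32,000,000,000 bytes

# Column-based: only date of birth (4 bytes), full name (25 bytes) and
# email address (25 bytes) are read, i.e. 54 bytes per row.
column_store_read = rows * (4 + 25 + 25)   # 540,000,000 bytes

print(f"{row_store_read:,} vs {column_store_read:,} bytes")
print(f"reduction: {row_store_read / column_store_read:.2f}x")  # 59.26x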
 