bottlenecks that exist because of the sharing: all I/O and memory requests are transferred (and satisfied) over the same bus. As more processors are added, the synchronization and communication needs increase exponentially, and the bus becomes less able to handle the increased demand for bandwidth. Unless that bandwidth demand is satisfied, there will be limits to the degree of scalability.
In contrast, in a shared-nothing approach, each processor has its own dedicated disk storage. This approach, which maps nicely to an MPP architecture, is not only better suited to the discrete allocation and distribution of data, but also enables more effective parallelization and consequently does not introduce the same kind of bus bottleneck from which the SMP/shared-memory and shared-disk approaches suffer. Most big data appliances use a collection of
computing resources, typically a combination of processing nodes and stor-
age nodes.
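To make the shared-nothing idea concrete, here is a minimal, purely illustrative Python sketch (the node count, record format, and query are assumptions, not taken from the text): records are hash-partitioned across simulated nodes, each node scans and aggregates only its own local partition, and the partial results are then merged.

    from collections import defaultdict

    NUM_NODES = 4  # assumed node count for the illustration

    def node_for(key):
        # Hash partitioning: each record is allocated to exactly one node.
        return hash(key) % NUM_NODES

    # Discrete allocation of the data: every node owns its own partition.
    partitions = defaultdict(list)
    for key, value in ((i, i * 10) for i in range(1000)):
        partitions[node_for(key)].append((key, value))

    # Each node scans and aggregates its local partition independently
    # (no shared bus or shared disk), and the partial results are merged.
    partial_sums = [sum(v for _, v in partitions[n]) for n in range(NUM_NODES)]
    print(sum(partial_sums))  # same result as a single scan over all the data

Because each partition is served by its own processing and storage, adding nodes adds both compute and I/O capacity rather than adding contention on a single shared bus.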
21.2.2 Row versus Column-Oriented Data Layouts
Most traditional database systems employ a row-oriented layout, in which
all the values associated with a specific row are laid out consecutively in
memory. That layout may work well for transaction processing applications that focus on updating specific records associated with a limited number of transactions (or transaction steps) at a time. Analytic queries, by contrast, are manifested as algorithmic scans performed using multiway joins, and accessing whole rows when only the values of a smaller set of columns are needed may flood the network with extraneous data that is not immediately needed, ultimately increasing execution time.
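For illustration only, the following sketch packs fixed-width records consecutively in the spirit of a row-oriented layout (the four-column record format is an assumption): even when a query needs just one column, it still has to step through every complete row.

    import struct

    ROW = struct.Struct("<iid8s")  # id, quantity, price, 8-byte code
    rows = [(i, i % 5, i * 1.5, b"SKU%04d " % i) for i in range(1000)]

    # Row store: all the values of a record sit next to one another.
    row_store = b"".join(ROW.pack(*r) for r in rows)

    # Reading just the price column still walks whole rows, touching the
    # id, quantity, and code bytes that the query never uses.
    prices = [ROW.unpack_from(row_store, i * ROW.size)[2]
              for i in range(len(rows))]
    print(sum(prices))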
Big data analytics applications scan, aggregate, and summarize over mas-
sive data sets. Analytical applications and queries need to access only the data elements required to satisfy join conditions, yet with a row-oriented layout the entire record must be read to reach the required attributes, with significantly more data read than is needed to satisfy the request. The row-oriented layout is also often misaligned with the characteristics of the different levels of the memory hierarchy (cache, main memory, disk, etc.), leading to increased access latencies. Consequently, row-oriented data layouts will not enable the types of joins or aggregations typical of analytic queries to execute with the anticipated level of performance.
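To put rough numbers on that read amplification (the figures here are purely hypothetical): a table with 100 columns of 8-byte values stores about 800 bytes per record, so a scan that aggregates a single column over a row-oriented layout reads roughly 800 bytes per record in order to use only 8 of them, about 100 times more data than the query actually needs.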
Hence, a number of big data appliances use a database management system with an alternative, columnar data layout, which can help reduce the negative performance impact of data latency that plagues databases with a row-oriented layout. The values for each column are stored separately, so for any query the system can selectively access just the column values needed to evaluate the join conditions. Instead of requiring separate indexes to tune queries, the data values themselves within each column form the index. This speeds up data access while
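As a purely illustrative counterpart to the row-layout sketch above (the column names and types are again assumptions), the same kind of data can be held as one compact array per column; a single-column aggregate then touches only that column's bytes, and a sorted column's own values can stand in for a separate index.

    from array import array
    from bisect import bisect_left

    n = 1000
    ids    = array("i", range(n))                  # stored in sorted order
    qtys   = array("i", (i % 5 for i in range(n)))
    prices = array("d", (i * 1.5 for i in range(n)))

    # A single-column aggregate reads only the price column; the ids and
    # quantities are never touched.
    total_price = sum(prices)

    # The column's own sorted values act as the index: a binary search over
    # ids locates a row position without any separate index structure.
    pos = bisect_left(ids, 42)
    print(total_price, prices[pos])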