bottlenecks that exist because of the sharing: all I/O and memory requests are transferred (and satisfied) over the same bus. As more processors are added, the synchronization and communication needs increase exponentially, and the bus becomes less able to handle the increased demand for bandwidth. Unless that bandwidth demand is satisfied, there will be limits to the degree of scalability.
In contrast, in a shared-nothing approach, each processor has its own dedicated disk storage. This approach, which maps nicely to an MPP architecture, is not only better suited to the discrete allocation and distribution of data, but also enables more effective parallelization and consequently does not introduce the same kind of bus bottleneck from which the SMP/shared-memory and shared-disk approaches suffer. Most big data appliances use a collection of
computing resources, typically a combination of processing nodes and stor-
age nodes.
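To make the shared-nothing idea concrete, here is a minimal, purely illustrative Python sketch (the node count, record format, and query are assumptions, not taken from the text): records are hash-partitioned across simulated nodes, each node scans and aggregates only its own local partition, and the partial results are then merged.

    from collections import defaultdict

    NUM_NODES = 4  # assumed node count for the illustration

    def node_for(key):
        # Hash partitioning: each record is allocated to exactly one node.
        return hash(key) % NUM_NODES

    # Discrete allocation of the data: every node owns its own partition.
    partitions = defaultdict(list)
    for key, value in ((i, i * 10) for i in range(1000)):
        partitions[node_for(key)].append((key, value))

    # Each node scans and aggregates its local partition independently
    # (no shared bus or shared disk), and the partial results are merged.
    partial_sums = [sum(v for _, v in partitions[n]) for n in range(NUM_NODES)]
    print(sum(partial_sums))  # same result as a single scan over all the data

Because each partition is served by its own processing and storage, adding nodes adds both compute and I/O capacity rather than adding contention on a single shared bus.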
21.2.2 Row versus Column-Oriented Data Layouts
Most traditional database systems employ a row-oriented layout, in which
all the values associated with a specific row are laid out consecutively in
memory. That layout may work well for transaction processing applications that focus on updating specific records associated with a limited number of transactions (or transaction steps) at a time. Analytic queries, by contrast, are manifested as algorithmic scans performed using multiway joins, and accessing whole rows when only the values of a smaller set of columns are needed may flood the network with extraneous data that is not immediately needed, ultimately increasing execution time.
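For illustration only, the following sketch packs fixed-width records consecutively in the spirit of a row-oriented layout (the four-column record format is an assumption): even when a query needs just one column, it still has to step through every complete row.

    import struct

    ROW = struct.Struct("<iid8s")  # id, quantity, price, 8-byte code
    rows = [(i, i % 5, i * 1.5, b"SKU%04d " % i) for i in range(1000)]

    # Row store: all the values of a record sit next to one another.
    row_store = b"".join(ROW.pack(*r) for r in rows)

    # Reading just the price column still walks whole rows, touching the
    # id, quantity, and code bytes that the query never uses.
    prices = [ROW.unpack_from(row_store, i * ROW.size)[2]
              for i in range(len(rows))]
    print(sum(prices))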
Big data analytics applications scan, aggregate, and summarize over mas-
sive data sets. Analytical applications and queries need to access only the data elements required to satisfy join conditions, yet with a row-oriented layout the entire record must be read to reach the required attributes, with significantly more data read than is needed to satisfy the request. The row-oriented layout is also often misaligned with the characteristics of the different levels of the memory hierarchy (cache, main memory, disk, etc.), leading to increased access latencies. Consequently, row-oriented data layouts will not enable the types of joins or aggregations typical of analytic queries to execute with the anticipated level of performance.
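To put rough numbers on that read amplification (the figures here are purely hypothetical): a table with 100 columns of 8-byte values stores about 800 bytes per record, so a scan that aggregates a single column over a row-oriented layout reads roughly 800 bytes per record in order to use only 8 of them, about 100 times more data than the query actually needs.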
Hence, a number of big data appliances use a database management system with an alternative, columnar data layout, which can help reduce the negative performance impact of data latency that plagues databases with a row-oriented layout. The values for each column are stored separately, so for any query the system can selectively access just the column values needed to evaluate the join conditions. Instead of requiring separate indexes to tune queries, the data values themselves within each column form the index. This speeds up data access while
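As a purely illustrative counterpart to the row-layout sketch above (the column names and types are again assumptions), the same kind of data can be held as one compact array per column; a single-column aggregate then touches only that column's bytes, and a sorted column's own values can stand in for a separate index.

    from array import array
    from bisect import bisect_left

    n = 1000
    ids    = array("i", range(n))                  # stored in sorted order
    qtys   = array("i", (i % 5 for i in range(n)))
    prices = array("d", (i * 1.5 for i in range(n)))

    # A single-column aggregate reads only the price column; the ids and
    # quantities are never touched.
    total_price = sum(prices)

    # The column's own sorted values act as the index: a binary search over
    # ids locates a row position without any separate index structure.
    pos = bisect_left(ids, 42)
    print(total_price, prices[pos])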