Data Models - Mastering DynamoDB

Database Reference

In-Depth Information

Parallel scan

DynamoDB's scan operation processes the data sequentially from the table, which means

that, for a complete table scan, DynamoDB first retrieves 1 MB of data, returns it, and then

goes and scans the next 1 MB of data, which is quite a nasty and time-consuming way of

dealing with huge table scans.

Though Dynamo stores data on multiple logical partitions, a scan operation can only work

on one partition at a time. This type of restriction leads to underutilization of the provi-

sioned throughput.

To address all these issues, DynamoDB introduced the parallel scan, which divides the

table into multiple segments, and multiple threads work on a single segment at a time. Here

multiple threads and processes are invoked together, and each retrieve 1 MB of data every

time. The process that works upon each segment is called a worker. To issue a parallel scan,

you need to provide the TotalSegments value; TotalSegments is simply the num-

ber of workers going to access a certain table in parallel.

Suppose you have three workers, then you need to invoke the scan command in the fol-

lowing manner:

Scan (TotalSegments = 3, Segment = 0,..)

Scan (TotalSegments = 3, Segment = 1,..)

Scan (TotalSegments = 3, Segment = 2,..)

Here, we would logically divide the table into three segments, and each thread would scan

a dedicated segment only. The following is the pictorial representation of a parallel scan:

Search WWH ::

Custom Search

Home