ensure that the subsets of L and R with the same join key values can be processed on the same machine. Hence, for joining arbitrary data sets on arbitrary keys we generally have to shuffle data over the network or choose appropriate prepartitioning and replication strategies.
The most prominent and flexible join technique in MapReduce is called reduce-side join [14,15]. Some literature also refers to it as repartition join [14] because the idea is based on reading both data sets (map phase) and repartitioning them according to the join key (shuffle phase). The actual join computation is done in the reduce phase. The main drawback of this approach is that both data sets are transferred over the network in their entirety, regardless of the join output. This is especially inefficient for selective joins and consumes a lot of network bandwidth. Another group of joins eliminates the shuffle and reduce phases to avoid transferring both data sets over the network. This kind of join technique is called map-side join since the actual join processing is done in the map phase. The most common one is the map-side merge join [15]. However, this join cannot be applied to arbitrary data sets since a preprocessing step is necessary to fulfill several requirements: both data sets have to be sorted and equally partitioned according to the join key. If these preconditions are fulfilled, the map phase can perform an efficient parallel merge join between the presorted partitions, and no data shuffling is necessary. In a sequence of such joins, however, the shuffle and reduce phases are needed to fulfill the preconditions for the next join iteration. Therefore, map-side joins are generally hard to cascade, and the advantage of avoiding a shuffle and reduce phase is lost. In Section 5.6, we present our MAPSIN join approach, which is designed to overcome this drawback by using the distributed index of the NoSQL data store HBase.
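
To make the repartition idea more concrete, the following sketch shows a reduce-side join implemented with Hadoop's Java MapReduce API. It is only a minimal illustration, not the implementation discussed above: we assume two tab-separated text inputs whose first column is the join key, and all class, path, and tag names are placeholders. The mappers tag every record with its source relation, the shuffle phase repartitions the tagged records by join key, and the reducer buffers one side per key and emits the cross product with the other side.

    // Minimal sketch of a reduce-side (repartition) join in Hadoop MapReduce.
    // Assumption: both inputs are tab-separated text files whose first column
    // is the join key; class and path names are illustrative only.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReduceSideJoin {

      // Map phase: extract the join key and tag the value with its source
      // relation ("L" or "R") so the reducer can tell the two sides apart.
      public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        private final String tag;
        protected TagMapper(String tag) { this.tag = tag; }

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t", 2);
          if (fields.length == 2) {
            context.write(new Text(fields[0]), new Text(tag + "|" + fields[1]));
          }
        }
      }

      public static class LeftMapper extends TagMapper {
        public LeftMapper() { super("L"); }
      }

      public static class RightMapper extends TagMapper {
        public RightMapper() { super("R"); }
      }

      // Reduce phase: all records with the same join key arrive at the same
      // reducer (shuffle = repartitioning); buffer both sides and emit the
      // cross product, i.e., the actual join result for this key.
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          List<String> left = new ArrayList<>();
          List<String> right = new ArrayList<>();
          for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("L|")) left.add(s.substring(2));
            else right.add(s.substring(2));
          }
          for (String l : left) {
            for (String r : right) {
              context.write(key, new Text(l + "\t" + r));
            }
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, LeftMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RightMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the tagged records of both data sets pass through the shuffle phase in full, which is exactly the network overhead discussed above; a map-side merge join avoids this shuffle, but only if both inputs are already sorted and equally partitioned on the join key.
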
5.2.3 Pig Latin
Pig Latin [8] is a language developed by Yahoo! Research for the analysis of very large data sets on top of Apache Hadoop. The implementation of Pig Latin for Hadoop, Pig, is an Apache top-level project that automatically translates a Pig Latin program into a series of MapReduce jobs.
Data Model: Pig Latin has a fully nested data model that allows more flexibility than the flat tables required by the first normal form in relational databases. The data model of Pig Latin provides four different types:
Atom: Contains a simple atomic value like a string or a number, for example, 'Sarah' or 24.
Tuple: Sequence of fields of any type. Every field can have a name (alias) that can be used to reference the field, for example, ('John', 'Doe') with alias (firstname, lastname).
Bag: Collection of tuples with possible duplicates. The schemas of the tuples do not have to match, that is, the number and types of fields can differ, for example,
{
  ('Bob', 'Sarah'),
  ('Peter', ('likes', 'football'))
}