[Table fragment; columns: Logical type annotation, Description, Schema example]
Complex types in Parquet are created using the group type, which adds a layer of
nesting. [87] A group with no annotation is simply a nested record.
Lists and maps are built from groups with a particular two-level group structure, as shown
in Table 13-2. A list is represented as a LIST group with a nested repeating group (called
list) that contains an element field. In this example, a list of 32-bit integers has a
required int32 element field. For maps, the outer group a (annotated MAP) contains an
inner repeating group key_value that contains the key and value fields. In this example,
the values have been marked optional so that it's possible to have null values in the
map.
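The group structure described above can be sketched with a small Python model that renders schema text. This is only an illustration, not Parquet's API: the Field class and its render method are hypothetical, the list's field name a is borrowed from the map example, and the key and value types (binary annotated UTF8, int32) are assumptions, since the table rows are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical model of one schema field (not part of any Parquet library)."""
    repetition: str                      # "required", "optional", or "repeated"
    kind: str                            # primitive type name, or "group"
    name: str
    annotation: Optional[str] = None     # logical type annotation, e.g. LIST or MAP
    children: List["Field"] = field(default_factory=list)

    def render(self, indent: int = 0) -> str:
        pad = "  " * indent
        note = f" ({self.annotation})" if self.annotation else ""
        head = f"{pad}{self.repetition} {self.kind} {self.name}{note}"
        if self.kind != "group":
            return head + ";"
        body = "\n".join(child.render(indent + 1) for child in self.children)
        return f"{head} {{\n{body}\n{pad}}}"


# A list of 32-bit integers: LIST group -> repeated group "list" -> element.
list_schema = Field("required", "group", "a", "LIST", [
    Field("repeated", "group", "list", None, [
        Field("required", "int32", "element"),
    ]),
])

# A map: MAP group -> repeated group "key_value" -> key and optional value.
# The key and value types are assumed for illustration.
map_schema = Field("required", "group", "a", "MAP", [
    Field("repeated", "group", "key_value", None, [
        Field("required", "binary", "key", "UTF8"),
        Field("optional", "int32", "value"),
    ]),
])

print(list_schema.render())
print(map_schema.render())
```

In both cases the annotated outer group wraps exactly one repeated inner group, which is the two-level structure the text refers to.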
Nested Encoding
In a column-oriented store, a column's values are stored together. For a flat table where
there is no nesting and no repetition (such as the weather record schema), this is
simple enough since each column has the same number of values, making it
straightforward to determine which row each value belongs to.
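As a minimal sketch of the flat case, the following Python splits rows into columns and rebuilds a row by position. The field names (year, temperature, stationId) and the sample values are assumptions made for illustration, and the helper functions are hypothetical.

```python
from typing import Any, Dict, List

# Sample rows for a flat schema; names and values are illustrative only.
rows: List[Dict[str, Any]] = [
    {"year": 1949, "temperature": 111, "stationId": "A"},
    {"year": 1950, "temperature": 22, "stationId": "B"},
    {"year": 1950, "temperature": -11, "stationId": "A"},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Store each field's values together, one list per column."""
    columns: Dict[str, List[Any]] = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


def row_at(columns: Dict[str, List[Any]], index: int) -> Dict[str, Any]:
    """Every column has one value per row, so position alone identifies the row."""
    return {name: values[index] for name, values in columns.items()}


columns = to_columns(rows)
print(columns["temperature"])
print(row_at(columns, 1))
```

Because no field is nested or repeated, the i-th entry of every column belongs to the i-th row, and no extra structural information has to be stored.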
In the general case where there is nesting or repetition (such as the map schema), it is
more challenging, since the structure of the nesting needs to be encoded too. Some
columnar formats avoid the problem by flattening the structure so that only the top-level
columns are stored in column-major fashion (this is the approach that Hive's RCFile
takes, for example). A map with nested columns would be stored in such a way that the
keys and values are interleaved, so it would not be possible to read only the keys, say,
without also reading the values into memory.
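The cost of interleaving can be illustrated with a toy Python comparison. This is a sketch under simplifying assumptions, not RCFile's actual on-disk layout: a column is modeled as a plain list, and the "cells loaded" count is a stand-in for the data that must be brought into memory.

```python
from typing import Dict, List, Tuple


def store_interleaved(m: Dict[str, int]) -> List[object]:
    """Flattened layout: the whole map is one column, keys and values alternating."""
    cells: List[object] = []
    for key, value in m.items():
        cells.append(key)
        cells.append(value)
    return cells


def store_split(m: Dict[str, int]) -> Tuple[List[str], List[int]]:
    """Nested-column layout: keys and values live in separate columns."""
    return list(m.keys()), list(m.values())


def keys_from_interleaved(cells: List[object]) -> Tuple[List[object], int]:
    """Return the keys and how many cells had to be loaded to get them."""
    return cells[0::2], len(cells)


def keys_from_split(key_column: List[str]) -> Tuple[List[str], int]:
    """Return the keys and how many cells had to be loaded to get them."""
    return list(key_column), len(key_column)


example = {"a": 1, "b": 2, "c": 3}
interleaved = store_interleaved(example)
key_column, value_column = store_split(example)

print(keys_from_interleaved(interleaved))
print(keys_from_split(key_column))
```

In the interleaved layout the values travel with the keys, so reading the keys touches twice as many cells in this toy model.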
Parquet uses the encoding from Dremel, where every primitive type field in the schema is
stored in a separate column, and for each value written, the structure is encoded by means
of two integers: the definition level and the repetition level. The details are intricate, [88]
but you can think of storing definition and repetition levels like this as a generalization of
using a bit field to encode nulls for a flat record, where the non-null values are written
one after another.
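To make the two integers concrete, here is a simplified Python sketch of Dremel-style shredding for a single map field. It assumes a map with required keys and optional int values inside one repeated group, so the key column has a maximum definition level of 1 and the value column a maximum of 2; both have a maximum repetition level of 1. The function names are hypothetical, and real Parquet writers handle arbitrary schemas and encode the levels compactly.

```python
from typing import Dict, List, Optional, Tuple

Levels = List[Tuple[int, int]]  # (repetition level, definition level) per entry


def shred_map(records: List[Dict[str, Optional[int]]]):
    """Split map records into a key column and a value column.

    Each column is (levels, values); only non-null values are stored,
    packed one after another.
    """
    key_levels: Levels = []
    keys: List[str] = []
    value_levels: Levels = []
    values: List[int] = []
    for record in records:
        if not record:
            # Empty map: a placeholder entry with definition level 0.
            key_levels.append((0, 0))
            value_levels.append((0, 0))
            continue
        for i, (key, value) in enumerate(record.items()):
            rep = 0 if i == 0 else 1  # 0 starts a new record, 1 repeats within it
            key_levels.append((rep, 1))
            keys.append(key)
            if value is None:
                value_levels.append((rep, 1))  # entry exists, value is null
            else:
                value_levels.append((rep, 2))  # fully defined value
                values.append(value)
    return (key_levels, keys), (value_levels, values)


def read_keys(key_column) -> List[List[str]]:
    """Rebuild each record's keys from the key column alone."""
    levels, keys = key_column
    remaining = iter(keys)
    result: List[List[str]] = []
    for rep, definition in levels:
        if rep == 0:
            result.append([])
        if definition == 1:
            result[-1].append(next(remaining))
    return result


records = [{"x": 1, "y": None}, {}, {"z": 3}]
key_column, value_column = shred_map(records)
print(key_column)
print(value_column)
print(read_keys(key_column))
```

The null value for "y" and the empty second record occupy no space in the value lists; they are recorded only through the levels, much as a null bit would be in a flat record.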
The upshot of this encoding is that any column (even a nested one) can be read
independently of the others. In the case of a Parquet map, for example, the keys can be read