then pass the object to Spark. For example, “HBase” on page 96 shows how to use
newAPIHadoopDataset to load data from HBase.
Example: Protocol buffers
Protocol buffers 5 were first developed at Google for internal remote procedure calls
(RPCs) and have since been open sourced. Protocol buffers (PBs) are structured data,
with the fields and their types clearly defined. They are optimized for fast
encoding and decoding and for a compact serialized size.
Compared to XML, PBs are 3× to 10× smaller and can be 20× to 100× faster to
encode and decode. While a PB has a consistent encoding, there are multiple ways to
create a file consisting of many PB messages.
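As a rough illustration of one such way, the sketch below frames each serialized message with a fixed 4-byte big-endian length prefix. This layout is an assumption chosen for simplicity, not a format the protocol buffer libraries mandate, and the helper names write_framed and read_framed are hypothetical. The payloads are placeholder bytes standing in for serialized messages.

```python
import io
import struct

def write_framed(out, payloads):
    """Write each payload preceded by its length as a 4-byte big-endian int."""
    for payload in payloads:
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)

def read_framed(stream):
    """Yield payloads back from a stream written by write_framed."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) != 4:
            raise EOFError("truncated length prefix")
        (size,) = struct.unpack(">I", header)
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated message")
        yield payload

# Placeholder byte strings standing in for serialized PB messages.
records = [b"\x08\x01", b"", b"x" * 300]
buf = io.BytesIO()
write_framed(buf, records)
buf.seek(0)
decoded = list(read_framed(buf))
```

A reader and writer must agree on the framing in advance; a file written with a different convention (for example, a variable-length prefix) would not be readable with this sketch.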
Protocol buffers are defined using a domain-specific language, and then the protocol
buffer compiler can be used to generate accessor methods in a variety of languages
(including all those supported by Spark). Since PBs aim to take up a minimal amount
of space they are not “self-describing,” as encoding the description of the data would
take up additional space. This means that to parse data that is formatted as PB, we
need the protocol buffer definition to make sense of it.
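To make the "not self-describing" point concrete, here is a minimal hand-rolled sketch of how two fields might appear on the wire, assuming the standard encoding in which each field is tagged with (field number << 3) | wire type and integers are written as base-128 varints. The helper names and the sample values (150 and "Cafe") are made up for illustration; real code would use the classes generated by the protocol buffer compiler.

```python
def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_key(field_number, wire_type):
    """A field key carries only the field number and the wire type."""
    return encode_varint((field_number << 3) | wire_type)

def encode_int_field(field_number, value):
    # Wire type 0 = varint (this sketch handles non-negative values only).
    return encode_key(field_number, 0) + encode_varint(value)

def encode_string_field(field_number, text):
    # Wire type 2 = length-delimited: key, byte length, then the bytes.
    data = text.encode("utf-8")
    return encode_key(field_number, 2) + encode_varint(len(data)) + data

# Field 1 as an integer and field 2 as a string, with made-up values.
encoded = encode_int_field(1, 150) + encode_string_field(2, "Cafe")
```

The nine resulting bytes contain field numbers and values but no field names, which is why a reader needs the definition to know that field 1 is an id and field 2 is a name.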
PBs consist of fields that can be either optional, required, or repeated. When you're
parsing data, a missing optional field does not result in a failure, but a missing
required field results in failing to parse the data. Therefore, when you're adding new
fields to existing protocol buffers it is good practice to make the new fields optional,
as not everyone will upgrade at the same time (and even if they do, you might want to
read your old data).
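The effect of the required and optional rules on old data can be sketched with a toy model. The code below does not use the protocol buffer library; the schema dictionaries and the parse function are hypothetical stand-ins that only mimic the rule described above.

```python
# Toy schemas: field name -> rule. Hypothetical, not generated PB code.
VENUE_V2 = {"id": "required", "name": "required", "address": "optional"}
VENUE_V2_STRICT = {"id": "required", "name": "required", "address": "required"}

def parse(record, schema):
    """Return the parsed fields, failing only when a required field is missing."""
    parsed = {}
    for field, rule in schema.items():
        if field in record:
            parsed[field] = record[field]
        elif rule == "required":
            raise ValueError("missing required field: " + field)
        else:
            parsed[field] = None  # a missing optional field is simply unset
    return parsed

# A record written before the address field existed.
old_record = {"id": 1, "name": "Cafe"}

# Adding address as optional keeps old data readable.
upgraded = parse(old_record, VENUE_V2)

# Adding address as required makes the same old record fail to parse.
try:
    parse(old_record, VENUE_V2_STRICT)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```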
PB fields can be one of many predefined types, or another PB message. These types
include string, int32, enums, and more. This is by no means a complete introduction to
protocol buffers, so if you are interested you should consult the Protocol Buffers website.
In Example 5-27 we will look at loading many VenueResponse objects from a simple
protocol buffer format. The sample VenueResponse is a simple format with one
repeated field, containing another message with required, optional, and enumeration
fields.
Example 5-27. Sample protocol buffer definition
message Venue {
  required int32 id = 1;
  required string name = 2;
  required VenueType type = 3;
  optional string address = 4;
5 Sometimes called pbs or protobufs.
 