then pass the object to Spark. For example, “HBase” on page 96 shows how to use
newAPIHadoopDataset to load data from HBase.
Example: Protocol buffers
Protocol buffers were first developed at Google for internal remote procedure calls
(RPCs) and have since been open sourced. Protocol buffers (PBs) are structured data,
with the fields and types of fields being clearly defined. They are optimized to be fast
for encoding and decoding and also take up the minimum amount of space.
Compared to XML, PBs are 3× to 10× smaller and can be 20× to 100× faster to
encode and decode. While a PB has a consistent encoding, there are multiple ways to
create a file consisting of many PB messages.
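One common way to store many messages in a single file is to prefix each encoded message with its length. The sketch below assumes a base-128 varint length prefix and treats each message as an opaque byte string; the function names are hypothetical and this is not the protocol buffer library's own API.

```python
import io


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(stream):
    """Read one varint; return None at a clean end of stream."""
    shift = 0
    result = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a length prefix")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_delimited(stream, messages):
    """Write each already-encoded message as <length><payload>."""
    for payload in messages:
        stream.write(encode_varint(len(payload)))
        stream.write(payload)


def read_delimited(stream):
    """Yield payloads back in the order they were written."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a message")
        yield payload


# Stand-ins for encoded messages, including an empty one and a long one.
records = [b"first", b"", b"x" * 300]
buffer = io.BytesIO()
write_delimited(buffer, records)
buffer.seek(0)
decoded = list(read_delimited(buffer))
```

A reader of such a file has to know that this framing was used, since nothing in the encoded messages themselves marks where one ends and the next begins.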
Protocol buffers are defined using a domain-specific language, and then the protocol
buffer compiler can be used to generate accessor methods in a variety of languages
(including all those supported by Spark). Since PBs aim to take up a minimal amount
of space, they are not “self-describing,” as encoding the description of the data would
take up additional space. This means that to parse data that is formatted as PB, we
need the protocol buffer definition to make sense of it.
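To illustrate what "not self-describing" means in practice, the following sketch scans a hand-built byte string in the protocol buffer wire format without any definition. It handles only the varint and length-delimited wire types, and the helper names are hypothetical. The bytes carry field numbers and wire types, but no field names and no indication of whether a length-delimited value is a string, raw bytes, or a nested message.

```python
def read_varint(data, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def scan_fields(data):
    """Return (field_number, wire_type, raw_value) tuples, using no schema."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            size, pos = read_varint(data, pos)
            value = data[pos:pos + size]
            pos += size
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


# Field 1 as varint 42, field 2 as the 4 bytes "Cafe".
raw = bytes([0x08, 0x2A, 0x12, 0x04]) + b"Cafe"
fields = scan_fields(raw)

# Only with the definition do the numbers get names (assumed here: 1=id, 2=name).
names = {1: "id", 2: "name"}
labelled = {names[number]: value for number, _, value in fields}
```

Without the `names` mapping and the declared types, all a reader can say is that field 1 holds the integer 42 and field 2 holds four bytes.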
PBs consist of fields that can be either optional, required, or repeated. When you're
parsing data, a missing optional field does not result in a failure, but a missing
required field results in failing to parse the data. Therefore, when you're adding new
fields to existing protocol buffers, it is good practice to make the new fields optional,
as not everyone will upgrade at the same time (and even if they do, you might want to
read your old data).
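The effect of adding a field as optional versus required can be modeled with a toy parser. This is a simplified stand-in that uses plain dictionaries rather than the real wire format or the generated classes; the schema dictionaries and the parse function are hypothetical names chosen for illustration.

```python
REQUIRED, OPTIONAL = "required", "optional"

# Hypothetical definitions: the original version, and two possible upgrades.
VENUE_V1 = {"id": REQUIRED, "name": REQUIRED}
VENUE_V2 = {"id": REQUIRED, "name": REQUIRED, "address": OPTIONAL}
VENUE_V2_STRICT = {"id": REQUIRED, "name": REQUIRED, "address": REQUIRED}


def parse(schema, record):
    """Fail on a missing required field; leave a missing optional field unset."""
    missing = [
        name
        for name, rule in schema.items()
        if rule == REQUIRED and name not in record
    ]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    # Fields the schema does not know about are simply dropped in this sketch.
    return {name: record.get(name) for name in schema}


# A record written before the address field existed.
old_record = {"id": 1, "name": "Cafe"}

# Reading old data with the optional-field upgrade succeeds.
upgraded = parse(VENUE_V2, old_record)

# Reading the same data with the required-field upgrade fails.
try:
    parse(VENUE_V2_STRICT, old_record)
    strict_error = None
except ValueError as err:
    strict_error = str(err)
```

Here the old record still parses under the definition that added address as optional, while the definition that added it as required rejects data that was valid when it was written.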
A PB field can be one of many predefined types, or another PB message. These types
include string, int32, enums, and more. This is by no means a complete introduction to
protocol buffers, so if you are interested you should consult the Protocol Buffers website.
In Example 5-27 we will look at loading many VenueResponse objects from a simple
protocol buffer format. The sample VenueResponse is a simple format with one
repeated field, containing another message with required, optional, and enumeration
fields.
Example 5-27. Sample protocol buffer definition
message Venue {
  required int32 id = 1;
  required string name = 2;
  required VenueType type = 3;
  optional string address = 4;
5 Sometimes called pbs or protobufs.