Database Reference
In-Depth Information
Serialization
Serialization is the process of turning structured objects into a byte stream for transmission
over a network or for writing to persistent storage. Deserialization is the reverse process of
turning a byte stream back into a series of structured objects.
Serialization is used in two quite distinct areas of distributed data processing: for interpro-
cess communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is implemented using
remote procedure calls (RPCs). The RPC protocol uses serialization to render the message
into a binary stream to be sent to the remote node, which then deserializes the binary
stream into the original message. In general, it is desirable that an RPC serialization format
is:
Compact
A compact format makes the best use of network bandwidth, which is the most scarce
resource in a data center.
Fast
Interprocess communication forms the backbone for a distributed system, so it is essen-
tial that there is as little performance overhead as possible for the serialization and
deserialization process.
Extensible
Protocols change over time to meet new requirements, so it should be straightforward to
evolve the protocol in a controlled manner for clients and servers. For example, it
should be possible to add a new argument to a method call and have the new servers ac-
cept messages in the old format (without the new argument) from old clients.
Interoperable
For some systems, it is desirable to be able to support clients that are written in different
languages to the server, so the format needs to be designed to make this possible.
On the face of it, the data format chosen for persistent storage would have different require-
ments from a serialization framework. After all, the lifespan of an RPC is less than a
second, whereas persistent data may be read years after it was written. But it turns out, the
four desirable properties of an RPC's serialization format are also crucial for a persistent
storage format. We want the storage format to be compact (to make efficient use of storage
space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible
Search WWH ::




Custom Search