As shown earlier, records imported by Sqoop are laid out on disk in a fashion very similar
to a database's internal structure: an array of records with all fields of a record
concatenated together. When running a MapReduce program over imported records, each map
task must fully materialize all fields of each record in its input split. If the contents
of a large object field are relevant for only a small subset of the records used as input
to a MapReduce program, fully materializing every record is inefficient. Furthermore,
depending on the size of the large object, full materialization in memory may be
impossible.
To overcome these difficulties, Sqoop stores imported large objects in a separate file
called a LobFile if they are larger than a threshold size of 16 MB (configurable, in
bytes, via the sqoop.inline.lob.length.max setting). The LobFile format can store
individual records of very large size (it uses a 64-bit address space). Each record in a
LobFile holds a single large object. The LobFile format allows clients to hold a
reference to a record without accessing the record contents. When a record's contents are
accessed, this is done through a java.io.InputStream (for binary objects) or a
java.io.Reader (for character-based objects).
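
The threshold rule above can be sketched in a few lines of Java. This is only an illustration of the "larger than the limit" decision as the text describes it, not Sqoop's actual implementation; the class name LobPlacement and the method storedExternally are hypothetical names introduced for this example.

```java
// Hypothetical illustration of the inline-vs-external decision described above.
// It is NOT Sqoop code; the names are invented for this sketch.
final class LobPlacement {
    // The 16 MB default threshold from the text, expressed in bytes.
    static final long DEFAULT_INLINE_LIMIT_BYTES = 16L * 1024 * 1024;

    // Returns true when a large object of the given size would go to a separate
    // LobFile, i.e. when it is strictly larger than the configured limit.
    static boolean storedExternally(long lobSizeBytes, long inlineLimitBytes) {
        return lobSizeBytes > inlineLimitBytes;
    }

    public static void main(String[] args) {
        long oneMb = 1024L * 1024;
        System.out.println(DEFAULT_INLINE_LIMIT_BYTES); // prints 16777216
        System.out.println(storedExternally(oneMb, DEFAULT_INLINE_LIMIT_BYTES));      // prints false
        System.out.println(storedExternally(32 * oneMb, DEFAULT_INLINE_LIMIT_BYTES)); // prints true
    }
}
```

Because the setting is given in bytes, a value such as 16777216 corresponds to the 16 MB default mentioned above.
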
When a record is imported, the “normal” fields are materialized together in a text file,
along with a reference to the LobFile where a CLOB or BLOB column is stored. For example,
suppose our widgets table contained a BLOB field named schematic holding the actual
schematic diagram for each widget.
An imported record might then look like:
2,gizmo,4.00,2009-11-30,4,null,externalLob(lf,lobfile0,100,5011714)
The externalLob(...) text is a reference to an externally stored large object, stored
in LobFile format (lf) in a file named lobfile0, with the specified byte offset and
length inside that file.
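
To make the four parts of the reference concrete, here is a minimal Java sketch that splits the externalLob(...) text into format, file name, offset, and length. The class ExternalLobRef and its regular expression are assumptions made for this example, inferred only from the sample record above; Sqoop's own parsing may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the reference text shown above; not part of Sqoop.
final class ExternalLobRef {
    // Assumed layout: externalLob(<format>,<file>,<offset>,<length>)
    private static final Pattern REF =
        Pattern.compile("externalLob\\(([^,]+),([^,]+),(\\d+),(\\d+)\\)");

    final String format;  // storage format, e.g. "lf" for LobFile
    final String file;    // name of the file holding the object
    final long offset;    // byte offset of the object inside that file
    final long length;    // length of the object in bytes

    private ExternalLobRef(String format, String file, long offset, long length) {
        this.format = format;
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    // Parses a single reference field; rejects text that does not match the layout.
    static ExternalLobRef parse(String field) {
        Matcher m = REF.matcher(field.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an externalLob reference: " + field);
        }
        return new ExternalLobRef(m.group(1), m.group(2),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)));
    }

    public static void main(String[] args) {
        String record = "2,gizmo,4.00,2009-11-30,4,null,externalLob(lf,lobfile0,100,5011714)";
        // The reference itself contains commas, so locate it by its prefix
        // instead of splitting the whole record on commas.
        ExternalLobRef ref = parse(record.substring(record.indexOf("externalLob(")));
        System.out.println(ref.format + " " + ref.file + " " + ref.offset + " " + ref.length);
        // prints: lf lobfile0 100 5011714
    }
}
```
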
When working with this record, the Widget.get_schematic() method would return an object
of type BlobRef referencing the schematic column, but not actually containing its
contents. The BlobRef.getDataStream() method actually opens the LobFile and returns an
InputStream, allowing you to access the schematic field's contents.
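
The idea of a reference that defers all I/O until the data is requested can be sketched as follows. LazyBlobRef is a hypothetical stand-in written for this example, not Sqoop's BlobRef: it treats the target as a plain local file and reads raw bytes at an offset, whereas the real LobFile format has its own record structure. The demo() helper and its sample bytes are likewise invented for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical lazy reference: holds only a location, and opens the file on demand.
final class LazyBlobRef {
    private final Path file;
    private final long offset;
    private final long length;

    // Constructing the reference performs no I/O.
    LazyBlobRef(Path file, long offset, long length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    // Opens the file, seeks to the offset, and returns a stream limited to `length` bytes.
    InputStream getDataStream() throws IOException {
        FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
        try {
            channel.position(offset);
        } catch (IOException e) {
            channel.close();
            throw e;
        }
        final InputStream raw = Channels.newInputStream(channel);
        return new InputStream() {
            private long remaining = length;

            @Override
            public int read() throws IOException {
                if (remaining <= 0) return -1;
                int b = raw.read();
                if (b >= 0) remaining--;
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                if (len == 0) return 0;
                if (remaining <= 0) return -1;
                int n = raw.read(buf, off, (int) Math.min(len, remaining));
                if (n > 0) remaining -= n;
                return n;
            }

            @Override
            public void close() throws IOException {
                raw.close(); // closing the stream also closes the underlying channel
            }
        };
    }

    // Writes a small sample file, then reads back only the referenced byte range.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("lobfile", ".bin");
            try {
                Files.write(tmp, "HEADERschematic-bytesTRAILER".getBytes(StandardCharsets.US_ASCII));
                LazyBlobRef ref = new LazyBlobRef(tmp, 6, 15); // skip "HEADER", take 15 bytes
                StringBuilder out = new StringBuilder();
                try (InputStream in = ref.getDataStream()) {
                    int b;
                    while ((b = in.read()) != -1) out.append((char) b);
                }
                return out.toString();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints schematic-bytes
    }
}
```

Returning a bounded stream rather than a byte array keeps the sketch in line with the point made earlier: a large object never has to be held in memory all at once.
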
When running a MapReduce job processing many Widget records, you might need to access the
schematic fields of only a handful of records. This system lets you incur the I/O costs
of accessing only the required large object entries, a big saving, as individual
schematics may be several megabytes or more of data.