The Hive Metastore
When data is loaded from HDFS into Hive, it is necessary to describe the schema of the data. Hive keeps track of the schema, location, and other metadata about its tables by using a relational database called the metastore.
By default, Hive provides its own embedded metastore (powered by the relational database Apache Derby). This default database makes it easy to start using Hive without much setup work, but it has a significant limitation: the embedded metastore supports only one Hive session at a time, so multiple users will not be able to work together. When using Hive in production with multiple users and large datasets, the best solution to this problem is to set up an external relational database, such as MySQL, to act as the metastore.
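As a rough sketch, an external MySQL metastore is typically configured through JDBC connection properties in hive-site.xml; the hostname, database name, and credentials below are placeholders, not values from this chapter:

<!-- hive-site.xml: point the metastore at an external MySQL database.
     Host, database name, and credentials here are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>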
Loading Data into Hive
With Hive, “loading” is a bit of a loaded term. Hive is relatively data-agnostic, with the ability to support queries over a range of source formats, including raw text files, Hadoop SequenceFiles (the key-value format used for intermediate MapReduce data processing), and specialized columnar formats.
Hive's basic unit of data is the table. A table in Hive acts much like a table in a relational database: a two-dimensional collection of columns of different datatypes, with rows containing records. Also like relational databases, Hive tables can be organized into distinct “databases,” which act as namespaces to hold specific table names. In other words, two different databases can have the same table names; they won't clash.
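For instance, the following statements create two namespaces that each hold their own employees table; the database and table names here are invented for illustration:

-- Two databases can each hold a table with the same name.
CREATE DATABASE hr;
CREATE DATABASE sales;
CREATE TABLE hr.employees (name STRING, id INT);
CREATE TABLE sales.employees (name STRING, id INT);
-- USE switches the current database for unqualified table names.
USE hr;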
Like relational databases, Hive supports a number of primitive data types for each field, including various forms of integers, Booleans, floating-point decimals, and strings. Hive also supports arrays and structs of values and Unix timestamp values as data types, along with a number of built-in mathematical functions.
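As a sketch, a table definition can mix primitive and complex types; the table and field names below are hypothetical:

-- A table combining primitive types, an array, a struct,
-- and a timestamp column.
CREATE TABLE page_views (
  user_id BIGINT,
  is_active BOOLEAN,
  score DOUBLE,
  tags ARRAY<STRING>,
  location STRUCT<city:STRING, zip:STRING>,
  viewed_at TIMESTAMP
);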
Hive has two main concepts of how data is controlled: managed and external. This distinction determines whether Hive is responsible for deleting source data when tables are dropped. A common use case for Hive is to operate over data that is also being used by other applications, such as custom MapReduce code. In this case, you want Hive to be able to access the data in place without taking ownership of it: if you drop such a table, Hive will delete only its own references to the data, not the data itself. Tables with this behavior are called external tables, and they are created by using the EXTERNAL modifier in the CREATE TABLE statement.
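For example, here is a sketch of an external table defined over files that another application owns; the path and schema are hypothetical:

-- EXTERNAL leaves ownership of the files with whoever wrote them.
CREATE EXTERNAL TABLE web_logs (ip STRING, url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/web_logs';
-- DROP TABLE web_logs removes only Hive's metadata;
-- the files under /data/web_logs are left intact.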
Hive natively supports several file formats. Besides text, Hive can also use Hadoop's SequenceFile format, which is the native key-value format used to keep track of intermediate data in a MapReduce flow. For better performance, Hive can read files in
a format called Record Columnar File (RCFile). We'll take a look at how to create
RCFile tables later in this chapter. Listing 5.1 provides examples of creating Hive tables.
Listing 5.1 Creating Hive tables using data from local and HDFS sources
/* Create a Hive-managed table. Deleting this table will remove
   both the table metadata and the data itself. The ROW FORMAT
   clause tells Hive the source file is comma-delimited. */
CREATE TABLE employee_ids (name STRING, id INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
/* LOAD DATA INPATH moves the file from its HDFS location into
   Hive's warehouse directory. */
LOAD DATA INPATH '/users/ids.csv' INTO TABLE employee_ids;
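The listing's caption also mentions local sources; as a sketch, a load from the local filesystem adds the LOCAL keyword, with the path below being a placeholder:

/* LOAD DATA LOCAL INPATH copies a file from the local filesystem
   rather than moving one within HDFS. The path is hypothetical. */
LOAD DATA LOCAL INPATH '/tmp/ids.csv' INTO TABLE employee_ids;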
 