Database Reference
In-Depth Information
/* Create an external table. Deleting this table will leave the
data intact but remove the table's metadata. */
CREATE EXTERNAL TABLE employee_ids (name STRING, id INT)
LOCATION '/external/employee_ids';
LOAD DATA INPATH '/users/ids.csv' INTO TABLE employee_ids;
Querying Data: HiveQL
Now that we've created tables using Hive to keep track of our input files, we can
start asking questions about the data. Hive provides an SQL-like language known as
HiveQL, which enables functionality (such as GROUP BY , JOIN s, and HAVING) famil-
iar to users of standard SQL.
HiveQL can't express every type of query that is possible with standard SQL. The
differences between standard SQL and HiveQL mostly stem from the differences
between the design of relational databases and the MapReduce framework. As Hive is
not meant to do appends to existing datasets, transactions are not supported, nor are
materialized views. However, Hive is able to take advantage of some of the unique
benefits of MapReduce. One of the most useful is the multiple insertion feature, which
enables Hive queries to produce multiple output tables from a single query (rather than
just one, which is the normal result of an SQL query). This is more efficient than run-
ning separate queries, as Hive will only scan the source table once during a multiple
insertion. An example of a multiple-insertion query can be seen in Listing 5.2.
Listing 5.2 Using a multiple-insertion query in Hive
FROM bookstore
INSERT OVERWRITE TABLE book
SELECT author, SUM(revenue) AS total
WHERE category='book' GROUP BY author
INSERT OVERWRITE TABLE comics
SELECT author, SUM(revenue) AS total
WHERE category='comic' GROUP BY author;
Another difference between the standard SQL of a relational database and HiveQL
is the absence of inequality JOIN s. The MapReduce model is not well-designed to
compare keys that do not match. Therefore, it is only possible with Hive to run JOIN
queries between two tables when a particular key equals another.
These examples merely scratch the surface of what is possible with Hive. For prac-
tical explorations of asking questions about datasets, it's possible to run Hadoop and
Hive on a single machine, using either the local filesystem or a local Hadoop filesys-
tem for testing. Scaling the system up to use an external MySQL database to act as a
remote Hive metastore is slightly more work (mostly due to configuring proper per-
missions on the MySQL database).
The ease of setting up Hive on a small installation makes it a tempting choice for
small datasets, but if your data sizes are such that a single server will suffice, then it is
 
 
Search WWH ::




Custom Search