Hive - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

an example of creating a copy of a table in Parquet format using CREATE TABLE...AS

SELECT (see CREATE TABLE...AS SELECT ):

CREATE TABLE users_parquet STORED AS PARQUET

AS

SELECT * FROM users;

Using a custom SerDe: RegexSerDe

Let's see how to use a custom SerDe for loading data. We'll use a contrib SerDe that uses

a regular expression for reading the fixed-width station metadata from a text file:

CREATE TABLE stations (usaf STRING, wban STRING, name STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"

);

In previous examples, we have used the DELIMITED keyword to refer to delimited text

in the ROW FORMAT clause. In this example, we instead specify a SerDe with the SERDE

keyword and the fully qualified classname of the Java class that implements the SerDe,

org.apache.hadoop.hive.contrib.serde2.RegexSerDe .

SerDes can be configured with extra properties using the WITH SERDEPROPERTIES

clause. Here we set the input.regex property, which is specific to RegexSerDe .

input.regex is the regular expression pattern to be used during deserialization to turn

the line of text forming the row into a set of columns. Java regular expression syntax is

used for the matching, and columns are formed from capturing groups of parentheses. [ 115 ]

In this example, there are three capturing groups for usaf (a six-digit identifier), wban (a

five-digit identifier), and name (a fixed-width column of 29 characters).

To populate the table, we use a LOAD DATA statement as before:

LOAD DATA LOCAL INPATH "input/ncdc/metadata/stations-fixed-width.txt"

INTO TABLE stations;

Recall that LOAD DATA copies or moves the files to Hive's warehouse directory (in this

case, it's a copy because the source is the local filesystem). The table's SerDe is not used

for the load operation.

When we retrieve data from the table the SerDe is invoked for deserialization, as we can

see from this simple query, which correctly parses the fields for each row:

Search WWH ::

Custom Search

Home