Database Reference
In-Depth Information
an example of creating a copy of a table in Parquet format using
CREATE TABLE...AS
SELECT
(see
CREATE TABLE...AS SELECT
):
CREATE TABLE users_parquet STORED AS PARQUET
AS
SELECT * FROM users;
Using a custom SerDe: RegexSerDe
Let's see how to use a custom SerDe for loading data. We'll use a contrib SerDe that uses
a regular expression for reading the fixed-width station metadata from a text file:
CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
);
In previous examples, we have used the
DELIMITED
keyword to refer to delimited text
in the
ROW FORMAT
clause. In this example, we instead specify a SerDe with the
SERDE
keyword and the fully qualified classname of the Java class that implements the SerDe,
org.apache.hadoop.hive.contrib.serde2.RegexSerDe
.
SerDes can be configured with extra properties using the
WITH SERDEPROPERTIES
clause. Here we set the
input.regex
property, which is specific to
RegexSerDe
.
input.regex
is the regular expression pattern to be used during deserialization to turn
the line of text forming the row into a set of columns.
Java regular expression syntax
is
In this example, there are three capturing groups for
usaf
(a six-digit identifier),
wban
(a
five-digit identifier), and
name
(a fixed-width column of 29 characters).
To populate the table, we use a
LOAD DATA
statement as before:
LOAD DATA LOCAL INPATH "input/ncdc/metadata/stations-fixed-width.txt"
INTO TABLE stations;
Recall that
LOAD DATA
copies or moves the files to Hive's warehouse directory (in this
case, it's a copy because the source is the local filesystem). The table's SerDe is not used
for the load operation.
When we retrieve data from the table the SerDe is invoked for deserialization, as we can
see from this simple query, which correctly parses the fields for each row: