Database Reference
In-Depth Information
Now, the DateConv Hive UDF function is ready to be used in a Hive QL query by using just its short name,
DateConv. It will take a date with a structure of dd-MM-yyyy and return a date formatted to yyyy-MM-dd for use in a
Hive table. Next, I show how to invoke this function.
Table Creation
Suppose, as the next step in your ETL chain, you want to extract monthly purchase totals by supplier. You could, as
shown earlier, create a table with
CREATE TABLE
, using the
IF NOT EXISTS
clause to prevent duplication; for example,
I create the following table:
CREATE TABLE IF NOT EXISTS
trade.suppliertot
(
payyear INT,
paymonth INT,
supplier STRING,
totamount DOUBLE
);
Called “suppliertot” and residing in the trade database, this table would have four columns: one for the year of a
transaction, one for the month, one for the the supplier name, and one for the total amount of the purchases from that
supplier (indicated by the comma-separated list within parentheses). The table would not, however, contain any data
yet, as it has no
LOCATION
clause to link it to an HDFS directory.
Alternatively, I could create the table via a
SELECT
statement, which would automatically populate it with data. To
do that, I drop the table first and then run the
CREATE
statement, as follows:
DROP TABLE trade.suppliertot;
CREATE TABLE IF NOT EXISTS
trade.suppliertot
AS
SELECT
year(DateConv (paydate) ) as payyear,
month(DateConv (paydate) ) as paymonth,
supplier,
SUM(amount) as totamount
FROM
trade.rawtrans
GROUP BY
year(DateConv (paydate) ) ,
month(DateConv (paydate) ) ,
supplier ;
This statement uses the DateConv UDF that was created previously to convert date strings with the format dd/
MM/yyyy to the format yyyy-MM-dd; here, it is used against the “paydate” column. I use the same
CREATE TABLE IF
NOT EXISTS
option to create the table trade.suppliertot. However, I replace the list of columns in parentheses with an
AS SELECT
statement that has four columns that takes data from the table trade.rawtrans via a
FROM
clause.
I can confirm that this second table now contains the data by issuing the statement
SELECT COUNT(*) FROM
trade.suppliertot
. This returns the result 391, indicating that the trade.suppliertot table contains 391 rows. The
COUNT(*)
is a special aggregation function that returns the number of rows in a table.
Search WWH ::
Custom Search