Example 9-36. Python string length UDF
# Make a UDF to tell us how long some text is
hiveCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
lengthSchemaRDD = hiveCtx.sql("SELECT strLenPython('text') FROM tweets LIMIT 10")
Example 9-37. Scala string length UDF
registerFunction("strLenScala", (_: String).length)
val tweetLength = hiveCtx.sql("SELECT strLenScala('tweet') FROM tweets LIMIT 10")
There are some additional imports for Java to define UDFs. As with the functions we defined for RDDs, we extend a special class. Depending on the number of parameters, we extend UDF[N], as shown in Examples 9-38 and 9-39.
Example 9-38. Java UDF imports
// Import UDF function class and DataTypes
// Note: these import paths may change in a future release
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
Example 9-39. Java string length UDF
hiveCtx
.
udf
().
register
(
"stringLengthJava"
,
new
UDF1
<
String
,
Integer
>()
{
@Override
public
Integer
call
(
String
str
)
throws
Exception
{
return
str
.
length
();
}
},
DataTypes
.
IntegerType
);
SchemaRDD
tweetLength
=
hiveCtx
.
sql
(
"SELECT stringLengthJava('text') FROM tweets LIMIT 10"
);
List
<
Row
>
lengths
=
tweetLength
.
collect
();
for
(
Row
row
:
result
)
{
System
.
out
.
println
(
row
.
get
(
0
));
}
Hive UDFs
Spark SQL can also use existing Hive UDFs. The standard Hive UDFs are already automatically included. If you have a custom UDF, it is important to make sure that the JARs for your UDF are included with your application. If we run the JDBC server, note that we can add the JARs with the
--jars
command-line flag. Developing Hive UDFs
is beyond the scope of this book, so we will instead introduce how to use existing
Hive UDFs.
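In practice, using an existing Hive UDF usually comes down to making its JAR visible to Spark and registering the UDF class under a SQL function name with HiveQL's CREATE TEMPORARY FUNCTION. The sketch below is a hypothetical illustration: the JAR path, the class name com.example.hive.MyLowerUDF, and the function name myLower are placeholders, not part of the examples above.

```scala
// Hypothetical example: the JAR path, class name, and function name below
// are placeholders, not from this chapter's sample code.

// Make the JAR containing the UDF available to the session
// (alternatively, pass it at launch with --jars).
hiveCtx.sql("ADD JAR /path/to/my-udfs.jar")

// Expose the Hive UDF class under a SQL function name.
hiveCtx.sql("CREATE TEMPORARY FUNCTION myLower AS 'com.example.hive.MyLowerUDF'")

// The function can now be called like any built-in.
val lowered = hiveCtx.sql("SELECT myLower('TEXT') FROM tweets LIMIT 10")
```

Because the function is registered by fully qualified class name, the same two statements work for any Hive UDF already on the classpath, which is why shipping the JAR with the application (or via --jars) is the important step.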