From here you can also write SQL to query the data. The Beeline shell is great for
quick data exploration on cached tables shared by multiple users.
Long-Lived Tables and Queries
One of the advantages of using Spark SQL's JDBC server is that we can share cached
tables between multiple programs. This is possible because the JDBC Thrift server is a
single driver program. To do this, you only need to register the table and then run the
CACHE command on it, as shown in the previous section.
Standalone Spark SQL Shell
Apart from its JDBC server, Spark SQL also supports a simple shell
you can use as a single process, available through ./bin/spark-sql.
This shell connects to the Hive metastore you have set in
conf/hive-site.xml, if one exists, or creates one locally. It is most useful for
local development; in a shared cluster, you should instead use the
JDBC server and have users connect with beeline.
User-Defined Functions
User-defined functions, or UDFs, allow you to register custom functions in Python,
Java, and Scala to call within SQL. They are a very popular way to expose advanced
functionality to SQL users in an organization, so that these users can call into it
without writing code. Spark SQL makes it especially easy to write UDFs. It supports
both its own UDF interface and existing Apache Hive UDFs.
Spark SQL UDFs
Spark SQL offers a built-in method to easily register UDFs by passing in a function in
your programming language. In Scala and Python, we can use the native function
and lambda syntax of the language, and in Java we need only extend the appropriate
UDF class. Our UDFs can work on a variety of types, and we can return a different
type than the one we are called with.
In Python and Java we also need to specify the return type using one of the
SchemaRDD types, listed in Table 9-1. In Java these types are found in
org.apache.spark.sql.api.java.DataType and in Python we import the DataType.
In Examples 9-36 and 9-37, a very simple UDF computes the string length, which we
can use to find out the length of the tweets we're using.
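The string-length UDF described above can be sketched in Python as follows. This is a minimal sketch, not the text of Examples 9-36 and 9-37: it assumes a recent PySpark release in which UDFs are registered through spark.udf.register on a SparkSession, and the names str_len, register_str_len, and strLenPython are hypothetical, chosen only for illustration.

```python
def str_len(s):
    """Return the length of a string, or None when the input is a SQL NULL."""
    return None if s is None else len(s)


def register_str_len(spark, name="strLenPython"):
    """Register str_len as a SQL-callable UDF on the given SparkSession.

    Assumes PySpark is installed. The import is deferred so that the plain
    Python function above can be exercised without a Spark installation.
    """
    from pyspark.sql.types import IntegerType

    # Python UDFs need an explicit return type; here the result is an integer
    # even though the input is a string.
    spark.udf.register(name, str_len, IntegerType())
    return name
```

After calling register_str_len(spark), a query such as spark.sql("SELECT strLenPython(text) FROM tweets LIMIT 10") could call the function from SQL, assuming a table named tweets with a text column has already been registered.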