Example 5-36. JSON loading with Spark SQL in Java
SchemaRDD tweets = hiveCtx.jsonFile(jsonFile);
tweets.registerTempTable("tweets");
SchemaRDD results = hiveCtx.sql("SELECT user.name, text FROM tweets");
We discuss more about how to load JSON data with Spark SQL and access its schema in "JSON" on page 172. In addition, Spark SQL supports quite a bit more than loading data, including querying the data, combining it in more complex ways with RDDs, and running custom functions over it, which we will cover in Chapter 9.
Databases
Spark can access several popular databases using either their Hadoop connectors or
custom Spark connectors. In this section, we will show four common connectors.
Java Database Connectivity
Spark can load data from any relational database that supports Java Database Connectivity (JDBC), including MySQL, Postgres, and other systems. To access this data, we construct an org.apache.spark.rdd.JdbcRDD and provide it with our SparkContext and the other parameters. Example 5-37 walks you through using JdbcRDD for a MySQL database.
Example 5-37. JdbcRDD in Scala
def createConnection() = {
  Class.forName("com.mysql.jdbc.Driver").newInstance()
  DriverManager.getConnection("jdbc:mysql://localhost/test?user=holden")
}

def extractValues(r: ResultSet) = {
  (r.getInt(1), r.getString(2))
}

val data = new JdbcRDD(sc,
  createConnection,
  "SELECT * FROM panda WHERE ? <= id AND id <= ?",
  lowerBound = 1, upperBound = 3, numPartitions = 2,
  mapRow = extractValues)
println(data.collect().toList)
JdbcRDD takes several parameters:
• First, we provide a function to establish a connection to our database. This lets each node create its own connection to load data over, after performing any configuration required to connect.
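The query in Example 5-37 contains two ? placeholders, which JdbcRDD fills in with per-partition bounds derived from lowerBound, upperBound, and numPartitions. As a rough sketch of that behavior (the even splitting of the key range follows JdbcRDD's documented partitioning; the class and method names here are illustrative, not part of Spark's API):

```java
// Illustrative sketch: how a JdbcRDD-style loader might split the
// [lowerBound, upperBound] key range evenly across partitions.
public class BoundsSketch {
    // Returns {start, end} bounds for partition i of numPartitions.
    static long[] boundsFor(long lower, long upper, int numPartitions, int i) {
        long length = upper - lower + 1;
        long start = lower + (i * length) / numPartitions;
        long end = lower + ((i + 1) * length) / numPartitions - 1;
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        // With lowerBound = 1, upperBound = 3, numPartitions = 2 (as in
        // Example 5-37), partition 0 queries ids 1..1 and partition 1
        // queries ids 2..3.
        for (int i = 0; i < 2; i++) {
            long[] b = boundsFor(1, 3, 2, i);
            System.out.println("partition " + i + ": "
                + b[0] + " <= id AND id <= " + b[1]);
        }
    }
}
```

Each partition then runs the query with its own bounds substituted for the placeholders, so the table is read in parallel without overlap.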