Example 5-36. JSON loading with Spark SQL in Java
SchemaRDD tweets = hiveCtx.jsonFile(jsonFile);
tweets.registerTempTable("tweets");
SchemaRDD results = hiveCtx.sql("SELECT user.name, text FROM tweets");
We discuss more about how to load JSON data with Spark SQL and access its schema in "JSON" on page 172. In addition, Spark SQL supports quite a bit more than loading data, including querying the data, combining it in more complex ways with RDDs, and running custom functions over it, which we will cover in Chapter 9.
Databases
Spark can access several popular databases using either their Hadoop connectors or
custom Spark connectors. In this section, we will show four common connectors.
Java Database Connectivity
Spark can load data from any relational database that supports Java Database Connectivity (JDBC), including MySQL, Postgres, and other systems. To access this data, we construct an org.apache.spark.rdd.JdbcRDD and provide it with our SparkContext and the other parameters. Example 5-37 walks you through using JdbcRDD for a MySQL database.
Example 5-37. JdbcRDD in Scala
def createConnection() = {
  Class.forName("com.mysql.jdbc.Driver").newInstance()
  DriverManager.getConnection("jdbc:mysql://localhost/test?user=holden")
}

def extractValues(r: ResultSet) = {
  (r.getInt(1), r.getString(2))
}

val data = new JdbcRDD(sc,
  createConnection,
  "SELECT * FROM panda WHERE ? <= id AND id <= ?",
  lowerBound = 1, upperBound = 3, numPartitions = 2,
  mapRow = extractValues)
println(data.collect().toList)
JdbcRDD takes several parameters:
• First, we provide a function to establish a connection to our database. This lets each node create its own connection to load data over, after performing any configuration required to connect.
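The query in Example 5-37 contains two ? placeholders, which JdbcRDD fills in with per-partition bounds derived from lowerBound, upperBound, and numPartitions. As a rough sketch of that behavior (the even splitting of the key range follows JdbcRDD's documented partitioning; the class and method names here are illustrative, not part of Spark's API):

```java
// Illustrative sketch: how a JdbcRDD-style loader might split the
// [lowerBound, upperBound] key range evenly across partitions.
public class BoundsSketch {
    // Returns {start, end} bounds for partition i of numPartitions.
    static long[] boundsFor(long lower, long upper, int numPartitions, int i) {
        long length = upper - lower + 1;
        long start = lower + (i * length) / numPartitions;
        long end = lower + ((i + 1) * length) / numPartitions - 1;
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        // With lowerBound = 1, upperBound = 3, numPartitions = 2 (as in
        // Example 5-37), partition 0 queries ids 1..1 and partition 1
        // queries ids 2..3.
        for (int i = 0; i < 2; i++) {
            long[] b = boundsFor(1, 3, 2, i);
            System.out.println("partition " + i + ": "
                + b[0] + " <= id AND id <= " + b[1]);
        }
    }
}
```

Each partition then runs the query with its own bounds substituted for the placeholders, so the table is read in parallel without overlap.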