This example uses the same CSV file that was used in a previous example. The basic steps are to create a SQLContext object in the Spark shell session to make Spark SQL functionality available, then import the data into a table and run some SQL against it. Here is the data from the CSV file /tmp/scala.csv; I have removed its header rows so that it contains only raw data:
[root@hc2nn fuel_consumption]# head scala.csv
2014,ACURA,ILX,COMPACT,2,4,AS5,Z,8.6,5.6,33,50,1440,166
2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,9.8,6.5,29,43,1660,191
2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,5,4.8,56,59,980,113
2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,11.2,7.7,25,37,1920,221
To use SQL in Spark, I enter the following command into my Spark shell:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
This creates a SQLContext from the default Spark context, sc, using the SQLContext class. Next, I import the sqlContext functionality, including its implicit conversions, so that it is available for the rest of the script:
scala> import sqlContext._
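This wildcard import brings in, among other things, the implicit conversion that turns an RDD of case class instances into a table-like SchemaRDD that Spark SQL can query. If you prefer a narrower import, the conversion can be pulled in on its own; the following is a minimal sketch, assuming a Spark 1.x SQLContext, where that implicit is named createSchemaRDD:
scala> import sqlContext.createSchemaRDD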
I define the schema using a case class to represent all of the comma-separated fields in the data file line. This
defines the number of fields, their name, order, and type:
scala> case class Vehicle(year: Int, manufacturer: String, model: String, vclass: String,
         engine: Double, cylinders: Int, fuel: String, consumption: String, clkm: Double,
         hlkm: Double, cmpg: Int, hmpg: Int, co2lyr: Int, co2gkm: Int)
Note
The column types are case-sensitive; for instance, “int” will cause an error, while “Int” will not.
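The problem is easy to reproduce in the shell; the following hypothetical definition (Bad is just an illustrative name) would be rejected because int is not a Scala type, whereas Int is:
scala> case class Bad(year: int)  // fails: use Int, not int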
The data types used here are Int, String, and Double to represent the data values in the CSV file. The order is important, as it should match the order of the data in each CSV file row. Also, reserved words (such as the Scala keyword class) need to be avoided, so I have used the column name vclass to describe my vehicle class.
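Had I used the reserved word directly, the Scala parser would have rejected the definition, which is why the field is called vclass; a hypothetical illustration:
scala> case class Bad(class: String)  // fails: class is a Scala keyword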
Now, I create an RDD of Vehicle records: I import the CSV file /tmp/scala.csv, split each line on the comma character, and convert each data column by type to match the columns in the schema above:
scala> val vehicle = sc.textFile("/tmp/scala.csv").map(_.split(",")).map(p => Vehicle(
p(0).trim.toInt,
p(1),
p(2),
p(3),
p(4).trim.toDouble,
p(5).trim.toInt,
p(6),
p(7),
p(8).trim.toDouble,
p(9).trim.toDouble,
p(10).trim.toInt,
p(11).trim.toInt,
p(12).trim.toInt,
p(13).trim.toInt))
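With the vehicle RDD defined, the remaining steps outlined at the start of this section are to register the data as a table and run some SQL against it. The following is a minimal sketch of those steps, assuming Spark 1.1 or later, where the implicit conversion imported above lets the RDD be registered directly via registerTempTable; the table name vehicle is simply my choice for illustration:
scala> vehicle.registerTempTable("vehicle")
scala> val compact = sqlContext.sql("SELECT manufacturer, model, co2gkm FROM vehicle WHERE vclass = 'COMPACT'")
scala> compact.collect().foreach(println)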
 