This example uses the same CSV file that was used in a previous example. The basic steps are to create a SQLContext object in the Spark shell session to make Spark SQL functionality available, then import the data into a table and run some SQL against it. Here is the data from the CSV file /tmp/scala.csv; I have removed its header rows so that it contains only raw data:
[root@hc2nn fuel_consumption]# head scala.csv
2014,ACURA,ILX,COMPACT,2,4,AS5,Z,8.6,5.6,33,50,1440,166
2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,9.8,6.5,29,43,1660,191
2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,5,4.8,56,59,980,113
2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,11.2,7.7,25,37,1920,221
To use SQL in Spark, I enter the following command into my Spark shell:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
This creates a SQLContext from the default Spark context, sc, using the SQLContext class. Next, I import the sqlContext functionality, including its implicit conversions, so that it is available for the rest of the script:
scala> import sqlContext._
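This wildcard import brings in, among other things, the implicit conversion that turns an RDD of case class instances into a table-like SchemaRDD that Spark SQL can query. If you prefer a narrower import, the conversion can be pulled in on its own; the following is a minimal sketch, assuming a Spark 1.x SQLContext, where that implicit is named createSchemaRDD:
scala> import sqlContext.createSchemaRDD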
I define the schema using a case class to represent all of the comma-separated fields in the data file line. This
defines the number of fields, their name, order, and type:
scala> case class Vehicle(year: Int, manufacturer: String, model: String, vclass: String,
         engine: Double, cylinders: Int, fuel: String, consumption: String, clkm: Double,
         hlkm: Double, cmpg: Int, hmpg: Int, co2lyr: Int, co2gkm: Int)
Note
The column types are case-sensitive; for instance, “int” will cause an error, while “Int” will not.
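The problem is easy to reproduce in the shell; the following hypothetical definition (Bad is just an illustrative name) would be rejected because int is not a Scala type, whereas Int is:
scala> case class Bad(year: int)  // fails: use Int, not int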
The data types used here are Int, String, and Double to represent the data values in the CSV file. The order is important, as it should match the order of the data in each CSV file row. Also, reserved words (such as the Scala keyword class) need to be avoided, so I have used the column name vclass to describe my vehicle class.
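Had I used the reserved word directly, the Scala parser would have rejected the definition, which is why the field is called vclass; a hypothetical illustration:
scala> case class Bad(class: String)  // fails: class is a Scala keyword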
Now, I create an RDD of Vehicle records: I import the CSV file /tmp/scala.csv, split each line on the comma character, and convert each data column by type to match the columns in the schema above:
scala> val vehicle = sc.textFile("/tmp/scala.csv").map(_.split(",")).map(p => Vehicle(
p(0).trim.toInt,
p(1),
p(2),
p(3),
p(4).trim.toDouble,
p(5).trim.toInt,
p(6),
p(7),
p(8).trim.toDouble,
p(9).trim.toDouble,
p(10).trim.toInt,
p(11).trim.toInt,
p(12).trim.toInt,
p(13).trim.toInt))
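With the vehicle RDD defined, the remaining steps outlined at the start of this section are to register the data as a table and run some SQL against it. The following is a minimal sketch of those steps, assuming Spark 1.1 or later, where the implicit conversion imported above lets the RDD be registered directly via registerTempTable; the table name vehicle is simply my choice for illustration:
scala> vehicle.registerTempTable("vehicle")
scala> val compact = sqlContext.sql("SELECT manufacturer, model, co2gkm FROM vehicle WHERE vclass = 'COMPACT'")
scala> compact.collect().foreach(println)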
 