The top few lines of the data from my CSV file using the Linux head command are as follows:
[root@hc2nn fuel_consumption]# head scala.csv
MODEL,MANUFACTURER,MODEL,VEHICLE CLASS,ENGINE SIZE,CYLINDERS,TRANSMISSION,FUEL,FUEL CONSUMPTION,,,,FUEL,CO2 EMISSIONS
YEAR,,,,(L),,,TYPE,CITY (L/100 km),HWY (L/100 km),CITY (mpg),HWY (mpg),(L/year),(g/km)
2014,ACURA,ILX,COMPACT,2,4,AS5,Z,8.6,5.6,33,50,1440,166
2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,9.8,6.5,29,43,1660,191
2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,5,4.8,56,59,980,113
2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,11.2,7.7,25,37,1920,221
I first copy this file to the /tmp directory on HDFS, so that Scala can access it, by using the HDFS file system
copyFromLocal command:
[root@hc2nn fuel_consumption]# hdfs dfs -copyFromLocal scala.csv /tmp/scala.csv
Note that when the Spark shell is used, a special variable named sc is created. Called a SparkContext, this variable
represents the connection to the Spark cluster. So, at my Scala shell prompt (scala>), I use the sc variable in the
following command to read the scala.csv file into memory:
scala> val myFile = sc.textFile("/tmp/scala.csv")
14/09/09 19:55:21 INFO storage.MemoryStore: ensureFreeSpace(74240) called with curMem=155704,
maxMem=309225062
14/09/09 19:55:21 INFO storage.MemoryStore: Block broadcast_1 stored as values to memory (estimated
size 72.5 KB, free 294.7 MB)
myFile: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at <console>:12
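Before running any counts, it can be useful to confirm that the RDD holds the expected data. A minimal check, assuming the same Spark shell session (where sc and myFile already exist), prints the first two lines of the in-memory file:

```scala
// Print the first two lines of the RDD to verify the CSV loaded correctly.
// Assumes the same Spark shell session, where myFile was created above.
myFile.take(2).foreach(println)
```

Because take(2) returns a small Array[String] to the driver, it is safe for a quick inspection; avoid calling it with large values on big data sets.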
The next command produces a line count on the file, now represented in memory by the variable myFile:
scala> myFile.count()
14/09/09 19:55:41 INFO spark.SparkContext: Job finished: count at <console>:15, took 3.174464234 s
res1: Long = 1069
The result indicates that there are 1,069 lines in the file. The Spark-based line count can be checked against the
original file on the Linux file system. To do so, I use the Linux wc (word count) command with a -l switch to confirm
the count of 1,069 lines:
[root@hc2nn fuel_consumption]# wc -l scala.csv
1069 scala.csv
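Note that the 1,069 lines include the two header rows shown by the head command earlier. A sketch of counting only the data records, assuming the same Spark shell session and that every data row in this file begins with the model year 2014, would look like this:

```scala
// Count only the data rows by skipping the two header lines.
// Assumption: every data record in this file starts with the year "2014".
val dataRows = myFile.filter(line => line.startsWith("2014"))
dataRows.count()  // 1,069 total lines minus the 2 header rows would give 1,067
```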
The following Spark shell Scala command counts the number of instances of the string “ACURA” in the
in-memory file:
scala> myFile.filter(line => line.contains("ACURA")).count()
14/09/09 19:58:10 INFO spark.SparkContext: Job finished: count at <console>:15, took 2.815524655 s
res0: Long = 12
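The same filter-and-count pattern extends naturally to aggregation. A sketch, assuming the same Spark shell session and that the manufacturer is the second comma-separated field, counts the records per manufacturer using map and reduceByKey:

```scala
// Count records per manufacturer (the second CSV field).
// Assumes the same Spark shell session; skips the two header rows.
val counts = myFile
  .filter(line => line.startsWith("2014"))  // keep data rows only
  .map(line => (line.split(",")(1), 1))     // pair: (manufacturer, 1)
  .reduceByKey(_ + _)                       // sum the counts per manufacturer
counts.collect().foreach(println)
```

Here collect() brings the results back to the driver, which is fine for a small set of distinct manufacturers but should be avoided when the result set is large.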
 