The top few lines of the data from my CSV file using the Linux head command are as follows:
[root@hc2nn fuel_consumption]# head scala.csv
MODEL,MANUFACTURER,MODEL,VEHICLE CLASS,ENGINE SIZE,CYLINDERS,TRANSMISSION,FUEL,FUEL CONSUMPTION,,,,FUEL,CO2 EMISSIONS
YEAR,,,,(L),,,TYPE,CITY (L/100 km),HWY (L/100 km),CITY (mpg),HWY (mpg),(L/year),(g/km)
2014,ACURA,ILX,COMPACT,2,4,AS5,Z,8.6,5.6,33,50,1440,166
2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,9.8,6.5,29,43,1660,191
2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,5,4.8,56,59,980,113
2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,11.2,7.7,25,37,1920,221
I first copy this file to the /tmp directory on HDFS, so that Scala can access it, by using the HDFS file system
copyFromLocal command:
[root@hc2nn fuel_consumption]# hdfs dfs -copyFromLocal scala.csv /tmp/scala.csv
Note that when the Spark shell is used, a special variable named sc is created. Called a SparkContext, this variable
represents the connection to the Spark cluster. So, at my Scala shell prompt (scala>), I use the sc variable in the
following command to read the scala.csv file into memory:
scala> val myFile = sc.textFile("/tmp/scala.csv")
14/09/09 19:55:21 INFO storage.MemoryStore: ensureFreeSpace(74240) called with curMem=155704,
maxMem=309225062
14/09/09 19:55:21 INFO storage.MemoryStore: Block broadcast_1 stored as values to memory (estimated
size 72.5 KB, free 294.7 MB)
myFile: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at <console>:12
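Before running any counts, it can be useful to confirm that the RDD holds the expected data. A minimal check, assuming the same Spark shell session (where sc and myFile already exist), prints the first two lines of the in-memory file:

```scala
// Print the first two lines of the RDD to verify the CSV loaded correctly.
// Assumes the same Spark shell session, where myFile was created above.
myFile.take(2).foreach(println)
```

Because take(2) returns a small Array[String] to the driver, it is safe for a quick inspection; avoid calling it with large values on big data sets.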
The next command produces a line count on the file, now represented in memory by the variable myFile:
scala> myFile.count()
14/09/09 19:55:41 INFO spark.SparkContext: Job finished: count at <console>:15, took 3.174464234 s
res1: Long = 1069
The result indicates that there are 1,069 lines in the file. The Spark-based line count can be checked against the
original file on the Linux file system. To do so, I use the Linux wc (word count) command with a -l switch to confirm
the count of 1,069 lines:
[root@hc2nn fuel_consumption]# wc -l scala.csv
1069 scala.csv
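Note that the 1,069 lines include the two header rows shown by the head command earlier. A sketch of counting only the data records, assuming the same Spark shell session and that every data row in this file begins with the model year 2014, would look like this:

```scala
// Count only the data rows by skipping the two header lines.
// Assumption: every data record in this file starts with the year "2014".
val dataRows = myFile.filter(line => line.startsWith("2014"))
dataRows.count()  // 1,069 total lines minus the 2 header rows would give 1,067
```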
The following Spark shell Scala command counts the number of instances of the string “ACURA” in the
in-memory file:
scala> myFile.filter(line => line.contains("ACURA")).count()
14/09/09 19:58:10 INFO spark.SparkContext: Job finished: count at <console>:15, took 2.815524655 s
res0: Long = 12
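The same filter-and-count pattern extends naturally to aggregation. A sketch, assuming the same Spark shell session and that the manufacturer is the second comma-separated field, counts the records per manufacturer using map and reduceByKey:

```scala
// Count records per manufacturer (the second CSV field).
// Assumes the same Spark shell session; skips the two header rows.
val counts = myFile
  .filter(line => line.startsWith("2014"))  // keep data rows only
  .map(line => (line.split(",")(1), 1))     // pair: (manufacturer, 1)
  .reduceByKey(_ + _)                       // sum the counts per manufacturer
counts.collect().foreach(println)
```

Here collect() brings the results back to the driver, which is fine for a small set of distinct manufacturers but should be avoided when the result set is large.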
 