Text Files
Text files are very simple to load from and save to with Spark. When we load a single
text file as an RDD, each input line becomes an element in the RDD. We can also
load multiple whole text files at the same time into a pair RDD, with the key being
each file's name and the value being its contents.
Loading text files
Loading a single text file is as simple as calling the textFile() function on our
SparkContext with the path to the file, as you can see in Examples 5-1 through 5-3. If
we want to control the number of partitions, we can also specify minPartitions.
Example 5-1. Loading a text file in Python
input = sc.textFile("file:///home/holden/repos/spark/README.md")
Example 5-2. Loading a text file in Scala
val input = sc.textFile("file:///home/holden/repos/spark/README.md")
Example 5-3. Loading a text file in Java
JavaRDD<String> input = sc.textFile("file:///home/holden/repos/spark/README.md");
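If we do want to control the number of partitions, we can pass minPartitions explicitly. The following Scala line is a minimal sketch reusing the README path from the examples above; the value 4 is an arbitrary illustration, not a recommendation:
// Ask Spark for at least four input partitions; Spark may create more for large files
val input = sc.textFile("file:///home/holden/repos/spark/README.md", minPartitions = 4)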
Multipart inputs in the form of a directory containing all of the parts can be handled
in two ways. We can use the same textFile method and pass it a directory, and it
will load all of the parts into our RDD. Sometimes it's important to know which file
each piece of input came from (such as time data with the key in the file), or we need
to process an entire file at a time. If our files are small enough, then we can use the
SparkContext.wholeTextFiles() method and get back a pair RDD where the key is
the name of the input file.
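To make the two approaches concrete, here is a minimal Scala sketch; the salesFiles directory is a hypothetical path, and both calls accept a directory of part files:
// One String element per line, pooled across every file in the directory
val lines = sc.textFile("file:///home/holden/salesFiles")
// One (fileName, fileContents) pair per file
val filesWithNames = sc.wholeTextFiles("file:///home/holden/salesFiles")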
wholeTextFiles() can be very useful when each file represents a certain time
period's data. If we had files representing sales data from different periods, we could
easily compute the average for each period, as shown in Example 5-4.
Example 5-4. Average value per file in Scala
val input = sc.wholeTextFiles("file:///home/holden/salesFiles")
val result = input.mapValues { y =>
  // y is one file's full contents, assumed to be space-separated numbers
  val nums = y.split(" ").map(x => x.toDouble)
  nums.sum / nums.size.toDouble
}
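To inspect the per-file averages, a short usage sketch follows; collect() is only reasonable here on the assumption that the result holds one small pair per file:
result.collect().foreach { case (file, avg) => println(s"$file: $avg") }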