Spark supports reading all the files in a given directory and doing
wildcard expansion on the input (e.g., part-*.txt ). This is useful
since large datasets are often spread across multiple files, especially
if other files (like success markers) may be in the same directory.
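Spark's textFile() accepts shell-style glob patterns in the input path. As a rough illustration of which files such a pattern selects (using Python's built-in glob rather than Spark itself; the directory layout below is made up for the example), a part-*.txt pattern picks up the data files while skipping markers like _SUCCESS:

```python
import glob
import os
import tempfile

# Create a directory that looks like typical Spark output: several
# part files plus a success marker.
out = tempfile.mkdtemp()
for name in ["part-00000.txt", "part-00001.txt", "_SUCCESS"]:
    open(os.path.join(out, name), "w").close()

# The part-*.txt pattern matches only the data files, not the marker.
matched = sorted(os.path.basename(p)
                 for p in glob.glob(os.path.join(out, "part-*.txt")))
print(matched)
```

In Spark itself the equivalent call would be something like sc.textFile("hdfs://.../part-*.txt").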
Saving text files
Outputting text files is also quite simple. The method saveAsTextFile(), demonstrated in Example 5-5, takes a path and will output the contents of the RDD to that path. Spark treats the path as a directory and writes multiple files underneath it, which allows it to write the output from multiple nodes. With this method we don't get to control which files end up with which segments of our data, but there are other output formats that do allow this.
Example 5-5. Saving as a text file in Python
result.saveAsTextFile(outputFile)
JSON
JSON is a popular semistructured data format. The simplest way to load JSON data is
by loading the data as a text file and then mapping over the values with a JSON
parser. Likewise, we can use our preferred JSON serialization library to convert the values to strings, which we can then write out. In Java and Scala we can also work with JSON data using a custom Hadoop format. “JSON” on page 172 also shows how to load JSON data with Spark SQL.
Loading JSON
Loading the data as a text file and then parsing the JSON data is an approach that we can use in all of the supported languages. This works assuming that you have one JSON record per row; if you have multiline JSON files, you will instead have to load whole files and then parse each one. If constructing a JSON parser is expensive in your language, you can use mapPartitions() to reuse the parser; see “Working on a Per-Partition Basis” on page 107 for details.
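As a rough sketch of the mapPartitions() approach, the helper below (a hypothetical function, not one of the book's numbered examples) constructs a decoder once per partition and reuses it for every record in that partition's iterator. In Python the built-in json module is cheap to construct, so this matters more in languages where parser setup is costly, but the shape of the code is the same:

```python
import json

def parse_json_partition(lines):
    # Build the (potentially expensive) parser once per partition,
    # then reuse it for every record in the iterator.
    decoder = json.JSONDecoder()
    for line in lines:
        line = line.strip()
        if line:
            yield decoder.decode(line)

# With an RDD of text lines, this would be applied as (sketch):
#   data = input.mapPartitions(parse_json_partition)
# Locally, we can exercise it on a plain iterator of lines:
records = list(parse_json_partition(['{"name": "Holden"}', '', '{"name": "Sparky"}']))
print(records)
```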
There are a wide variety of JSON libraries available for the three languages we are looking at, but for simplicity's sake we are considering only one library per language. In Python we will use the built-in library (Example 5-6), and in Java and Scala we will use Jackson (Examples 5-7 and 5-8). These libraries have been chosen because they perform reasonably well and are also relatively simple. If you spend a lot of time in the parsing stage, look at other JSON libraries for Scala or for Java.
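Since the numbered examples are not reproduced here, a minimal sketch of per-line parsing with Python's built-in json module might look like the following (parse_line and the filtering step are illustrative, not the book's code); mapping malformed rows to None lets us filter them out downstream rather than failing the whole job:

```python
import json

def parse_line(line):
    # Return the parsed record, or None for malformed input so that
    # bad rows can be filtered out downstream.
    try:
        return json.loads(line)
    except ValueError:
        return None

# With Spark this would look something like:
#   data = sc.textFile(inputFile).map(parse_line).filter(lambda rec: rec is not None)
# Locally, we can exercise the parser on a couple of sample lines:
parsed = [parse_line(l) for l in ['{"x": 1}', 'not json']]
print(parsed)
```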