Spark supports reading all the files in a given directory and doing
wildcard expansion on the input (e.g., part-*.txt ). This is useful
since large datasets are often spread across multiple files, especially
if other files (like success markers) may be in the same directory.
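Spark's textFile() accepts shell-style glob patterns in the input path. As a rough illustration of which files such a pattern selects (using Python's built-in glob rather than Spark itself; the directory layout below is made up for the example), a part-*.txt pattern picks up the data files while skipping markers like _SUCCESS:

```python
import glob
import os
import tempfile

# Create a directory that looks like typical Spark output: several
# part files plus a success marker.
out = tempfile.mkdtemp()
for name in ["part-00000.txt", "part-00001.txt", "_SUCCESS"]:
    open(os.path.join(out, name), "w").close()

# The part-*.txt pattern matches only the data files, not the marker.
matched = sorted(os.path.basename(p)
                 for p in glob.glob(os.path.join(out, "part-*.txt")))
print(matched)
```

In Spark itself the equivalent call would be something like sc.textFile("hdfs://.../part-*.txt").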
Saving text files
Outputting text files is also quite simple. The method saveAsTextFile(), demonstrated in Example 5-5, takes a path and will output the contents of the RDD to that path. Spark treats the path as a directory and writes multiple files underneath it, which allows it to write the output from multiple nodes. With this method we don't get to control which files end up with which segments of our data, but there are other output formats that do allow this.
Example 5-5. Saving as a text file in Python
result.saveAsTextFile(outputFile)
JSON
JSON is a popular semistructured data format. The simplest way to load JSON data is
by loading the data as a text file and then mapping over the values with a JSON
parser. Likewise, we can use our preferred JSON serialization library to convert the values to strings, which we can then write out. In Java and Scala we can also work with JSON data using a custom Hadoop format. “JSON” on page 172 also shows how to load JSON data with Spark SQL.
Loading JSON
Loading the data as a text file and then parsing the JSON data is an approach that we can use in all of the supported languages. This works assuming that you have one JSON record per row; if you have multiline JSON files, you will instead have to load whole files and then parse each one. If constructing a JSON parser is expensive in your language, you can use mapPartitions() to reuse the parser; see “Working on a Per-Partition Basis” on page 107 for details.
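As a rough sketch of the mapPartitions() approach, the helper below (a hypothetical function, not one of the book's numbered examples) constructs a decoder once per partition and reuses it for every record in that partition's iterator. In Python the built-in json module is cheap to construct, so this matters more in languages where parser setup is costly, but the shape of the code is the same:

```python
import json

def parse_json_partition(lines):
    # Build the (potentially expensive) parser once per partition,
    # then reuse it for every record in the iterator.
    decoder = json.JSONDecoder()
    for line in lines:
        line = line.strip()
        if line:
            yield decoder.decode(line)

# With an RDD of text lines, this would be applied as (sketch):
#   data = input.mapPartitions(parse_json_partition)
# Locally, we can exercise it on a plain iterator of lines:
records = list(parse_json_partition(['{"name": "Holden"}', '', '{"name": "Sparky"}']))
print(records)
```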
There are a wide variety of JSON libraries available for the three languages we are looking at, but for simplicity's sake we are considering only one library per language. In Python we will use the built-in library (Example 5-6), and in Java and Scala we will use Jackson (Examples 5-7 and 5-8). These libraries have been chosen because they perform reasonably well and are also relatively simple. If you spend a lot of time in the parsing stage, look at other JSON libraries for Scala or for Java.
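Since the numbered examples are not reproduced here, a minimal sketch of per-line parsing with Python's built-in json module might look like the following (parse_line and the filtering step are illustrative, not the book's code); mapping malformed rows to None lets us filter them out downstream rather than failing the whole job:

```python
import json

def parse_line(line):
    # Return the parsed record, or None for malformed input so that
    # bad rows can be filtered out downstream.
    try:
        return json.loads(line)
    except ValueError:
        return None

# With Spark this would look something like:
#   data = sc.textFile(inputFile).map(parse_line).filter(lambda rec: rec is not None)
# Locally, we can exercise the parser on a couple of sample lines:
parsed = [parse_line(l) for l in ['{"x": 1}', 'not json']]
print(parsed)
```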