By clicking the Executors tab at the top of the Spark Application interface, I can examine details of the application's
tasks, as shown in Figure 9-8. I can see the spread of tasks across the cluster nodes and the task times by node, as well
as a list of tasks, the nodes they run on, their execution times, and their statuses. The Summary Metrics section also
provides minimum, maximum, and percentile figures for measures such as serialization time, duration, and delay.
Figure 9-8. Spark Application interface shows job executors
As you can see from these simple examples, the Spark cluster is easy to install, set up, and use. To investigate
Spark in greater detail, visit the Apache Software Foundation website for Apache Spark at spark.apache.org .
Spark SQL
Rather than using the default Spark context object sc directly, you can create a SQL context from it
and process CSV data using SQL. Spark SQL is an incubator project, however; it is not a mature offering at this
time. For instance, its APIs may yet change, and the version bundled with CDH5 no longer reflects Spark SQL's latest
features. Despite that, it does offer some interesting features for memory-based SQL cluster processing. I use a simple
example of CSV file processing with Spark SQL, based on the schema-based RDD example at https://spark.apache.org/
docs/1.0.0/sql-programming-guide.html .
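As a minimal sketch of that approach, the following Spark shell (Scala) commands create a SQLContext from the
existing Spark context sc, load a CSV file into a schema-based RDD, register it as a table, and query it with SQL.
The file path, the Person case class, and the column layout are illustrative assumptions, not taken from the text;
the API calls follow the Spark 1.0.0 programming guide cited above.

// Create a SQL context from the existing Spark context sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// An illustrative case class describing the CSV columns: name,age
case class Person(name: String, age: Int)

// Load a CSV file (hypothetical path), split each line on commas,
// and map the fields to Person objects to give the RDD a schema
val people = sc.textFile("hdfs:///tmp/people.csv")
               .map(_.split(","))
               .map(p => Person(p(0), p(1).trim.toInt))

// Register the schema RDD as a table so it can be queried with SQL
people.registerAsTable("people")

// Run a SQL query against the registered table and print the results
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.map(row => "Name: " + row(0) + ", Age: " + row(1)).collect().foreach(println)

Because the result of sqlContext.sql is itself an RDD, it can be transformed, cached, or saved like any other
Spark RDD.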
Note
For the latest details on Spark SQL, see the Spark website at spark.apache.org .