Application Lifespan
The lifespan of a YARN application can vary dramatically: from a short-lived application
of a few seconds to a long-running application that runs for days or even months. Rather
than look at how long the application runs for, it's useful to categorize applications in
terms of how they map to the jobs that users run. The simplest case is one application per
user job, which is the approach that MapReduce takes.
The second model is to run one application per workflow or user session of (possibly
unrelated) jobs. This approach can be more efficient than the first, since containers can be
reused between jobs, and there is also the potential to cache intermediate data between
jobs. Spark is an example that uses this model.
The third model is a long-running application that is shared by different users. Such an
application often acts in some kind of coordination role. For example, Apache Slider has a
long-running application master for launching other applications on the cluster. This
approach is also used by Impala (see SQL-on-Hadoop Alternatives) to provide a proxy
application that the Impala daemons communicate with to request cluster resources. The
“always on” application master means that users have very low-latency responses to their
queries since the overhead of starting a new application master is avoided. [37]
Building YARN Applications
Writing a YARN application from scratch is fairly involved, but in many cases is not
necessary, as it is often possible to use an existing application that fits the bill. For example,
if you are interested in running a directed acyclic graph (DAG) of jobs, then Spark or Tez
is appropriate; or for stream processing, Spark, Samza, or Storm works. [38]
There are a couple of projects that simplify the process of building a YARN application.
Apache Slider, mentioned earlier, makes it possible to run existing distributed applications
on YARN. Users can run their own instances of an application (such as HBase) on a
cluster, independently of other users, which means that different users can run different
versions of the same application. Slider provides controls to change the number of nodes
an application is running on, and to suspend and then resume a running application.
Apache Twill is similar to Slider, but in addition provides a simple programming model
for developing distributed applications on YARN. Twill allows you to define cluster
processes as an extension of a Java Runnable, then runs them in YARN containers on the
cluster. Twill also provides support for, among other things, real-time logging (log events
from runnables are streamed back to the client) and command messages (sent from the
client to runnables).
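To make the "Runnable plus logging and commands" idea concrete, here is a minimal plain-Java sketch of that shape. It does not use Twill's API: the names CommandableRunnable, Worker, and runDemo are hypothetical, a thread stands in for a YARN container, and in-memory queues stand in for the log stream and the command channel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class RunnableModelSketch {

    /** Hypothetical interface: a Runnable that can also receive command messages. */
    interface CommandableRunnable extends Runnable {
        void handleCommand(String command);
    }

    /** Hypothetical cluster process; the log queue stands in for the stream back to the client. */
    static class Worker implements CommandableRunnable {
        private final BlockingQueue<String> commands = new LinkedBlockingQueue<>();
        private final BlockingQueue<String> logStream;

        Worker(BlockingQueue<String> logStream) {
            this.logStream = logStream;
        }

        @Override
        public void handleCommand(String command) {
            // Called from the client side; the command is queued for the running process.
            commands.add(command);
        }

        @Override
        public void run() {
            logStream.add("started");
            try {
                while (true) {
                    String command = commands.take(); // blocks until a command arrives
                    if (command.equals("stop")) {
                        break;
                    }
                    logStream.add("handled " + command);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            logStream.add("stopped");
        }
    }

    /** Runs one worker on a thread (standing in for a container) and returns its log events. */
    static List<String> runDemo() {
        BlockingQueue<String> logStream = new LinkedBlockingQueue<>();
        Worker worker = new Worker(logStream);
        Thread container = new Thread(worker);
        container.start();
        worker.handleCommand("refresh");
        worker.handleCommand("stop");
        try {
            container.join(); // wait for the worker to process "stop"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return new ArrayList<>(logStream);
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

In this model the client only ever touches two things, the command method and the log stream, which is the division of labor the paragraph above describes; a real Twill application would have the framework supply the container placement and the transport for both.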