Data Processing Applications
The other main use case of Spark can be described in the context of the engineer
persona. For our purposes here, we think of engineers as a large class of software
developers who use Spark to build production data processing applications. These
developers usually have an understanding of the principles of software engineering,
such as encapsulation, interface design, and object-oriented programming. They
frequently have a degree in computer science. They use their engineering skills to design
and build software systems that implement a business use case.
For engineers, Spark provides a simple way to parallelize these applications across
clusters, and hides the complexity of distributed systems programming, network
communication, and fault tolerance. The system gives them enough control to monitor,
inspect, and tune applications while allowing them to implement common tasks
quickly. The modular nature of the API (based on passing distributed collections of
objects) makes it easy to factor work into reusable libraries and test it locally.
Spark's users choose to use it for their data processing applications because it
provides a wide variety of functionality, is easy to learn and use, and is mature and
reliable.
A Brief History of Spark
Spark is an open source project that has been built and is maintained by a thriving
and diverse community of developers. If you or your organization are trying Spark
for the first time, you might be interested in the history of the project. Spark started
in 2009 as a research project in the UC Berkeley RAD Lab, later to become the
AMPLab. The researchers in the lab had previously been working on Hadoop
MapReduce, and observed that MapReduce was inefficient for iterative and interactive
computing jobs. Thus, from the beginning, Spark was designed to be fast for
interactive queries and iterative algorithms, bringing in ideas like support for in-memory
storage and efficient fault recovery.
Research papers about Spark were published at academic conferences, and soon after
its creation in 2009 it was already 10-20× faster than MapReduce for certain jobs.
Some of Spark's first users were other groups inside UC Berkeley, including machine
learning researchers such as the Mobile Millennium project, which used Spark to
monitor and predict traffic congestion in the San Francisco Bay Area. In a very short
time, however, many external organizations began using Spark, and today, over 50
organizations list themselves on the Spark PoweredBy page, and dozens speak about
their use cases at Spark community events such as Spark Meetups and the Spark
Summit. In addition to UC Berkeley, major contributors to Spark include Databricks,
Yahoo!, and Intel.