Introduction to Data Analysis with Spark - Learning Spark

Data Processing Applications

The other main use case of Spark can be described in the context of the engineer persona. For our purposes here, we think of engineers as a large class of software developers who use Spark to build production data processing applications. These developers usually have an understanding of the principles of software engineering, such as encapsulation, interface design, and object-oriented programming. They frequently have a degree in computer science. They use their engineering skills to design and build software systems that implement a business use case.

For engineers, Spark provides a simple way to parallelize these applications across clusters, and hides the complexity of distributed systems programming, network communication, and fault tolerance. The system gives them enough control to monitor, inspect, and tune applications while allowing them to implement common tasks quickly. The modular nature of the API (based on passing distributed collections of objects) makes it easy to factor work into reusable libraries and test it locally.
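As a rough illustration of that last point, here is a minimal PySpark sketch. It assumes the Python RDD API; the names count_words and run_local_example, the application name, and the sample data are hypothetical and chosen only for this example. The library function receives a distributed collection and returns another one, so it can be reused from any application and exercised in local mode without a cluster.

```python
from operator import add


def count_words(lines):
    """Reusable library function (hypothetical name).

    Takes a distributed collection of text lines (an RDD of str) and
    returns an RDD of (word, count) pairs.
    """
    return (lines
            .flatMap(lambda line: line.split())  # one record per word
            .map(lambda word: (word, 1))         # pair each word with a count of 1
            .reduceByKey(add))                   # sum the counts for each word


def run_local_example():
    """Exercise count_words on a tiny in-memory dataset in local mode."""
    # Imported here so that count_words can be imported without starting Spark.
    from pyspark import SparkContext

    # "local[2]" runs Spark inside this process with two worker threads.
    sc = SparkContext("local[2]", "count-words-local-test")
    try:
        lines = sc.parallelize(["spark is fast", "spark is simple"])
        return dict(count_words(lines).collect())
    finally:
        sc.stop()
```

Calling run_local_example() on a machine with PySpark installed should return a dictionary in which "spark" and "is" each map to 2 and "fast" and "simple" each map to 1. Because count_words only depends on the collection it is handed, the same function can be called unchanged on data loaded from a cluster.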

Spark's users choose to use it for their data processing applications because it provides a wide variety of functionality, is easy to learn and use, and is mature and reliable.

A Brief History of Spark

Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers. If you or your organization are trying Spark for the first time, you might be interested in the history of the project. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.

Research papers about Spark were published at academic conferences, and soon after its creation in 2009 it was already 10-20× faster than MapReduce for certain jobs.

Some of Spark's first users were other groups inside UC Berkeley, including machine learning researchers such as the Mobile Millennium project, which used Spark to monitor and predict traffic congestion in the San Francisco Bay Area. In a very short time, however, many external organizations began using Spark, and today, over 50 organizations list themselves on the Spark PoweredBy page, and dozens speak about their use cases at Spark community events such as Spark Meetups and the Spark Summit. In addition to UC Berkeley, major contributors to Spark include Databricks, Yahoo!, and Intel.
