Introduction to Data Analysis with Spark - Learning Spark


CHAPTER 1

Introduction to Data Analysis with Spark

This chapter provides a high-level overview of what Apache Spark is. If you are already familiar with Apache Spark and its components, feel free to jump ahead to Chapter 2.

What Is Apache Spark?

Apache Spark is a cluster computing platform designed to be fast and general-purpose.

On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasets, as it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
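
For readers who have not met the MapReduce model that Spark extends, the following is a minimal sketch of its three phases (map, shuffle, reduce) as a word count in plain Python. It is only an illustration on a single machine, not Spark's API and not how Hadoop implements the model; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group the emitted values by key, the step a framework performs
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values collected for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["spark is fast", "spark is general"]
    print(word_count(sample))
```

In a real cluster each phase runs in parallel across many machines. A multi-step job expressed this way is a chain of such passes, which is where keeping intermediate data in memory can pay off.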

On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming. By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine different processing types, which is often necessary in production data analysis pipelines. In addition, it reduces the management burden of maintaining separate tools.

Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.
