Introduction to Data Analysis with Spark - Learning Spark

In 2011, the AMPLab started to develop higher-level components on Spark, such as
Shark (Hive on Spark)¹ and Spark Streaming. These and other components are
sometimes referred to as the Berkeley Data Analytics Stack (BDAS).

Spark was first open sourced in March 2010 and was transferred to the Apache
Software Foundation in June 2013, where it is now a top-level project.

Spark Versions and Releases

Since its creation, Spark has been a very active project and community, with the
number of contributors growing with each release. Spark 1.0 had over 100
individual contributors. Though the level of activity has grown rapidly, the
community continues to release updated versions of Spark on a regular schedule.
Spark 1.0 was released in May 2014. This book focuses primarily on Spark 1.1.0
and beyond, though most of the concepts and examples also work in earlier
versions.

Storage Layers for Spark

Spark can create distributed datasets from any file stored in the Hadoop
distributed filesystem (HDFS) or other storage systems supported by the Hadoop
APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.).
It's important to remember that Spark does not require Hadoop; it simply has
support for storage systems implementing the Hadoop APIs. Spark supports text
files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat. We will
look at interacting with these data sources in Chapter 5.
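As a rough illustration of how one loading call can target different storage
layers, the sketch below uses the URI scheme to tell them apart. The helper
names `storage_layer` and `count_lines` and the `STORAGE_LAYERS` table are
hypothetical, introduced only for this example; the one Spark call used is
`SparkContext.textFile`, and `sc` is assumed to be an existing SparkContext.

```python
from urllib.parse import urlparse

# Hypothetical lookup table for this sketch (not part of Spark): maps a URI
# scheme to the storage layer it usually denotes.
STORAGE_LAYERS = {
    "": "default filesystem",   # no scheme: resolved by the Hadoop configuration
    "file": "local filesystem",
    "hdfs": "HDFS",
    "s3n": "Amazon S3",
    "s3a": "Amazon S3",
}


def storage_layer(uri):
    """Return a readable name for the storage layer a URI points at."""
    scheme = urlparse(uri).scheme
    return STORAGE_LAYERS.get(scheme, "unknown (%s)" % scheme)


def count_lines(sc, uri):
    """Load a text file through Spark and count its lines.

    `sc` is assumed to be a live SparkContext. The same call is used
    whatever storage layer the URI refers to.
    """
    return sc.textFile(uri).count()
```

With a SparkContext available, `count_lines(sc, uri)` would be called the same
way for a `file://`, `hdfs://`, or S3 URI; only the URI changes, provided the
matching Hadoop filesystem support is on the classpath.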

¹ Shark has been replaced by Spark SQL.
