In 2011, the AMPLab started to develop higher-level components on Spark, such as
Shark (Hive on Spark)¹ and Spark Streaming. These and other components are sometimes
referred to as the Berkeley Data Analytics Stack (BDAS).
Spark was first open sourced in March 2010, and was transferred to the Apache Software
Foundation in June 2013, where it is now a top-level project.
Spark Versions and Releases
Since its creation, Spark has been a very active project and community, with the
number of contributors growing with each release. Spark 1.0 had over 100 individual
contributors. Though the level of activity has rapidly grown, the community continues
to release updated versions of Spark on a regular schedule. Spark 1.0 was released
in May 2014. This topic focuses primarily on Spark 1.1.0 and beyond, though most of
the concepts and examples also work in earlier versions.
Storage Layers for Spark
Spark can create distributed datasets from any file stored in the Hadoop distributed
filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including
your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). It's important to
remember that Spark does not require Hadoop; it simply has support for storage
systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, Avro,
Parquet, and any other Hadoop InputFormat. We will look at interacting with these
data sources in Chapter 5.
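
As a rough illustration of the point above, the following Python sketch shows how the same call, SparkContext.textFile, can read from different storage systems, with the URI scheme of the path selecting the backend. The host names, bucket name, and file paths are placeholders, the helper functions storage_scheme and load_lines are hypothetical names introduced here, and the s3n:// scheme is an assumption that depends on the Hadoop version and configuration in use.

```python
from urllib.parse import urlparse

# Placeholder paths for illustration only; none of these hosts, buckets,
# or files are taken from the text.
EXAMPLE_PATHS = {
    "local": "file:///tmp/README.md",
    "hdfs": "hdfs://namenode:8020/data/README.md",
    "s3": "s3n://my-bucket/data/README.md",  # scheme depends on your Hadoop setup
}


def storage_scheme(path):
    """Return the URI scheme of `path`, or "default" when none is given.

    A path without a scheme is resolved against whatever default
    filesystem the cluster is configured with.
    """
    return urlparse(path).scheme or "default"


def load_lines(sc, path):
    """Create a distributed dataset of text lines from `path`.

    `sc` is expected to be a pyspark SparkContext; only its textFile
    method is used here.
    """
    return sc.textFile(path)


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local", "StorageLayersSketch")
    try:
        path = EXAMPLE_PATHS["local"]
        lines = load_lines(sc, path)
        print(storage_scheme(path), lines.count())
    finally:
        sc.stop()
```

Calling main() requires a working Spark installation and an existing file at the local path. Swapping in one of the other example paths should leave the rest of the code unchanged, provided the corresponding storage system is reachable and configured.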
1 Shark has been replaced by Spark SQL.
 