Introduction to Data Analysis with Spark - Learning Spark

but we can roughly classify them into two categories, data science and data applications. Of course, these are imprecise disciplines and usage patterns, and many folks have skills from both, sometimes playing the role of the investigating data scientist, and then “changing hats” and writing a hardened data processing application. Nonetheless, it can be illuminating to consider the two groups and their respective use cases separately.

Data Science Tasks

Data science, a discipline that has been emerging over the past few years, centers on analyzing data. While there is no standard definition, for our purposes a data scientist is somebody whose main task is to analyze and model data. Data scientists may have experience with SQL, statistics, predictive modeling (machine learning), and programming, usually in Python, Matlab, or R. Data scientists also have experience with techniques necessary to transform data into formats that can be analyzed for insights (sometimes referred to as data wrangling).

Data scientists use their skills to analyze data with the goal of answering a question or discovering insights. Oftentimes, their workflow involves ad hoc analysis, so they use interactive shells (versus building complex applications) that let them see results of queries and snippets of code in the least amount of time. Spark's speed and simple APIs shine for this purpose, and its built-in libraries mean that many algorithms are available out of the box.

Spark supports the different tasks of data science with a number of components. The Spark shell makes it easy to do interactive data analysis using Python or Scala. Spark SQL also has a separate SQL shell that can be used to do data exploration using SQL, or Spark SQL can be used as part of a regular Spark program or in the Spark shell. Machine learning and data analysis are supported through the MLlib library. In addition, there is support for calling out to external programs in Matlab or R. Spark enables data scientists to tackle problems with larger data sizes than they could before with tools like R or pandas.

Sometimes, after the initial exploration phase, the work of a data scientist will be “productized,” or extended, hardened (i.e., made fault-tolerant), and tuned to become a production data processing application, which itself is a component of a business application. For example, the initial investigation of a data scientist might lead to the creation of a production recommender system that is integrated into a web application and used to generate product suggestions to users. Often it is a different person or team that leads the process of productizing the work of the data scientists, and that person is often an engineer.
