Chapter 3. Obtaining, Processing, and Preparing Data with Spark
Machine learning is an extremely broad field, and these days, applications can be found
across areas that include web and mobile applications, Internet of Things and sensor networks,
financial services, healthcare, and various scientific fields, to name just a few.
Therefore, the range of data available for potential use in machine learning is enormous. In
this book, we will focus mostly on business applications. In this context, the data available
often consists of data internal to an organization (such as transactional data for a financial
services company) as well as external data sources (such as financial asset price data for
the same financial services company).
For example, recall from Chapter 2, Designing a Machine Learning System, that the main
internal source of data for our hypothetical Internet business, MovieStream, consists of data
on the movies available on the site, the users of the service, and their behavior. This
includes data about movies and other content (for example, title, categories, description,
images, actors, and directors), user information (for example, demographics, location, and so
on), and user activity data (for example, web page views, title previews and views, ratings,
reviews, and social data such as likes, shares, and social network profiles on services
including Facebook and Twitter).
External data sources in this example might include weather and geolocation services,
third-party movie ratings and review sites such as IMDb and Rotten Tomatoes, and so on.
Generally speaking, it is quite difficult to obtain internal data from real-world
services and businesses, as it is commercially sensitive (in particular, data on purchasing
activity, user or customer behavior, and revenue) and of great potential value to the
organization concerned. For the same reason, it is often the most useful and interesting data
on which to apply machine learning: a model that makes accurate predictions
can be highly valuable (witness the success of machine learning competitions such as
the Netflix Prize and Kaggle).
In this book, we will make use of datasets that are publicly available to illustrate concepts
around data processing and training of machine learning models.
In this chapter, we will: