Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark


Accessing publicly available datasets

Fortunately, while commercially sensitive data can be hard to come by, a number of useful datasets are publicly available. Many of these are used as benchmark datasets for specific types of machine learning problems. Examples of common data sources include:

• UCI Machine Learning Repository: This is a collection of almost 300 datasets of various types and sizes for tasks including classification, regression, clustering, and recommender systems. The list is available at http://archive.ics.uci.edu/ml/.

• Amazon AWS public datasets: This is a set of often very large datasets that can be accessed via Amazon S3. These datasets include the Human Genome Project, the Common Crawl web corpus, Wikipedia data, and Google Books Ngrams. Information on these datasets can be found at http://aws.amazon.com/publicdatasets/.

• Kaggle: This is a collection of datasets used in machine learning competitions run by Kaggle. Areas include classification, regression, ranking, recommender systems, and image analysis. These datasets can be found under the Competitions section at http://www.kaggle.com/competitions.

• KDnuggets: This has a detailed list of public datasets, including some of those mentioned earlier. The list is available at http://www.kdnuggets.com/datasets/index.html.

Tip

There are many other resources for finding public datasets, depending on the specific domain and machine learning task. Hopefully, you also have access to some interesting academic or commercial data of your own!

To illustrate a few key concepts related to data processing, transformation, and feature extraction in Spark, we will download a commonly used dataset for movie recommendations, known as the MovieLens dataset. As it is applicable to recommender systems as well as potentially other machine learning tasks, it serves as a useful example dataset.
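As a rough illustration of the kind of processing such a dataset calls for, the following minimal sketch parses rating records in plain Python. It assumes the tab-separated layout of the MovieLens 100k ratings file (user ID, movie ID, rating, timestamp). The `Rating` record, the `parse_rating` helper, and the sample lines are hypothetical and invented for illustration; they are not taken from the dataset.

```python
from collections import namedtuple

# Hypothetical record type for one rating; the field order assumes the
# MovieLens 100k ratings layout: user ID, movie ID, rating, timestamp.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])


def parse_rating(line, sep="\t"):
    """Parse one delimited text line into a Rating record."""
    user_id, movie_id, rating, timestamp = line.strip().split(sep)
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Made-up sample lines for illustration only (not real MovieLens data).
sample_lines = [
    "1\t10\t3\t881250949",
    "1\t20\t4\t891717742",
    "2\t10\t1\t878887116",
]

ratings = [parse_rating(line) for line in sample_lines]
mean_rating = sum(r.rating for r in ratings) / len(ratings)
print(len(ratings), mean_rating)
```

In Spark, a function like `parse_rating` could be passed to `map` on an RDD created with `sc.textFile`, so the same parsing logic would apply to the full file.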

Note

Spark's machine learning library, MLlib, has been under heavy development since its inception, and unlike the Spark core, it is still not in a fully stable state with regard to its overall API and design.
