Accessing publicly available datasets
Fortunately, while commercially sensitive data can be hard to come by, a number of useful
datasets are publicly available. Many of these are used as benchmark datasets for specific
types of machine learning problems. Examples of common data sources include:
UCI Machine Learning Repository: This is a collection of almost 300 datasets of
various types and sizes for tasks including classification, regression, clustering,
and recommender systems. The list is available at http://archive.ics.uci.edu/ml/.
Amazon AWS public datasets: This is a set of often very large datasets that can
be accessed via Amazon S3. These datasets include the Human Genome Project,
the Common Crawl web corpus, Wikipedia data, and Google Books Ngrams.
Information on these datasets can be found at http://aws.amazon.com/publicdatasets/.
Kaggle: This is a collection of datasets used in machine learning competitions run
by Kaggle. Areas include classification, regression, ranking, recommender systems,
and image analysis. These datasets can be found under the Competitions section
at http://www.kaggle.com/competitions.
KDnuggets: This has a detailed list of public datasets, including some of those
mentioned earlier. The list is available at http://www.kdnuggets.com/datasets/index.html.
Tip
Many other sources of public datasets exist, depending on the specific domain and
machine learning task. You may also have access to some interesting academic or
commercial data of your own!
To illustrate a few key concepts related to data processing, transformation, and feature
extraction in Spark, we will download a commonly used dataset for movie recommendations;
this dataset is known as the MovieLens dataset. As it is applicable to recommender systems
as well as potentially other machine learning tasks, it serves as a useful example dataset.
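As a rough illustration of what working with such a dataset can look like, the following minimal Python sketch parses rating records. It assumes the MovieLens 100k variant, whose ratings file stores one tab-separated record per line (user ID, movie ID, rating, timestamp); the text above does not say which variant is used, so treat that format as an assumption. The names Rating, parse_rating, and load_ratings are hypothetical helpers introduced only for this example.

```python
from collections import namedtuple

# Hypothetical record type for one rating (not part of any library).
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])


def parse_rating(line):
    """Parse one line, assumed to be: user id, movie id, rating, timestamp (tab-separated)."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


def load_ratings(path):
    """Read every non-empty line of a local ratings file into Rating records."""
    with open(path, encoding="utf-8") as handle:
        return [parse_rating(line) for line in handle if line.strip()]


# Illustrative sample lines in the assumed format.
sample = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]
ratings = [parse_rating(line) for line in sample]

# Mean rating over the sample records.
mean_rating = sum(r.rating for r in ratings) / len(ratings)
print(mean_rating)
```

In Spark, a parsing function of this kind would typically be passed to map on an RDD of lines (for example, one created with textFile), rather than applied to a local list as it is here.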
Note
Spark's machine learning library, MLlib, has been under heavy development since its
inception, and unlike the Spark core, it is still not in a fully stable state with regard to its
overall API and design.