Data ingestion and storage
The first step in our machine learning pipeline will be taking in the data that we require for
training our models. Like many other businesses, MovieStream's data is typically generated
by user activity, other systems (this is commonly referred to as machine-generated data),
and external sources (for example, the time of day and weather during a particular user's
visit to the site).
This data can be ingested in various ways, for example, gathering user activity data from
browser and mobile application event logs or accessing external web APIs to collect data
on geolocation or weather.
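As a rough illustration of ingesting user activity from event logs, the sketch below parses
log lines into typed records. The log format (one JSON object per line) and the field names
user_id, action, and timestamp are assumptions made for this example; they are not taken
from MovieStream's actual systems.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ActivityEvent:
    """One user activity record (hypothetical schema for illustration)."""
    user_id: str
    action: str
    timestamp: str


def parse_events(lines: Iterable[str]) -> Iterator[ActivityEvent]:
    """Yield one ActivityEvent per well-formed JSON line, skipping bad lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
            yield ActivityEvent(
                user_id=str(record["user_id"]),
                action=str(record["action"]),
                timestamp=str(record["timestamp"]),
            )
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed or incomplete log lines are dropped in this sketch;
            # a real pipeline would more likely count or quarantine them.
            continue
```

Because parse_events accepts any iterable of strings, it can be fed an open file handle
directly, so a large log does not have to be read into memory at once.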
Once the collection mechanisms are in place, the data usually needs to be stored. This
includes the raw data, data resulting from intermediate processing, and final model results
to be used in production.
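One simple way to keep these three kinds of data apart is to write each to its own location.
The following is a minimal sketch that uses a local directory and JSON files as a stand-in
for whatever storage system is actually in use; the stage names and the save_stage helper
are hypothetical and exist only for this example.

```python
import json
from pathlib import Path
from typing import Any

# Hypothetical stage names mirroring the three kinds of data described above.
STAGES = ("raw", "intermediate", "model_results")


def save_stage(base_dir: Path, stage: str, name: str, payload: Any) -> Path:
    """Write a JSON-serializable payload to base_dir/<stage>/<name>.json."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    target_dir = base_dir / stage
    # Create the stage directory on first use.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.json"
    target.write_text(json.dumps(payload), encoding="utf-8")
    return target
```

Keeping the raw data separate from derived data means that intermediate results and model
outputs can be regenerated from the raw inputs if a processing step changes.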
Data storage can be complex and involve a wide variety of systems, including HDFS,
Amazon S3, and other filesystems; SQL databases such as MySQL or PostgreSQL;
distributed NoSQL data stores such as HBase, Cassandra, and DynamoDB; search engines
such as Solr or Elasticsearch; and streaming data systems such as Kafka, Flume, or
Amazon Kinesis.
For the purposes of this topic, we will assume that the relevant data is available to us, so we
will focus on the processing and modeling steps in the following pipeline.