Process streaming data: The key to real-time analytics is that we cannot defer work on our data until later; we must analyze it the moment it arrives. Stream processing (also known as streaming data processing) is the term for analyzing data instantly as it is collected. Actions that you can perform in real time include splitting data, merging it, performing calculations, connecting it with outside data sources, forking data to multiple destinations, and more, as the sketch that follows illustrates.
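To make this concrete, here is a minimal sketch in plain Python of a per-record pipeline, not tied to any particular streaming product. The record format, the REFERENCE_DATA lookup table, and the function names are all illustrative assumptions.

# Hypothetical per-record pipeline: each event is split, enriched from
# an outside data source, transformed, and forked to two destinations
# as it arrives -- nothing waits for a complete batch.

REFERENCE_DATA = {"user-1": "gold", "user-2": "silver"}  # outside data source

def split(record):
    # Split a raw comma-separated event into named fields.
    user, amount = record.split(",")
    return {"user": user, "amount": float(amount)}

def enrich(event):
    # Connect the event with the outside reference data.
    event["tier"] = REFERENCE_DATA.get(event["user"], "unknown")
    return event

def process(stream):
    for record in stream:
        event = enrich(split(record))
        event["amount_with_tax"] = event["amount"] * 1.08  # a calculation
        yield event

incoming = ["user-1,100.0", "user-2,25.5"]   # stands in for a live stream
archive, alerts = [], []
for event in process(incoming):
    archive.append(event)                    # destination 1: archive everything
    if event["amount"] > 50:
        alerts.append(event)                 # destination 2: fork large events
print(alerts)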
Explore, analyze, and visualize data: Now that the data has been processed, it is reliably delivered to the databases that power your reports, dashboards, and ad-hoc queries. Specialized streaming-data algorithms and advanced data-visualization techniques can then be employed to generate insights.
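One classic example of such a streaming algorithm is computing summary statistics in a single pass, so the raw data never needs to be stored or revisited. The sketch below uses Welford's online algorithm; the sample values simply stand in for a live stream.

# One-pass (streaming) mean and variance via Welford's online
# algorithm: the full data set is never stored, so this works on
# unbounded streams.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [12.0, 15.5, 9.8, 14.2]:   # stands in for a live stream
    stats.update(value)
print(stats.mean, stats.variance())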
In real-time systems, scoring is an extremely important activity. It is triggered by actions (by consumers at a website or by an operational system through an API), and the resulting actions or messages are brokered through the consumption channels. During the scoring activity, some real-time systems will use the same hardware that is used for data ingestion, but they will not use the same data: at this phase of the process, the scoring rules are kept separate from the ingested data. Note also that at this phase the limitations of Hadoop become apparent. Hadoop today is not particularly well suited for real-time scoring, although it can be used for "near real-time" applications such as populating large tables or pre-computing scores.
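As a rough illustration of keeping the scoring rules apart from the ingested data, the sketch below scores one event with pre-computed logistic-regression coefficients. The coefficients, feature names, and values are illustrative assumptions, not taken from any real system.

import math

# Scoring rules live here, separate from the event stream.
SCORING_RULES = {"intercept": -1.2,
                 "weights": {"visits": 0.4, "cart_value": 0.01}}

def score(event, rules=SCORING_RULES):
    # Triggered per event, e.g. behind an API endpoint.
    z = rules["intercept"]
    for feature, weight in rules["weights"].items():
        z += weight * event.get(feature, 0.0)
    return 1.0 / (1.0 + math.exp(-z))    # probability between 0 and 1

print(score({"visits": 3, "cart_value": 80.0}))   # one consumer action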
Data is always changing, so there is a need to refresh both the data and the model built on the original data. The existing scripts or programs used to run the data and build the models can be reused to refresh them. Simple exploratory data analysis is also recommended, along with periodic (weekly, daily, or hourly) model refreshes.
Refreshing the model by re-ingesting the data and re-running the scripts will only work for a limited time, since the underlying data, and even its structure, will eventually change so much that the model is no longer valid. Important variables can become non-significant, non-significant variables can become important, and new data sources are continuously emerging. If the model's accuracy measure begins to drift, you have to go back and re-examine the data; if necessary, rebuild the model from scratch.
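The refresh-and-monitor loop might look like the following sketch. The data loader, the trainer, and the 0.75 accuracy floor are placeholders for your own pipeline, not recommendations.

import random

ACCURACY_FLOOR = 0.75   # arbitrary example threshold

def load_recent_data():
    # Placeholder: re-ingest the latest labeled data.
    features = [[random.random()] for _ in range(100)]
    labels = [random.choice([0, 1]) for _ in range(100)]
    return features, labels

def train_model(features, labels):
    # Placeholder: re-run the original training script on fresh data.
    majority = round(sum(labels) / len(labels))
    return lambda x: majority

def evaluate(model, features, labels):
    # Fraction of recent labels the model still predicts correctly.
    return sum(model(x) == y for x, y in zip(features, labels)) / len(labels)

def refresh_if_drifting(model):
    features, labels = load_recent_data()
    if evaluate(model, features, labels) < ACCURACY_FLOOR:
        model = train_model(features, labels)   # rebuild on drift
    return model

model = train_model(*load_recent_data())
model = refresh_if_drifting(model)   # schedule weekly, daily, or hourly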
Actions: Once you start spotting patterns and anomalies in the streaming data, you need to channel these insights to the appropriate consumption channels. This is the layer that most people see. It's the layer at which business analysts, C-suite executives, and customers interact with the real-time big data analytics system.
Real-time big data analytics is an iterative process involving multiple
tools and systems.
The Hadoop and NoSQL Conundrum
In earlier chapters we discussed at length how the Hadoop framework helps analyze massive data sets by distributing the computation load across many processes and machines. Hadoop embraces a MapReduce framework, which means analytics are performed as batch processes. Depending on the quantity of data and the complexity of the computation, running a set of Hadoop jobs can take anywhere from a few minutes to many days.
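To make the batch model concrete, here is a minimal word count written in the map-reduce style in plain Python. Real Hadoop distributes the map and reduce phases across many machines; this single-process sketch only mirrors the shape of the computation.

from collections import defaultdict

def map_phase(documents):
    # Map: emit a (key, value) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle and reduce: group pairs by key and sum the values.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data analytics", "real time big data"]
print(reduce_phase(map_phase(docs)))   # {'big': 2, 'data': 2, ...}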
Batch processing tool sets like Hadoop are great for doing one-off reports, a recurring schedule of
 