Process streaming data: The key to real-time analytics is that we cannot defer work on our data until later; we must analyze it the moment it arrives. Stream processing (also known as streaming data processing) is the term for analyzing data instantly as it is collected. Actions that you can perform in real time include splitting data, merging it, performing calculations, connecting it with outside data sources, forking data to multiple destinations, and more, as the sketch that follows illustrates.
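To make this concrete, here is a minimal sketch in plain Python of a per-record pipeline, not tied to any particular streaming product. The record format, the REFERENCE_DATA lookup table, and the function names are all illustrative assumptions.

# Hypothetical per-record pipeline: each event is split, enriched from
# an outside data source, transformed, and forked to two destinations
# as it arrives -- nothing waits for a complete batch.

REFERENCE_DATA = {"user-1": "gold", "user-2": "silver"}  # outside data source

def split(record):
    # Split a raw comma-separated event into named fields.
    user, amount = record.split(",")
    return {"user": user, "amount": float(amount)}

def enrich(event):
    # Connect the event with the outside reference data.
    event["tier"] = REFERENCE_DATA.get(event["user"], "unknown")
    return event

def process(stream):
    for record in stream:
        event = enrich(split(record))
        event["amount_with_tax"] = event["amount"] * 1.08  # a calculation
        yield event

incoming = ["user-1,100.0", "user-2,25.5"]   # stands in for a live stream
archive, alerts = [], []
for event in process(incoming):
    archive.append(event)                    # destination 1: archive everything
    if event["amount"] > 50:
        alerts.append(event)                 # destination 2: fork large events
print(alerts)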
Explore, analyze, and visualize data: Now that the data has been processed, it is reliably delivered to the databases that power your reports, dashboards, and ad-hoc queries. Specialized streaming-data algorithms and advanced data-visualization techniques can then be employed to generate insights.
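One classic example of such a streaming algorithm is computing summary statistics in a single pass, so the raw data never needs to be stored or revisited. The sketch below uses Welford's online algorithm; the sample values simply stand in for a live stream.

# One-pass (streaming) mean and variance via Welford's online
# algorithm: the full data set is never stored, so this works on
# unbounded streams.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [12.0, 15.5, 9.8, 14.2]:   # stands in for a live stream
    stats.update(value)
print(stats.mean, stats.variance())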
In real-time systems, scoring is an extremely important activity. It is triggered by actions (by consumers at a website or by an operational system through an API), and the resulting actions or messages are brokered through the consumption channels. During the scoring activity, some real-time systems will use the same hardware that is used for data ingestion, but they will not use the same data: at this phase of the process, the scoring rules are kept separate from the ingested data. Note also that at this phase the limitations of Hadoop become apparent. Hadoop today is not particularly well suited for real-time scoring, although it can be used for "near real-time" applications such as populating large tables or pre-computing scores.
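As a rough illustration of keeping the scoring rules apart from the ingested data, the sketch below scores one event with pre-computed logistic-regression coefficients. The coefficients, feature names, and values are illustrative assumptions, not taken from any real system.

import math

# Scoring rules live here, separate from the event stream.
SCORING_RULES = {"intercept": -1.2,
                 "weights": {"visits": 0.4, "cart_value": 0.01}}

def score(event, rules=SCORING_RULES):
    # Triggered per event, e.g. behind an API endpoint.
    z = rules["intercept"]
    for feature, weight in rules["weights"].items():
        z += weight * event.get(feature, 0.0)
    return 1.0 / (1.0 + math.exp(-z))    # probability between 0 and 1

print(score({"visits": 3, "cart_value": 80.0}))   # one consumer action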
Data is always changing, so there is a need to refresh both the data and the model built on the original data. The existing scripts or programs used to run the data and build the models can be reused to refresh them. Simple exploratory data analysis is also recommended, along with periodic (weekly, daily, or hourly) model refreshes.
Refreshing the model by re-ingesting the data and re-running the scripts will only work for a limited time, since the underlying data, and even its structure, will eventually change so much that the model is no longer valid. Important variables can become non-significant, non-significant variables can become important, and new data sources are continuously emerging. If the model's accuracy measure begins to drift, you have to go back and re-examine the data; if necessary, rebuild the model from scratch.
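The refresh-and-monitor loop might look like the following sketch. The data loader, the trainer, and the 0.75 accuracy floor are placeholders for your own pipeline, not recommendations.

import random

ACCURACY_FLOOR = 0.75   # arbitrary example threshold

def load_recent_data():
    # Placeholder: re-ingest the latest labeled data.
    features = [[random.random()] for _ in range(100)]
    labels = [random.choice([0, 1]) for _ in range(100)]
    return features, labels

def train_model(features, labels):
    # Placeholder: re-run the original training script on fresh data.
    majority = round(sum(labels) / len(labels))
    return lambda x: majority

def evaluate(model, features, labels):
    # Fraction of recent labels the model still predicts correctly.
    return sum(model(x) == y for x, y in zip(features, labels)) / len(labels)

def refresh_if_drifting(model):
    features, labels = load_recent_data()
    if evaluate(model, features, labels) < ACCURACY_FLOOR:
        model = train_model(features, labels)   # rebuild on drift
    return model

model = train_model(*load_recent_data())
model = refresh_if_drifting(model)   # schedule weekly, daily, or hourly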
Actions: Once you start spotting patterns and anomalies in the streaming data, you need to channel these insights to the appropriate consumption channels. This is the layer that most people see. It's the layer at which business analysts, C-suite executives, and customers interact with the real-time big data analytics system.
Real-time big data analytics is an iterative process involving multiple
tools and systems.
The Hadoop and NoSQL Conundrum
In earlier chapters we discussed at length how the Hadoop framework helps analyze massive data sets by distributing the computation load across many processes and machines. Hadoop embraces a MapReduce framework, which means analytics are performed as batch processes. Depending on the quantity of data and the complexity of the computation, running a set of Hadoop jobs can take anywhere from a few minutes to many days.
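To make the batch model concrete, here is a minimal word count written in the map-reduce style in plain Python. Real Hadoop distributes the map and reduce phases across many machines; this single-process sketch only mirrors the shape of the computation.

from collections import defaultdict

def map_phase(documents):
    # Map: emit a (key, value) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle and reduce: group pairs by key and sum the values.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data analytics", "real time big data"]
print(reduce_phase(map_phase(docs)))   # {'big': 2, 'data': 2, ...}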
Batch processing tool sets like Hadoop are great for doing one-off reports, a recurring schedule of
 