Stream processing
In lay terms, batch processing is the execution of one or more jobs that are programmed to require minimal human intervention; the required input/output parameters and resources are preconfigured with the jobs. The history of batch processing can be traced back to punch cards and mainframe computing.
Consider, for example, a satellite channel application that archives viewing logs for many years. At the end of each year (or perhaps every half year), the provider wants to know how many users in a particular age range watched specific programs in the prime-time slot. Because the data volume is huge and largely unstructured, it cannot be streamed and held in memory to perform such computations. The records are also largely unrelated to one another, so processing them in batch requires predefined steps that can run in parallel, as illustrated in the sketch below.
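The yearly viewership report described above can be phrased as a simple batch job. The following is a minimal sketch in plain Python; the CSV column names (user_id, age, program, hour), the age buckets, and the 8-11 PM prime-time window are all assumptions for illustration. It scans the archived files one record at a time instead of holding the whole data set in memory, and in a real system each file would typically be handed to a separate worker for parallel processing.

    # Minimal batch-aggregation sketch: count prime-time viewers per age bucket
    # and program. Column names, age buckets, and hours are assumed, not real.
    import csv
    import glob
    from collections import Counter

    AGE_BUCKETS = [(18, 24), (25, 34), (35, 49), (50, 120)]  # assumed ranges
    PRIMETIME_HOURS = range(20, 23)                          # assumed 8-11 PM slot

    def age_bucket(age):
        # Map a raw age to its reporting bucket.
        for low, high in AGE_BUCKETS:
            if low <= age <= high:
                return f"{low}-{high}"
        return "other"

    def process_batch(log_files):
        # Scan each archived log file sequentially so only one row is in memory
        # at a time; returns counts keyed by (age bucket, program).
        counts = Counter()
        for path in log_files:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    if int(row["hour"]) in PRIMETIME_HOURS:
                        counts[(age_bucket(int(row["age"])), row["program"])] += 1
        return counts

    if __name__ == "__main__":
        report = process_batch(glob.glob("archive/2023/*.csv"))
        for (bucket, program), viewers in report.most_common(10):
            print(f"{program:30s} {bucket:7s} {viewers}")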
With respect to large data, batch processing jobs can be categorized into three simple steps: extract, transform, and load.
This process is commonly referred to as ETL. Various ETL tools, such as Ab Initio, CloverETL, Pentaho, and Informatica, are available for data warehousing and analytics.
Another aspect of ETL systems is analytics. Imagine a system that needs to perform big data analytics where the input points are different applications and the system must generate a consolidated aggregation report. Here, data is extracted from the various input channels, then computed and transformed, before being loaded into a database. Figure 5-1 shows an example in which ETL-based analytics extract data from social media channels, financial applications, and server logs; transformation and computation are done on the engine side, and the output is finally loaded into the database.
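As a rough illustration of that flow, here is a minimal ETL sketch in plain Python. The three extractor functions and the SQLite target table are hypothetical stand-ins; a real pipeline would pull from social media APIs, financial applications, and server logs, and would load into whatever reporting database the organization uses.

    # Minimal ETL sketch: extract from stubbed sources, transform to a common
    # shape, and load into a hypothetical SQLite reporting table.
    import sqlite3
    from datetime import date

    def extract_social_media():
        # Stub: pretend these rows came from a social media API.
        return [{"source": "social", "metric": "mentions", "value": 1520}]

    def extract_financial():
        # Stub: pretend these rows came from a financial application export.
        return [{"source": "finance", "metric": "revenue", "value": 98000.0}]

    def extract_server_logs():
        # Stub: pretend these rows were parsed out of server logs.
        return [{"source": "logs", "metric": "errors", "value": 42}]

    def transform(rows):
        # Normalize every record into a common (day, source, metric, value) shape.
        today = date.today().isoformat()
        return [(today, r["source"], r["metric"], float(r["value"])) for r in rows]

    def load(rows, db_path="analytics.db"):
        # Load the consolidated records into the reporting table.
        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS daily_metrics "
                "(day TEXT, source TEXT, metric TEXT, value REAL)"
            )
            conn.executemany("INSERT INTO daily_metrics VALUES (?, ?, ?, ?)", rows)

    if __name__ == "__main__":
        raw = extract_social_media() + extract_financial() + extract_server_logs()
        load(transform(raw))

Each stage is a separate function so the extract step can be parallelized per source while the transform and load steps stay unchanged.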