Stream processing
In lay terms, batch processing is the execution of one or more jobs that are programmed to require minimal human intervention; the required input/output parameters and resources are preconfigured with the jobs. The history of batch processing can be traced back to punch cards and mainframe computing.
Consider, for example, a satellite channel application that archives viewing logs for many years. At the end of each year (or perhaps every half year), the provider wants to know how many users in a particular age range watched specific programs in the prime-time slot. Because the data volume is huge and largely unstructured, it cannot be streamed and held in memory to perform such computations. The records are also largely unrelated to one another, so processing them in batch requires predefined steps that can run in parallel, as illustrated in the sketch below.
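The yearly viewership report described above can be phrased as a simple batch job. The following is a minimal sketch in plain Python; the CSV column names (user_id, age, program, hour), the age buckets, and the 8-11 PM prime-time window are all assumptions for illustration. It scans the archived files one record at a time instead of holding the whole data set in memory, and in a real system each file would typically be handed to a separate worker for parallel processing.

    # Minimal batch-aggregation sketch: count prime-time viewers per age bucket
    # and program. Column names, age buckets, and hours are assumed, not real.
    import csv
    import glob
    from collections import Counter

    AGE_BUCKETS = [(18, 24), (25, 34), (35, 49), (50, 120)]  # assumed ranges
    PRIMETIME_HOURS = range(20, 23)                          # assumed 8-11 PM slot

    def age_bucket(age):
        # Map a raw age to its reporting bucket.
        for low, high in AGE_BUCKETS:
            if low <= age <= high:
                return f"{low}-{high}"
        return "other"

    def process_batch(log_files):
        # Scan each archived log file sequentially so only one row is in memory
        # at a time; returns counts keyed by (age bucket, program).
        counts = Counter()
        for path in log_files:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    if int(row["hour"]) in PRIMETIME_HOURS:
                        counts[(age_bucket(int(row["age"])), row["program"])] += 1
        return counts

    if __name__ == "__main__":
        report = process_batch(glob.glob("archive/2023/*.csv"))
        for (bucket, program), viewers in report.most_common(10):
            print(f"{program:30s} {bucket:7s} {viewers}")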
With respect to large data, batch processing jobs can be categorized into three simple steps: extract, transform, and load.
This process is commonly referred to as ETL. Various ETL tools, such as Ab Initio, CloverETL, Pentaho, and Informatica, are available for data warehousing and analytics.
Another aspect of ETL systems is analytics. Imagine a system that needs to perform big data analytics where the input points are different applications and the system must generate a consolidated aggregation report. Here, data is extracted from the various input channels, then computed and transformed, before being loaded into a database. Figure 5-1 shows an example in which ETL-based analytics extract data from social media channels, financial applications, and server logs; transformation and computation are done on the engine side, and the output is finally loaded into the database.
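As a rough illustration of that flow, here is a minimal ETL sketch in plain Python. The three extractor functions and the SQLite target table are hypothetical stand-ins; a real pipeline would pull from social media APIs, financial applications, and server logs, and would load into whatever reporting database the organization uses.

    # Minimal ETL sketch: extract from stubbed sources, transform to a common
    # shape, and load into a hypothetical SQLite reporting table.
    import sqlite3
    from datetime import date

    def extract_social_media():
        # Stub: pretend these rows came from a social media API.
        return [{"source": "social", "metric": "mentions", "value": 1520}]

    def extract_financial():
        # Stub: pretend these rows came from a financial application export.
        return [{"source": "finance", "metric": "revenue", "value": 98000.0}]

    def extract_server_logs():
        # Stub: pretend these rows were parsed out of server logs.
        return [{"source": "logs", "metric": "errors", "value": 42}]

    def transform(rows):
        # Normalize every record into a common (day, source, metric, value) shape.
        today = date.today().isoformat()
        return [(today, r["source"], r["metric"], float(r["value"])) for r in rows]

    def load(rows, db_path="analytics.db"):
        # Load the consolidated records into the reporting table.
        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS daily_metrics "
                "(day TEXT, source TEXT, metric TEXT, value REAL)"
            )
            conn.executemany("INSERT INTO daily_metrics VALUES (?, ?, ?, ?)", rows)

    if __name__ == "__main__":
        raw = extract_social_media() + extract_financial() + extract_server_logs()
        load(transform(raw))

Each stage is a separate function so the extract step can be parallelized per source while the transform and load steps stay unchanged.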