Contextualizing the Data
A key activity during the discovery process is building layer upon layer of context around the data. As you add more and more types of data, the mashed-up datasets form a multi-layered interpretation engine for the original dataset. For example, if a customer browses a website extensively before making a purchase, a great deal of micro-context is stored in the webpage events that precede the purchase. Before the purchase event, these micro-contexts are largely meaningless, because many activities on the website are irrelevant and inconsequential for analysis. Once the purchase is made, however, some of that micro-context suddenly becomes much more important: if you have captured the full trail of these micro-contexts, you can reconstruct the sequence of events that led to a successful purchase. You can then apply this model to the ongoing web activities of other customers to estimate the likelihood of a purchase, or, through well-timed interventions, influence a customer to make one.
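As a minimal sketch in Python, the snippet below shows one way to capture and label such micro-contexts: raw clickstream events are grouped into per-customer trails, and each trail is marked by whether it ended in a purchase, producing the kind of labeled sequences a purchase-likelihood model could be trained on. The field names (customer_id, event_type, timestamp) and the event vocabulary are illustrative assumptions, not a prescribed schema.

from collections import defaultdict

def build_trails(events):
    """Group clickstream events by customer, ordered by time."""
    trails = defaultdict(list)
    for e in sorted(events, key=lambda e: e["timestamp"]):
        trails[e["customer_id"]].append(e)
    return trails

def label_trails(trails):
    """Turn each trail into (event sequence, purchased?): the micro-context
    captured before the purchase event, plus the outcome label."""
    labeled = []
    for trail in trails.values():
        purchased = any(e["event_type"] == "purchase" for e in trail)
        sequence = [e["event_type"] for e in trail if e["event_type"] != "purchase"]
        labeled.append((sequence, purchased))
    return labeled

# Hypothetical events for two customers.
events = [
    {"customer_id": 1, "event_type": "view_product", "timestamp": 10},
    {"customer_id": 1, "event_type": "read_review",  "timestamp": 20},
    {"customer_id": 1, "event_type": "purchase",     "timestamp": 30},
    {"customer_id": 2, "event_type": "view_product", "timestamp": 15},
]

print(label_trails(build_trails(events)))
# [(['view_product', 'read_review'], True), (['view_product'], False)]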
To Sample or Not to Sample
Exposing a complete data set (however big it may be) to a simple algorithm often gives better results than exposing a sample of the data to a sophisticated algorithm: interesting insights can be derived from very small populations within a larger data set, and those populations can be missed entirely when only a sample of the data is analyzed. The analytics community is divided between these two conflicting approaches, analyzing the complete data set versus working from a sample.
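A quick back-of-the-envelope calculation makes the risk concrete. Assuming a uniform random sample and treating each record's inclusion as roughly independent (a reasonable approximation when the full data set is very large), the chance that a small but interesting sub-population is missed entirely is (1 - sample fraction) raised to the size of that sub-population. The Python sketch below uses hypothetical numbers.

def prob_sample_misses_group(group_size: int, sample_fraction: float) -> float:
    """Probability that a uniform random sample contains none of the
    group's members, treating each inclusion as independent."""
    return (1.0 - sample_fraction) ** group_size

# Hypothetical: 50 records of a rare behavior buried in a billion-row data set.
for fraction in (0.01, 0.05, 0.10):
    p_miss = prob_sample_misses_group(group_size=50, sample_fraction=fraction)
    print(f"A {fraction:.0%} sample misses the entire group with probability {p_miss:.2f}")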
Suppose you have a certain amount of data and you look for events of a particular type within it. You can expect events of this type to occur even if the data is completely random, and the number of occurrences grows as the size of the data grows. These occurrences are "bogus" in the sense that they have no cause other than the fact that random data will always contain some unusual features that look significant but are not. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data.
Bonferroni's principle helps us avoid treating random occurrences as if they were
real. Calculate the expected number of occurrences of the events you are looking for, on
the assumption that data is random. If this number is significantly larger than the number
of real instances you hope to find, then you must expect almost anything you find to
be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.
This observation is the informal statement of Bonferroni's principle.
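As a hedged, concrete illustration, suppose we scan a year of purchase logs for pairs of customers who bought the same item on the same day, hoping to uncover coordinated behavior. The numbers and the model in the Python sketch below are hypothetical and deliberately simple (each customer makes one purchase, chosen uniformly and independently over items and days), but the check is exactly the one described above: compare the count expected under pure chance with the count of real instances you hope to find.

from math import comb

# Hypothetical scale of the search.
num_customers = 100_000
num_items = 10_000
num_days = 365

# Under the simplistic random model, the chance that two given customers
# happen to buy the same item on the same day is 1 / (items * days).
prob_chance_match = 1 / (num_items * num_days)

# Every pair of customers is a candidate, so the expected number of
# purely coincidental matches is (number of pairs) * (chance per pair).
candidate_pairs = comb(num_customers, 2)
expected_bogus = candidate_pairs * prob_chance_match

expected_real = 20  # how many genuinely coordinated pairs we hope to find

print(f"Expected coincidental pairs: {expected_bogus:,.0f}")  # roughly 1,370
print(f"Expected real pairs:         {expected_real}")
if expected_bogus > expected_real:
    print("Most flagged pairs would be statistical artifacts; tighten the criterion.")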
Big Data and Master Data Management (MDM)
In the big data world, the data itself takes four different forms: data at rest, data in motion, data in many forms, and data in doubt. In addition, three styles of data integration are prevalent in any enterprise scenario: bulk data movement, real-time integration, and federation.
Bulk data integration involves the extraction, transformation, and loading of data from multiple sources into one or more target databases. Key capabilities of bulk integration include extreme performance and parallel processing. Batch windows continue