Contextualizing the Data
A key activity during the discovery process is building layer upon layer of context around the data. As you add more and more types of data, the mashed-up datasets form a multi-layered interpretation engine for the original dataset. For example, if a customer browses a website extensively before making a purchase, a great deal of micro-context is stored in the webpage events that precede the purchase. Before the purchase event, these micro-contexts are largely meaningless, because many activities on the website are irrelevant and inconsequential for analysis. Once the purchase is made, however, some of that micro-context suddenly becomes much more important: if you have captured the full trail of these micro-contexts, you can reconstruct the sequence of events that led to a successful purchase. You can then apply this model to the ongoing web activities of other customers to estimate the likelihood of a purchase, or, through well-timed interventions, influence a customer to make one.
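As a minimal sketch in Python, the snippet below shows one way to capture and label such micro-contexts: raw clickstream events are grouped into per-customer trails, and each trail is marked by whether it ended in a purchase, producing the kind of labeled sequences a purchase-likelihood model could be trained on. The field names (customer_id, event_type, timestamp) and the event vocabulary are illustrative assumptions, not a prescribed schema.

from collections import defaultdict

def build_trails(events):
    """Group clickstream events by customer, ordered by time."""
    trails = defaultdict(list)
    for e in sorted(events, key=lambda e: e["timestamp"]):
        trails[e["customer_id"]].append(e)
    return trails

def label_trails(trails):
    """Turn each trail into (event sequence, purchased?): the micro-context
    captured before the purchase event, plus the outcome label."""
    labeled = []
    for trail in trails.values():
        purchased = any(e["event_type"] == "purchase" for e in trail)
        sequence = [e["event_type"] for e in trail if e["event_type"] != "purchase"]
        labeled.append((sequence, purchased))
    return labeled

# Hypothetical events for two customers.
events = [
    {"customer_id": 1, "event_type": "view_product", "timestamp": 10},
    {"customer_id": 1, "event_type": "read_review",  "timestamp": 20},
    {"customer_id": 1, "event_type": "purchase",     "timestamp": 30},
    {"customer_id": 2, "event_type": "view_product", "timestamp": 15},
]

print(label_trails(build_trails(events)))
# [(['view_product', 'read_review'], True), (['view_product'], False)]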
To Sample or Not to Sample
Exposing a complete data set (however big it may be) to a simple algorithm often gives better results than exposing a sample of the data to a sophisticated algorithm: interesting insights can be derived from very small populations within a larger data set, and those populations can be missed entirely when only a sample of the data is analyzed. The analytics community is divided between these two conflicting approaches, analyzing the complete data set versus working from a sample.
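A quick back-of-the-envelope calculation makes the risk concrete. Assuming a uniform random sample and treating each record's inclusion as roughly independent (a reasonable approximation when the full data set is very large), the chance that a small but interesting sub-population is missed entirely is (1 - sample fraction) raised to the size of that sub-population. The Python sketch below uses hypothetical numbers.

def prob_sample_misses_group(group_size: int, sample_fraction: float) -> float:
    """Probability that a uniform random sample contains none of the
    group's members, treating each inclusion as independent."""
    return (1.0 - sample_fraction) ** group_size

# Hypothetical: 50 records of a rare behavior buried in a billion-row data set.
for fraction in (0.01, 0.05, 0.10):
    p_miss = prob_sample_misses_group(group_size=50, sample_fraction=fraction)
    print(f"A {fraction:.0%} sample misses the entire group with probability {p_miss:.2f}")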
Suppose you have a certain amount of data and you look for events of a particular type within it. You can expect events of this type to occur even if the data is completely random, and the number of occurrences grows as the size of the data grows. These occurrences are "bogus" in the sense that they have no cause other than the fact that random data will always contain some unusual features that look significant but are not. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data.
Bonferroni's principle helps us avoid treating random occurrences as if they were
real. Calculate the expected number of occurrences of the events you are looking for, on
the assumption that data is random. If this number is significantly larger than the number
of real instances you hope to find, then you must expect almost anything you find to
be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.
This observation is the informal statement of Bonferroni's principle.
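As a hedged, concrete illustration, suppose we scan a year of purchase logs for pairs of customers who bought the same item on the same day, hoping to uncover coordinated behavior. The numbers and the model in the Python sketch below are hypothetical and deliberately simple (each customer makes one purchase, chosen uniformly and independently over items and days), but the check is exactly the one described above: compare the count expected under pure chance with the count of real instances you hope to find.

from math import comb

# Hypothetical scale of the search.
num_customers = 100_000
num_items = 10_000
num_days = 365

# Under the simplistic random model, the chance that two given customers
# happen to buy the same item on the same day is 1 / (items * days).
prob_chance_match = 1 / (num_items * num_days)

# Every pair of customers is a candidate, so the expected number of
# purely coincidental matches is (number of pairs) * (chance per pair).
candidate_pairs = comb(num_customers, 2)
expected_bogus = candidate_pairs * prob_chance_match

expected_real = 20  # how many genuinely coordinated pairs we hope to find

print(f"Expected coincidental pairs: {expected_bogus:,.0f}")  # roughly 1,370
print(f"Expected real pairs:         {expected_real}")
if expected_bogus > expected_real:
    print("Most flagged pairs would be statistical artifacts; tighten the criterion.")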
Big Data and Master Data Management (MDM)
In the big data world, the data itself takes four different forms: data at rest, data in motion, data in many forms, and data in doubt. In addition, three styles of data integration are prevalent in any enterprise scenario: bulk data movement, real-time integration, and federation.
Bulk data integration involves the extraction, transformation, and loading of data from multiple sources into one or more target databases. Key capabilities of bulk integration include extreme performance and parallel processing. Batch windows continue