Chapter 9
Data Scientist
The realm of big data analytics is vastly different from transaction
processing applications and BI applications; here, one discovers and
answers questions in areas where we don't know what we don't know.
The skills required to do these kinds of activities are unique and certainly
multi-faceted.
On a general level we can define data as having three important characteristics:
composition, context, and condition. Composition refers to the structure of the data:
what is the source, what is the granularity, what are the data types, what is the nature
of the data (mostly static data or real time streaming data), etc. Context refers to how it
was generated, what events are associated with the data, how sensitive the data is, etc.
Condition refers to the state of the data: whether it can be used as-is for analysis or
whether it needs further cleansing and enrichment.
Let's apply these characteristics to small data and big data. Small data consists of
mostly known data sources that are not expected to undergo changes in composition and
context over a given period of time. Since there is a fair amount of certainty regarding
small data, we use it to solve specific problems through straightforward applications
(transaction processing applications, BI reporting, etc.). In essence, small data is limited
to answering questions about what we know we don't know. Big data, on the other hand,
represents multiple and unknown data sets. These data sets continuously exhibit changes
in composition, context, and condition. Thus big data signifies complexity: we don't
know what we don't know!
The biggest problem at hand is how to derive value from big data and
how to measure the amount of knowledge contained in the data.
A measure of the amount of knowledge contained in data can possibly
be defined as the number of insights one can generate by exploiting the
full range of possible values (combinations and/or permutations) contained
within the attributes of the data set. The relative knowledge contained
within two variables (A and B), for example, can be assessed by looking
at A alone, then B alone, and then A and B together, for a total of three scenarios.
Three variables (A, B, and C) give us a knowledge state space of seven.
Four variables result in 15, and so on. In general, n variables yield
2^n - 1 scenarios, one for each non-empty subset of the variables.
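
The counts quoted above (3, 7, 15) can be checked with a short sketch. It assumes that a "scenario" means a non-empty combination of variables in which order does not matter; the function name knowledge_state_space is a hypothetical helper introduced here for illustration, not something taken from the text.

```python
from itertools import combinations


def knowledge_state_space(variables):
    """Return every non-empty combination of the given variables.

    Assumption: a scenario is an unordered, non-empty subset of the
    variables, so n variables produce 2**n - 1 scenarios.
    """
    scenarios = []
    for size in range(1, len(variables) + 1):
        scenarios.extend(combinations(variables, size))
    return scenarios


for names in (["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]):
    scenarios = knowledge_state_space(names)
    # The enumerated count should agree with the closed form 2**n - 1.
    assert len(scenarios) == 2 ** len(names) - 1
    print(len(names), "variables ->", len(scenarios), "scenarios")
```

If ordering did matter (permutations rather than combinations), the state space would grow even faster, which is one way to read the passage's point that the number of possible insights quickly outpaces the number of attributes.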
 