disparate, otherwise unrelated business systems within the enterprise - such as customer
relationship management (aka, CRM) data, marketing data, e-mail data, and finance data.
Other firms develop and offer data-prep tools designed to automate the otherwise laborious
“extract, transform, and load” (aka, ETL) chore involved in cleaning data for storage
in data warehouses.
On the analytics end of the equation, firms such as Cloudera, Hadapt, and Hortonworks
create and distribute software designed to enhance the open-source analytics platform Hadoop
and make it even more robust than it is in its native state. Add to this even more firms
developing and distributing tools that leverage R and Python, the leading open-source
languages used by Data Scientists, to explore vertical and niche data. (I'll discuss a bit
more about Hadoop and R in the next chapter.)
Boston's Doug Levin, the founder of Black Duck Software and CEO of Quant5, sums
things up vis-a-vis data curation and the tools associated with that curation when he tells
us “enabling data-driven decisions in corporations is one of today's most significant
technology trends.” Levin further reminds us that the ability of Data Scientists to manipulate,
organize, and study unstructured data is quickly advancing past the previous limits of Big
Data analytics into vital new frontiers.
Data curation is the solution to what some call the “Tribal Knowledge Problem.”
What is The Tribal Knowledge Problem?
If you've ever had a business question to which you needed an answer and were
referred by someone in the finance department to “go ask Brad in Personnel,” then you've
experienced The Tribal Knowledge Problem. Although the solution to the problem in this
simple scenario is as easy as taking a stroll and questioning Brad, that solution does not
scale in the digital space where lurks Big Data. The quest is for the right data hidden behind
disparate data sources. Curation must bring order to the chaos and fight darkness with light,
defining the truly relevant data as narrowly as possible. In short, in the digital space, we
need sophisticated software tools and approaches to get us to Brad and solicit his answer.
The traditional master data management (aka, MDM) top-down approach of carefully
defined categorization is useless in this scenario because of the great velocity with which
data sources and data types change within the Big Data environment.
“While everybody is focused on the problem of how to visualize the data and how to
make the compute go faster and how to store the information more efficiently,” says
Alation's Satyen Sangani, “we've seen little about the fact that there's just so much more data
out there [and therefore] a fundamental information relevance problem. How do you get
the [right] data when you need it? How do you sort through it? How do you filter down the
data to get what you're actually looking for?” That's the fundamental problem.