Data Curation and The Tribal Knowledge Problem - A Simple Introduction to Data Science

disparate, otherwise unrelated business systems within the enterprise, such as customer relationship management (CRM) data, marketing data, e-mail data, and finance data. Other firms develop and offer data-prep tools designed to automate the otherwise laborious “extract, transform, and load” (ETL) chore involved in cleaning data for storage in data warehouses.

On the analytics end of the equation, firms such as Cloudera, Hadapt, and Hortonworks create and distribute software designed to enhance the open-source analytics platform Hadoop and make it even more robust than it is in its native state. Add to this still more firms that develop and distribute tools for exploring vertical and niche data, tools that leverage R and Python, the leading open-source languages used by Data Scientists. (I'll discuss a bit more about Hadoop and R in the next chapter.)

Boston's Doug Levin, the founder of Black Duck Software and CEO of Quant5, sums things up vis-à-vis data curation and its associated tools when he tells us that “enabling data-driven decisions in corporations is one of today's most significant technology trends.” Levin further reminds us that the ability of Data Scientists to manipulate, organize, and study unstructured data is quickly advancing past the previous limits of Big Data analytics into vital new frontiers.

Data curation is the solution to what some call the “Tribal Knowledge Problem.”

What is The Tribal Knowledge Problem?

If you've ever had a business question to which you needed an answer and were referred by someone in the finance department to “go ask Brad in Personnel,” then you've experienced The Tribal Knowledge Problem. Although the solution in this simple scenario is as easy as taking a stroll and questioning Brad, that solution does not scale in the digital space, where Big Data lurks. There, the quest is for the right data hidden among disparate data sources. Curation must bring order to the chaos and fight darkness with light, defining the truly relevant data as narrowly as possible. In short, in the digital space, we need sophisticated software tools and approaches to get us to Brad and solicit his answer.

The traditional top-down approach of master data management (MDM), with its carefully defined categorization, is useless in this scenario because of the great velocity with which data sources and data types change within the Big Data environment.

“While everybody is focused on the problem of how to visualize the data and how to make the compute go faster and how to store the information more efficiently,” says Alation's Satyen Sangani, “we've seen little about the fact that there's just so much more data out there [and therefore] a fundamental information relevance problem. How do you get the [right] data when you need it. How do you sort through it? How do you filter down the data to get what you're actually looking for?” That's the fundamental problem.
