6
Data Curation and the Tribal Knowledge Problem
“Data, I think, is one of the most powerful mechanisms for telling stories. I take a huge
pile of data and I try to get it to tell stories.”
- Steven Levitt, economist
“Not everything that can be counted counts, and not everything that counts can be
counted.”
- Albert Einstein
The phrase Big Data encompasses both the opportunity and the problem confronting
Data Scientists. That the data is so vast means great, never-before-possible opportunities for
harvesting actionable BI. The problem, meanwhile, remains the very vastness of this sea of
data, and exactly how to structure and curate this enormous chaos of unstructured data in a
way which yields the most valuable actionable BI.
F. Scott Fitzgerald said of his novels that it took only a few seconds to conceive them,
months to write [curate] them, and his “entire life” to create the memories and experiences
which comprised his own vast sea of unstructured data - combining disparate emotions,
memories, and wisdom to create a new, vibrant revelation in story form.
Andy Palmer, co-founder of the firms Vertica Systems and Tamr, insists Data Science
is “really not about Big Data. It's about the most useful data.” Vertica and Tamr, along with
many other companies, focus on delivering to enterprises various technologies with which
they can quickly discover the so-called “dark data”: data that is often hidden, yet is the most
relevant and of a quality which can answer what Palmer calls “compelling questions.” This task is
called data curation - also sometimes referred to in the industry as data wrangling, plumbing,
“munging,” and janitor work.
Firms such as Vertica Systems, Tamr, and other startups develop and offer a range of
software tools aimed at helping Data Scientists curate unstructured data economically and
productively. Some firms focus on software designed to “synchronize” data feeding in from