Strategies for Dealing with Data Silos - Data Just Right: Introduction to Large-Scale Data and Analytics


problem—one in which a disparity between two physical information stores blocked

easy interoperability between the data source and the data processing system. Luhn had

imagined that a growth in the adoption of analog to digital bridge technology, such as

typewriters with simultaneous ticker tape printout, could eliminate this disparity.

Although the dream of a truly paperless office has been a common meme throughout

the past few decades (paper use actually doubled between 1980 and 2000 [3]), casual

business information is rapidly being generated digitally. Unstructured but useful

sources of information are becoming the norm as consumer and business platforms

converge. Emails, customer reviews, tweets, and user groups can be both valuable

sources of “business information” and a nightmare to search, store, and query. Some

business applications are mirroring features found in social media, enabling employees

to stream posts of data to company-wide social media platforms. As a result, data

becomes more fractured, more unstructured, and simply more, period.

The Problem in Practice

Let's look at an example of a typical data silo challenge. Customers generate data of

all kinds, and this generated data is difficult to control. A customer might report her

location as “California” in one transaction and use the abbreviation “CA” in another.

Customers also generate support questions, post messages on your company's Facebook

page, send emails, and typically do everything in their power to ensure that whatever

ideal data model you want to conform to will be disregarded. Dealing with customer

data can be difficult enough, but what about all the other data required to understand

your business? This might include data from your product inventory, human resources,

advertising, finance, and any number of applications crucial to business decisions.

In order to make any sense of this data, it must often be cleaned, or transformed

into a more normalized form. Erroneous data must be corrected or discarded, and

dates must all be converted into the same format. More importantly, if data sets are

to be joined in any meaningful way, common keys must be available. In other words,

a user's ID must be the exact same value in the purchase database as it is in your customer

support logs.
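
As a rough illustration of the cleaning steps described above, the following Python sketch normalizes a state field, converts dates to a single format, discards records it cannot repair, and joins two record sets on a shared user ID. The field names, the sample records, the STATE_ABBREVIATIONS table, and the list of accepted date formats are all hypothetical; real data would need a far more complete mapping and more careful error handling.

```python
from datetime import datetime

# Hypothetical lookup table; a real one would cover every state and common misspellings.
STATE_ABBREVIATIONS = {"california": "CA", "ca": "CA", "new york": "NY", "ny": "NY"}

# Assumed date formats present in the raw data, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_state(raw):
    """Return a two-letter code, or None if the value is not recognized."""
    return STATE_ABBREVIATIONS.get(raw.strip().lower())


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_purchase(record):
    """Clean one purchase record; return None if it should be discarded."""
    state = normalize_state(record["state"])
    date = normalize_date(record["date"])
    if state is None or date is None:
        return None
    return {"user_id": record["user_id"].strip(), "state": state, "date": date}


def join_on_user_id(purchases, tickets):
    """Pair each purchase with every support ticket that shares its user_id."""
    tickets_by_user = {}
    for ticket in tickets:
        tickets_by_user.setdefault(ticket["user_id"], []).append(ticket)
    return [
        {**purchase, "ticket": ticket["subject"]}
        for purchase in purchases
        for ticket in tickets_by_user.get(purchase["user_id"], [])
    ]


raw_purchases = [
    {"user_id": "u1", "state": "California", "date": "03/14/2012"},
    {"user_id": "u1 ", "state": "CA", "date": "2012-03-15"},  # stray space in the key
    {"user_id": "u2", "state": "Calif0rnia", "date": "2012-03-16"},  # erroneous, discarded
]
support_tickets = [{"user_id": "u1", "subject": "Late delivery"}]

cleaned = [c for c in map(clean_purchase, raw_purchases) if c is not None]
joined = join_on_user_id(cleaned, support_tickets)
print(joined)
```

Note that the join only works because the stray space in the second record's user ID is stripped during cleaning; without that step the key would not match the one in the support tickets.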

Once this data is processed, it must be stored in a way that enables users to ask

questions about their data. Query results from this data can then be brought into visualization

tools or moved into spreadsheets for further analysis. All modern organizations,

big or small, deal with data in some way, and each of these steps can become

daunting if data sizes are large and data sources are disparate.
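
A minimal sketch of the store, query, and export steps, assuming a small data set that fits in a single SQLite database: the table name, columns, and sample rows below are hypothetical, and an in-memory database stands in for whatever storage system an organization actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical cleaned rows: (user_id, state, purchase_date, amount).
rows = [
    ("u1", "CA", "2012-03-14", 25.0),
    ("u1", "CA", "2012-03-15", 10.0),
    ("u2", "NY", "2012-03-16", 40.0),
]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE purchases (user_id TEXT, state TEXT, purchase_date TEXT, amount REAL)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

# Ask a question of the data: total purchase amount per state.
totals = conn.execute(
    "SELECT state, SUM(amount) FROM purchases GROUP BY state ORDER BY state"
).fetchall()
conn.close()

# Write the result as CSV text that a spreadsheet could open.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["state", "total_amount"])
writer.writerows(totals)
csv_text = buffer.getvalue()
print(csv_text)
```

This approach is only reasonable while the data fits comfortably on one machine; the difficulty described above begins when it does not.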

I once worked at a small nonprofit, and we faced the same types of data issues as

any large corporation. Our donor database was implemented using a relational database

hosted on a single machine. We also had a Web-based system for online donations,

which collected names and addresses, and this information was stored in a separate

relational database. We made extra money by selling books and CDs, the inventory

3. www.economist.com/node/12381449
