Strategies for Dealing with Data Silos - Data Just Right: Introduction to Large-Scale Data and Analytics


just anyone in your company should be able to peruse the private financial or medical records of customers. What it does mean is that the people in your organization are given access to tools that help them find out the answers to questions quickly and are expected to back up their ideas with metrics when appropriate. It also means sharing organizational data freely to help inform, inspire, and empower employees to come up with their own innovative ideas for solving problems.

Invest in Technology That Bridges Data Silos

Embracing the concept that data silos are actually beneficial allows system architects to rethink their approaches. If the data warehouse is seen less as a be-all and end-all solution to data analysis, and distributed computing tools such as Hadoop are seen as useful for processing tasks, administrators can focus on investing in technologies that bridge the gaps between these systems.

Visualization tools, such as Tableau and QlikView, are beginning to provide access not only to traditional relational database systems through standard ODBC drivers but also to new data tools such as Google's BigQuery and Cloudera's Impala. Similarly, users of business productivity tools, such as Microsoft Excel, should be able to run queries by using connectors to underlying data warehousing software.

Much of the technology and jargon of traditional data warehousing was developed before the wide-scale adoption of the Internet. Large, single-machine data warehouse appliances are common in the enterprise market and often bring with them large price tags and expensive support contracts. Building a distributed data processing system on a cluster of machines using open-source technologies such as Hadoop provides a different type of challenge, requiring expertise and infrastructure maintenance. Essentially, dealing with either of these system designs requires specialized training and involves trade-offs.

In practice, data warehousing and distributed computing technologies such as Hadoop have overlapping use cases. For example, creating a MapReduce workflow can often be a more performant way to solve a complicated ETL transformation step when moving data from a customer database to a data warehouse. There's been a gradual movement, from both commercial and open-source projects, towards combining aspects of popular distributed data projects with features found in data warehouses and analytical databases. For example, the Spark project, an open-source distributed computing system, is designed to be a very fast in-memory analytics platform. One of the most interesting projects built with Spark is Shark, a data warehouse application that is compatible with Hadoop's Hive. As a result of this combination, Spark and Shark together provide both a warehousing capability and fast analytics capabilities. Traditional data warehousing products are also getting into the act, with industry stalwarts such as Oracle and SAP incorporating Hadoop into their offerings.
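
To make the idea of a MapReduce-style ETL step more concrete, here is a minimal single-process sketch in Python. It is not Hadoop code; it only imitates the map, shuffle, and reduce phases in memory. The input format (customer ID, country, amount), the sample rows, and every function name are assumptions invented for illustration. A real workflow would read from the customer database and run the same logic across a cluster.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical rows exported from a customer database:
# "customer_id,country,amount". These values are made up for illustration.
RAW_ROWS = [
    "1001,us,19.99",
    "1002,DE,5.00",
    "1001,US,30.01",
    "bad row",
    "1003,de,12.50",
]


def map_row(line):
    """Map step: clean one raw row and emit (country, amount_in_cents) pairs."""
    parts = line.split(",")
    if len(parts) != 3:
        return []  # drop malformed rows instead of failing the whole job
    _customer_id, country, amount = parts
    try:
        cents = round(float(amount) * 100)
    except ValueError:
        return []  # drop rows whose amount is not numeric
    return [(country.strip().upper(), cents)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(country, cents_values):
    """Reduce step: collapse one group into a warehouse-ready summary row."""
    return {
        "country": country,
        "orders": len(cents_values),
        "revenue_cents": sum(cents_values),
    }


def run_etl(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of raw rows."""
    mapped = chain.from_iterable(map_row(line) for line in lines)
    grouped = shuffle(mapped)
    return [reduce_group(key, grouped[key]) for key in sorted(grouped)]


WAREHOUSE_ROWS = run_etl(RAW_ROWS)
```

Because each call to map_row depends only on its own input line, and each call to reduce_group depends only on one key's values, both phases could in principle be spread across many machines. That independence is what makes this shape of transformation a natural fit for a framework such as Hadoop.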
