Often the problem is more political than technical; overcoming the inability of admins across different departments to break down data silos can be the true challenge.
Collecting massive amounts of data in itself doesn't provide any magic value to your
organization. The real value in data comes from understanding pain points in your
business, asking practical questions, and using the answers and insights gleaned to support decision making.
Anatomy of a Big Data Pipeline
In practice, a data pipeline requires the coordination of a collection of different technologies for different parts of a data lifecycle.
Let's explore a real-world example, a common use case: collecting and analyzing data from a Web-based application that aggregates data from many users. In order for this type of application to handle data input from thousands or even millions of users at a time, it must be highly available. Whatever database is used, the primary design goal of the data collection layer is that it can handle this input without becoming slow or unresponsive. In this case, a nonrelational key-value or document store, examples of which include Redis, MongoDB, Amazon's DynamoDB, and Google Cloud Datastore, might be the best solution.
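As an illustration, the write path of such a collection layer might look like the following Python sketch. It assumes a Redis instance reachable on localhost and the redis-py client; the key layout and event fields are hypothetical, not taken from any particular application.

# A minimal write-optimized collection layer: append each incoming event
# to a per-user Redis list. Assumes Redis on localhost:6379 and redis-py.
import json
import time
import redis

client = redis.Redis(host="localhost", port=6379)

def record_event(user_id, event_type, payload):
    """Append one user event to a list keyed by user ID."""
    event = {
        "user": user_id,
        "type": event_type,
        "payload": payload,
        "ts": time.time(),
    }
    # RPUSH is a cheap append, so the layer stays responsive even when
    # thousands of users are writing at once.
    client.rpush("events:%s" % user_id, json.dumps(event))

record_event("user-42", "page_view", {"path": "/home"})

The design choice here is simply to make every write an inexpensive append and to defer any heavier processing to later stages of the pipeline.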
Although this data is constantly streaming in and always being updated, it's useful
to have a cache, or a source of truth. This cache may be less performant, and perhaps only needs to be updated at intervals, but it should provide consistent data when required. This layer could also be used to provide data snapshots in formats that are interoperable with other data software or visualization systems. This caching layer might be flat files in a scalable, cloud-based storage solution, or it could be a relational database backend. In some cases, developers build the collection layer and the cache from the same software; in other cases, this layer is a hybrid of relational and nonrelational database management systems.
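One way to picture this caching layer is as a periodic snapshot job. The sketch below reuses the hypothetical Redis key layout from the previous example and dumps each event list to a newline-delimited JSON flat file, which could then be pushed to cloud storage or bulk-loaded into a relational backend.

# Periodic snapshot: flush every events:* list to a flat file that other
# tools (cloud storage, a relational database, a visualization system)
# can consume. Key names follow the hypothetical layout used above.
import datetime
import redis

client = redis.Redis(host="localhost", port=6379)

def snapshot_events(output_dir="."):
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")
    path = "%s/events-%s.ndjson" % (output_dir, stamp)
    with open(path, "w") as out:
        # scan_iter walks the keyspace incrementally instead of blocking
        # the server the way KEYS would.
        for key in client.scan_iter(match="events:*"):
            for raw in client.lrange(key, 0, -1):
                # Values were stored as JSON strings, so write them as-is.
                out.write(raw.decode("utf-8") + "\n")
    return path

print(snapshot_events())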
Finally, in an application like this, it's important to provide a mechanism for asking aggregate questions about the data. Software that provides quick, near-real-time analysis of huge amounts of data is often designed very differently from the databases built to collect data from thousands of users over a network.
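As a toy illustration of the kind of question this layer answers, the sketch below counts events by type from one of the hypothetical snapshot files produced above; an analytical system does the equivalent over vastly more data, usually in parallel across many machines.

# Aggregate query over a snapshot: how many events of each type were
# collected? The file name is hypothetical; use the path returned by the
# snapshot job.
import json
from collections import Counter

def count_events_by_type(snapshot_path):
    counts = Counter()
    with open(snapshot_path) as f:
        for line in f:
            event = json.loads(line)
            counts[event["type"]] += 1
    return counts

print(count_events_by_type("events-20140101T000000.ndjson"))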
Between these stages in the data pipeline, data may also need to be transformed. For example, data collected from a Web frontend may need
to be converted into XML files in order to be interoperable with another piece of
software. Or this data may need to be transformed into JSON or a data serialization
format, such as Thrift, to make moving the data as efficient as possible. In large-scale
data systems, transformations are often too slow to take place on a single machine. As
in the case of scalable database software, transformations are often best implemented
using distributed computing frameworks, such as Hadoop.
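To make that last point concrete, a transformation written for a framework such as Hadoop Streaming is usually just a small program that reads records on standard input and emits key/value pairs on standard output, so the framework can run many copies of it in parallel across a cluster. A minimal mapper sketch, assuming the newline-delimited JSON snapshot format used above:

#!/usr/bin/env python
# Hadoop Streaming-style mapper: read newline-delimited JSON events on
# stdin and emit tab-separated (event_type, 1) pairs on stdout for a
# reducer to sum.
import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        event = json.loads(line)
    except ValueError:
        continue  # skip malformed records rather than failing the job
    print("%s\t%d" % (event.get("type", "unknown"), 1))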
In the Era of Big Data Trade-Offs, building a data lifecycle that can scale to massive amounts of data requires specialized software for different parts of the pipeline.
 
 