Often the problem is more political than technical; overcoming the inability of admins across different departments to break down data silos can be the true challenge.
Collecting massive amounts of data in itself doesn't provide any magic value to your
organization. The real value in data comes from understanding pain points in your
business, asking practical questions, and using the answers and insights gleaned to support decision making.
Anatomy of a Big Data Pipeline
In practice, a data pipeline requires the coordination of a collection of different technologies for different parts of a data lifecycle.
Let's explore a real-world example, a common use case: collecting and analyzing data from a Web-based application that aggregates data from many users. In order for this type of application to handle data input from thousands or even millions of users at a time, it must be highly available. Whatever database is used, the primary design goal of the data collection layer is that it can handle this input without becoming slow or unresponsive. In this case, a nonrelational key-value or document store, examples of which include Redis, MongoDB, Amazon's DynamoDB, and Google Cloud Datastore, might be the best solution.
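As an illustration, the write path of such a collection layer might look like the following Python sketch. It assumes a Redis instance reachable on localhost and the redis-py client; the key layout and event fields are hypothetical, not taken from any particular application.

# A minimal write-optimized collection layer: append each incoming event
# to a per-user Redis list. Assumes Redis on localhost:6379 and redis-py.
import json
import time
import redis

client = redis.Redis(host="localhost", port=6379)

def record_event(user_id, event_type, payload):
    """Append one user event to a list keyed by user ID."""
    event = {
        "user": user_id,
        "type": event_type,
        "payload": payload,
        "ts": time.time(),
    }
    # RPUSH is a cheap append, so the layer stays responsive even when
    # thousands of users are writing at once.
    client.rpush("events:%s" % user_id, json.dumps(event))

record_event("user-42", "page_view", {"path": "/home"})

The design choice here is simply to make every write an inexpensive append and to defer any heavier processing to later stages of the pipeline.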
Although this data is constantly streaming in and always being updated, it's useful
to have a cache, or a source of truth. This cache may be less performant, and perhaps only needs to be updated at intervals, but it should provide consistent data when required. This layer could also be used to provide data snapshots in formats that are interoperable with other data software or visualization systems. This caching layer might be flat files in a scalable, cloud-based storage solution, or it could be a relational database backend. In some cases, developers build the collection layer and the cache from the same software; in other cases, this layer is a hybrid of relational and nonrelational database management systems.
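One way to picture this caching layer is as a periodic snapshot job. The sketch below reuses the hypothetical Redis key layout from the previous example and dumps each event list to a newline-delimited JSON flat file, which could then be pushed to cloud storage or bulk-loaded into a relational backend.

# Periodic snapshot: flush every events:* list to a flat file that other
# tools (cloud storage, a relational database, a visualization system)
# can consume. Key names follow the hypothetical layout used above.
import datetime
import redis

client = redis.Redis(host="localhost", port=6379)

def snapshot_events(output_dir="."):
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")
    path = "%s/events-%s.ndjson" % (output_dir, stamp)
    with open(path, "w") as out:
        # scan_iter walks the keyspace incrementally instead of blocking
        # the server the way KEYS would.
        for key in client.scan_iter(match="events:*"):
            for raw in client.lrange(key, 0, -1):
                # Values were stored as JSON strings, so write them as-is.
                out.write(raw.decode("utf-8") + "\n")
    return path

print(snapshot_events())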
Finally, in an application like this, it's important to provide a mechanism for asking aggregate questions about the data. Software that provides quick, near-real-time analysis of huge amounts of data is often designed very differently from the databases built to collect data from thousands of users over a network.
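As a toy illustration of the kind of question this layer answers, the sketch below counts events by type from one of the hypothetical snapshot files produced above; an analytical system does the equivalent over vastly more data, usually in parallel across many machines.

# Aggregate query over a snapshot: how many events of each type were
# collected? The file name is hypothetical; use the path returned by the
# snapshot job.
import json
from collections import Counter

def count_events_by_type(snapshot_path):
    counts = Counter()
    with open(snapshot_path) as f:
        for line in f:
            event = json.loads(line)
            counts[event["type"]] += 1
    return counts

print(count_events_by_type("events-20140101T000000.ndjson"))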
Between these stages in the data pipeline, data may also need to be transformed. For example, data collected from a Web frontend may need
to be converted into XML files in order to be interoperable with another piece of
software. Or this data may need to be transformed into JSON or a data serialization
format, such as Thrift, to make moving the data as efficient as possible. In large-scale
data systems, transformations are often too slow to take place on a single machine. As
in the case of scalable database software, transformations are often best implemented
using distributed computing frameworks, such as Hadoop.
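To make that last point concrete, a transformation written for a framework such as Hadoop Streaming is usually just a small program that reads records on standard input and emits key/value pairs on standard output, so the framework can run many copies of it in parallel across a cluster. A minimal mapper sketch, assuming the newline-delimited JSON snapshot format used above:

#!/usr/bin/env python
# Hadoop Streaming-style mapper: read newline-delimited JSON events on
# stdin and emit tab-separated (event_type, 1) pairs on stdout for a
# reducer to sum.
import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        event = json.loads(line)
    except ValueError:
        continue  # skip malformed records rather than failing the job
    print("%s\t%d" % (event.get("type", "unknown"), 1))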
In the Era of Big Data Trade-Offs, building a data lifecycle that can scale to massive amounts of data requires specialized software for different parts of the pipeline.
 
 