one of those massive warehouse clubs, the ones that sell wholesale-sized pallets of toilet paper and ketchup by the kilogram? One of the things that has always amazed me about these huge stores is the number of checkout lines available to handle the flow of customers. Thousands and thousands of shoppers might be purchasing items each hour, all day long. On any given weekend, there might be twenty or more checkout lines open, each with dozens of customers waiting in line.
The checkout lines in a warehouse club are built to handle volume; the staff running the registers are not there to help you find items. Unlike the corner store clerk, the register staff have a specialized job: helping the huge number of customers check out quickly.
There are other specialized tasks to be performed in our warehouse club. To move pallets of two-liter maple syrup bottles to the sales floor, some employees must specialize in driving forklifts. Other employees are there simply to provide information to shoppers.
Now imagine that, as you pay for your extra-large pallet of liquid detergent, you ask the person at the checkout counter how many customers pass through all the checkout lines over the course of an entire day. It would be difficult for them to give you a real answer. Although the person at the register might easily keep a running tally of the customers making their way through a single line, it would be difficult for them to know what is going on at the other checkout lines. The individuals at each register don't normally communicate with each other very much; they are too busy with their own customers. Instead, we would need to deploy a different member of the staff whose job is to go from register to register and aggregate the individual customer counts.
The Right Tool for the Job
The point here is that as customer volume grows to orders of magnitude beyond what a small convenience store is used to, it becomes necessary to build specialized solutions. Massive-scale data problems work like this too. We can solve data challenges by distributing problems across many machines and using specialized software to solve discrete problems along the way. A data pipeline is what facilitates the movement of our data from one state of utility to another.
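The register-count story maps directly onto this pattern. Here is a minimal sketch in Python: each "register" tallies only its own customers in parallel, and a separate aggregation step sums the partial counts, just like the staff member who walks from register to register. The data and the worker pool are hypothetical, purely for illustration; in a real cluster the workers would be separate machines rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-register logs: each register sees only its own
# customers, just as each cashier tracks only their own line.
register_logs = [
    ["alice", "bob", "carol"],             # register 1
    ["dave", "erin"],                      # register 2
    ["frank", "grace", "heidi", "ivan"],   # register 3
]

def count_customers(log):
    """Local work: a single register tallies its own customers."""
    return len(log)

# The per-register tallies run independently and in parallel; a
# separate aggregation step (the roving staff member) combines them.
with ThreadPoolExecutor() as pool:
    per_register = list(pool.map(count_customers, register_logs))

total = sum(per_register)
print(per_register, total)  # [3, 2, 4] 9
```

Distributed data frameworks generalize exactly this split: cheap local work on each partition, followed by a small aggregation step that never needs to see the raw data.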
Traditionally, developers depended on single-node databases to do everything. A single machine would be used to collect data, store it permanently, and run queries when we needed to ask a question. As data sizes grow, it becomes impossible to economically scale a single machine to meet the demand. The only practical solution is to distribute our needs across a collection of machines networked together in a cluster.
Collecting, processing, and analyzing large amounts of data sometimes requires using a variety of disparate technologies. For example, software specialized for efficient data collection may not be optimized for data analysis. This is a lot like the story of the warehouse club versus the tiny convenience store. The optimal technology necessary to ask huge, aggregate questions about massive datasets may be different from the software used to ensure that data can be collected rapidly from thousands of users.