provides the ability to search the entire Internet, purchase any product from any seller
anywhere in the world, or provide social networking services for anyone on the planet
with access to the Internet. The massive scale of the World Wide Web, as well as the
constantly accelerating growth in the total number of Internet users, presented an
almost impossible task for software engineers: finding solutions that potentially could
be scaled to the needs of every human being to collect, store, and process the world's
data.
Traditional data analysis software, such as spreadsheets and relational databases, as
reliable and widespread as it has been, was generally designed to be used on a single
machine. To scale these systems to unprecedented size, computer scientists needed to
build software that could run on clusters of machines.
The Big Data Trade-Off
Because of the incredible task of dealing with the data needs of the World Wide
Web and its users, Internet companies and research organizations realized that a new
approach to collecting and analyzing data was necessary. Since off-the-shelf, commodity
computer hardware was getting cheaper every day, it made sense to think about
distributing database software across many readily available servers built from
commodity parts. Data processing and information retrieval could be farmed out to a
collection of smaller computers linked together over a network. This type of computing
model is generally referred to as distributed computing. In many cases, deploying
a large number of small, cheap servers in a distributed computing system can be more
economically feasible than buying a custom-built single machine with the same
computation capabilities.
While the hardware model for tackling massive scale data problems was being
developed, database software started to evolve as well. The relational database model,
for all of its benefits, runs into limitations that make it challenging to deploy in a
distributed computing network. First of all, sharding a relational database across
multiple machines can often be a nontrivial exercise. Because of the need to coordinate
between the various machines in a cluster, maintaining a state of data consistency at
any given moment can become tricky. Furthermore, most relational databases are designed
to guarantee data consistency; in a distributed network, where that guarantee requires
constant coordination among machines, this design can become a liability.
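To make the sharding difficulty concrete, here is a minimal sketch of hash-based key partitioning, the simplest way to split records across machines. All names below are hypothetical illustrations, not the API of any particular database:

```python
import hashlib

def shard_for_key(key: str, num_shards: int) -> int:
    """Map a record key to one of num_shards servers via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Every node that applies this function independently routes the same
# key to the same shard, with no central coordinator required.
shard = shard_for_key("user:1001", num_shards=4)
```

The routing itself is trivial; the nontrivial part the text alludes to is everything around it, such as redistributing data when the number of shards changes and keeping any operation that touches multiple shards consistent.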
Software designers began to make trade-offs to accommodate the advantages of
using distributed networks to address the scale of the data coming from the Internet.
Perhaps the overall rock-solid consistency of the relational database model was less
important than making sure there was always a machine in the cluster available to
process a small bit of data; the system could always bring the data back into a
consistent state eventually. Does the data actually have to be indexed? Why use a fixed
schema at all? Maybe databases could simply store individual records, each with a
different schema, and possibly with redundant data.
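The idea of storing individual records without a fixed schema can be sketched as a simple key-value document store. This is a hypothetical illustration of the concept, not the interface of any specific system:

```python
# Each record is a free-form document; no fixed schema is enforced.
store = {}

def put(record_id, record):
    """Store a record under an ID, whatever fields it happens to have."""
    store[record_id] = record

put("u1", {"name": "Ada", "email": "ada@example.com"})
put("u2", {"name": "Grace", "last_login": "2013-04-01", "tags": ["admin"]})
# Both records coexist even though their fields differ.
```

A relational table would reject the second record for having columns the first lacks; a schema-per-record store simply accepts both, at the cost of pushing schema interpretation onto whoever reads the data later.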
 
 