NOTE
Just because HDFS can be run on commodity hardware and manages
redundancy doesn't mean that you should ignore the reliability and
performance of the computers used in an HDFS cluster. Using more
reliable hardware means less time spent replacing broken components.
And, just like any other application, HDFS will benefit from more
computing resources to work with. In particular, NameNodes
(discussed next in the “NameNodes and DataNodes” section) benefit
from reliable hardware and high-performing components.
As already stated, the data being stored in HDFS is spread out and replicated
across multiple machines. This makes the system resilient to the failure of
any individual machine. Depending on the level of redundancy configured,
the system may be able to withstand the loss of multiple machines.
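The relationship between the configured redundancy and the number of machine failures a block can survive can be sketched in a few lines of Python. This is an illustrative model only, not HDFS code: the function names and host names are hypothetical, and the replication factor of 3 is used because it is the usual default for the dfs.replication setting.

```python
def tolerable_failures(replication_factor):
    """Number of replica-holding machines that can fail while one copy survives."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


def block_survives(replica_hosts, failed_hosts):
    """A block stays readable if at least one machine holding a replica is alive."""
    return len(set(replica_hosts) - set(failed_hosts)) > 0


# Hypothetical example: one block replicated on three machines.
replicas = {"node-a", "node-b", "node-c"}
print(tolerable_failures(len(replicas)))               # 2
print(block_survives(replicas, {"node-a", "node-b"}))  # True
print(block_survives(replicas, replicas))              # False
```

The figure returned by tolerable_failures is a worst case for a single block. A cluster may lose more machines than that and keep all of its data, as long as no block has every one of its replicas on the failed machines.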
Another area where HDFS enables support for large data sets is in
computation. Although HDFS does not perform computation directly, it
does support moving the computations closer to the data. In many computer
systems, the data is moved from a server to another computer, which
performs any needed computations. Then the data may be moved back to
the original server or moved to yet another server.
This is a common pattern in applications that leverage a relational database.
Data is retrieved from a database server to a client computer, where the
application logic to update or process the data is applied. Finally, the data is
saved to the database server. This pattern makes sense when you consider
that the data is being stored on a single computer. If all the application logic
were performed on the database server, a single computationally intensive
process could block every other user from performing their work.
Offloading application logic to client computers increases the database
server's capacity to serve data and spreads the computation work across
more machines.
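To get a feel for what moving data from a server to a client costs as the data grows, a rough back-of-the-envelope estimate helps. The sketch below assumes a hypothetical 1 Gbit/s network link that is fully available to the transfer and ignores protocol overhead, so real transfers would be slower. The function name and the link speed are illustrative assumptions.

```python
def transfer_seconds(data_bytes, link_bits_per_second):
    """Idealized time to push data_bytes over a link, with no overhead or contention."""
    return data_bytes * 8 / link_bits_per_second


ONE_GB = 10**9    # bytes
ONE_TB = 10**12   # bytes
GIGABIT = 10**9   # assumed link speed in bits per second

print(f"1 GB: {transfer_seconds(ONE_GB, GIGABIT):.0f} seconds")      # 1 GB: 8 seconds
print(f"1 TB: {transfer_seconds(ONE_TB, GIGABIT) / 3600:.1f} hours")  # 1 TB: 2.2 hours
```

Under these assumptions a gigabyte moves in seconds, while a terabyte ties up the link for hours before any computation can start.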
This approach works well for smaller data sets, but it rapidly breaks down
when you begin dealing with data sets of 1 TB or more. Moving
that much data across the network can introduce a tremendous amount
of latency. In HDFS, though, the data is spread out over many computers.
By moving the computations closer to the data, HDFS avoids the overhead