Storing and Managing Data in HDFS - Microsoft Big Data Solutions

NOTE

Just because HDFS can be run on commodity hardware and manages

redundancy doesn't mean that you should ignore the reliability and

performance of the computers used in an HDFS cluster. Using more

reliable hardware means less time spent replacing broken components.

And, just like any other application, HDFS will benefit from more

computing resources to work with. In particular, NameNodes

(discussed next in the “NameNodes and DataNodes” section) benefit

from reliable hardware and high-performing components.

As already stated, the data being stored in HDFS is spread out and replicated

across multiple machines. This makes the system resilient to the failure of

any individual machine. Depending on the level of redundancy configured,

the system may be able to withstand the loss of multiple machines.
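As a rough illustration of how the configured redundancy relates to the number of machine failures a cluster can absorb, the following sketch places block replicas on nodes and checks, by brute force, how many simultaneous node losses still leave every block readable. The round-robin placement and the function names are assumptions made for this example. Real HDFS placement is rack-aware and more involved.

```python
from itertools import combinations


def place_replicas(num_blocks, num_nodes, replication):
    """Hypothetical round-robin placement: block b is stored on
    nodes b, b+1, ..., b+replication-1 (modulo num_nodes)."""
    return {
        b: {(b + i) % num_nodes for i in range(replication)}
        for b in range(num_blocks)
    }


def survives(placement, failed_nodes):
    """True if every block still has at least one replica on a live node."""
    failed = set(failed_nodes)
    return all(nodes - failed for nodes in placement.values())


def max_tolerated_failures(placement, num_nodes):
    """Largest k such that ANY set of k failed nodes leaves all blocks readable."""
    k = 0
    while k < num_nodes and all(
        survives(placement, combo)
        for combo in combinations(range(num_nodes), k + 1)
    ):
        k += 1
    return k


# Six blocks on six nodes, three replicas each (toy values for illustration).
placement = place_replicas(num_blocks=6, num_nodes=6, replication=3)
print(max_tolerated_failures(placement, num_nodes=6))
```

In this toy layout, each block lives on three distinct nodes, so any two nodes can fail without losing data, while some combinations of three failures remove every copy of a block.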

Another area where HDFS enables support for large data sets is in

computation. Although HDFS does not perform computation directly, it

does support moving the computations closer to the data. In many computer

systems, the data is moved from a server to another computer, which

performs any needed computations. Then the data may be moved back to

the original server or moved to yet another server.

This is a common pattern in applications that leverage a relational database.

Data is retrieved from a database server to a client computer, where the

application logic to update or process the data is applied. Finally, the data is

saved to the database server. This pattern makes sense when you consider

that the data is being stored on a single computer. If all the application logic

were performed on the database server, a single computationally intensive

process could block any other user from performing his or her work.

Offloading application logic to client computers increases the database

server's capacity to serve data and spreads the computation work across

more machines.

This approach works well for smaller data sets, but it rapidly breaks down

when you begin dealing with data sets in the 1 TB and up range. Moving

that much data across the network can introduce a tremendous amount

of latency. In HDFS, though, the data is spread out over many computers.

By moving the computations closer to the data, HDFS avoids the overhead
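To put rough numbers on the cost of moving data to the computation, here is a back-of-the-envelope sketch. The link speed, node count, and per-disk read rate are assumed values chosen only for illustration, and the functions ignore protocol overhead, contention, and job scheduling.

```python
def transfer_seconds(data_bytes, link_bits_per_second):
    """Idealized time to push data over a single network link."""
    return data_bytes * 8 / link_bits_per_second


def parallel_read_seconds(data_bytes, num_nodes, disk_bytes_per_second):
    """Idealized time when each node reads only its local share from disk."""
    return data_bytes / (num_nodes * disk_bytes_per_second)


ONE_TB = 10**12          # 1 TB expressed in bytes (decimal terabyte)
GIGABIT_LINK = 10**9     # assumed 1 Gbit/s network link
DISK_RATE = 100 * 10**6  # assumed 100 MB/s sequential read per node

network = transfer_seconds(ONE_TB, GIGABIT_LINK)
local = parallel_read_seconds(ONE_TB, num_nodes=20, disk_bytes_per_second=DISK_RATE)

print(f"Ship 1 TB over one link: {network / 3600:.1f} hours")
print(f"Read 1 TB locally on 20 nodes: {local / 60:.1f} minutes")
```

Under these assumptions, shipping the full data set over one link takes about 2.2 hours, while 20 nodes each scanning their local share finish in roughly 8.3 minutes. The exact figures matter less than the gap between them.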
