less common than node failures, so replicating across fewer racks doesn't
have an appreciable impact on availability.
NOTE
The replica placement approach is subject to change, as the HDFS
developers consider it a work in progress. As they learn more about
usage patterns, they plan to update the policies to deliver the optimal
balance of performance and availability.
HDFS monitors the replication levels of files to ensure that the replication factor
is being met. If a computer hosting a DataNode crashes, or a network
rack is taken offline, the NameNode flags the absence of heartbeat
messages. If the nodes are offline for too long, the NameNode stops
forwarding requests to them, and it also checks the replication factors of any
data blocks associated with those nodes. If the replication factor has fallen
below the threshold set when the file was created, the NameNode initiates
replication of those blocks again.
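As a rough illustration (using standard Hadoop shell commands; exact options and output formats vary between Hadoop versions), you can check replication health and adjust a file's replication factor from the command line. The path /example/data.txt is only a placeholder:

hadoop fsck / -files -blocks
hadoop fs -setrep -w 3 /example/data.txt

The first command reports any under-replicated or missing blocks across the file system; the second sets the replication factor of a single file to 3 and waits until the new replicas have been created.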
Using Common Commands to Interact with HDFS
This section discusses interacting with HDFS. Even though HDFS is a
distributed file system, you can interact with it in much the same way as you
do with a traditional file system. However, this section covers some key
differences. The command examples in the following sections work with
the Hortonworks Data Platform environment set up in Chapter 3, “Installing
HDInsight.”
Interfaces for Working with HDFS
By default, HDFS includes two mechanisms for working with it. The primary
way to interact with it is by the use of a command-line interface. For status
checks, reporting, and browsing the file system, there is also a web-based
interface.
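For example, assuming a default configuration (the port is configurable and differs between Hadoop releases), the NameNode exposes its web interface on port 50070, so a URL such as the following opens the status and file-browsing pages:

http://<namenode-host>:50070/

Here <namenode-host> is a placeholder for the actual host name of your NameNode.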
The hadoop command is a script that can run several modules of the Hadoop system.
The two modules used for HDFS are dfs (also known as FsShell)
and dfsadmin. The dfs module is used for most common file operations,