• To enable large-scale data sets to be stored and processed
• To provide reliable data storage and be fault tolerant
• To provide support for moving the computation to the data
HDFS addresses these goals through a number of architectural
features. The goals also impose constraints that HDFS had to address.
(The next section breaks down these features and shows how they meet the
goals and constraints of a distributed file system designed for large
data sets.)
HDFS is implemented in the Java language, which makes it highly
portable because virtually all modern operating systems support Java.
Communication in HDFS is handled over Transmission Control Protocol/
Internet Protocol (TCP/IP), which is just as widely supported and adds
to that portability.
In HDFS terminology, an installation is usually referred to as a cluster,
and a cluster is made up of individual nodes. A node is a single computer
that participates in the cluster, so a reference to a cluster encompasses
all the individual computers that participate in it.
HDFS runs beside the local file system. Each computer still has a standard
file system available on it. For Linux, that may be ext4, and for a Windows
server, it's usually going to be New Technology File System (NTFS). HDFS
stores its data as files in the local file system, but not in a form that
you can interact with directly. This is similar to the way SQL Server
or other relational database management systems (RDBMSs) use physical
files on disk to store their data: while the files hold the data, you don't
manipulate them directly. Instead, you go through the interface that the
database provides. HDFS uses the native file system in the same way: it
stores its data there, but not in a directly usable form.
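The RDBMS analogy can be made concrete with SQLite (used here purely as an illustration; HDFS itself is not involved, and the file name is invented). The data lives in an ordinary file on the local disk, but its bytes are in the engine's internal format, so you work with it only through the database interface:

```python
import os
import sqlite3

# The database's data lives in an ordinary file on the local file system.
db_path = "example.db"  # hypothetical file name for this sketch
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS t (k TEXT, v TEXT)")
conn.execute("INSERT INTO t VALUES ('key', 'value')")
conn.commit()

# The file exists on disk, but its contents are the engine's internal
# storage format, not something you would edit by hand.
assert os.path.exists(db_path)
with open(db_path, "rb") as f:
    header = f.read(16)
print(header)  # SQLite's internal file header, not your row data

# You get the data back through the interface, not by parsing the file.
rows = conn.execute("SELECT v FROM t WHERE k = 'key'").fetchall()
print(rows)
conn.close()
os.remove(db_path)
```

HDFS follows the same pattern: the local file system holds the bytes, but all access goes through the HDFS interface.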
HDFS uses a write-once, read-many access model. This means that data,
once written to a file, cannot be updated. Files can be deleted and then
rewritten, but not updated. Although this might seem like a major
limitation, in practice, with large data sets, it is often much faster to delete
and replace than to perform in-place updates of data.
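The write-once, read-many model can be sketched with a toy in-memory store (all names here are invented for illustration; this is not the HDFS API): a file can be created once and read any number of times, an in-place update is rejected, and the supported path to new content is delete-then-rewrite.

```python
class WriteOnceStore:
    """Toy store mimicking HDFS's write-once, read-many access model."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        # Writing to an existing path is rejected: once written,
        # a file's contents cannot be updated in place.
        if path in self._files:
            raise PermissionError(f"{path}: file exists and cannot be updated")
        self._files[path] = data

    def read(self, path):
        # Reads are unrestricted: read-many.
        return self._files[path]

    def delete(self, path):
        del self._files[path]


store = WriteOnceStore()
store.create("/data/log.txt", b"v1")
print(store.read("/data/log.txt"))  # b'v1'

try:
    store.create("/data/log.txt", b"v2")  # in-place update: not allowed
except PermissionError as e:
    print(e)

# The supported pattern: delete the file, then rewrite it whole.
store.delete("/data/log.txt")
store.create("/data/log.txt", b"v2")
print(store.read("/data/log.txt"))  # b'v2'
```

The delete-and-rewrite step at the end mirrors how large data sets are refreshed in practice: replacing a whole file is simpler and often faster than supporting random in-place updates.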
Now that you have a better understanding of HDFS, in the next sections
you look at the architecture behind HDFS, learn about NameNodes and
DataNodes, and find out about HDFS support for replication.