• To enable large-scale data sets to be stored and processed
• To provide reliable data storage and be fault tolerant
• To provide support for moving the computation to the data
HDFS addresses these goals through a number of architectural
features. The goals also impose constraints that HDFS had to address.
(The next section breaks down these features and shows how they meet the
goals and constraints of a distributed file system designed for large
data sets.)
HDFS is implemented in the Java language, which makes it highly
portable because virtually all modern operating systems support Java.
Communication in HDFS is handled over Transmission Control Protocol/
Internet Protocol (TCP/IP), which is just as widely supported and adds
to that portability.
In HDFS terminology, an installation is usually referred to as a cluster,
and a cluster is made up of individual nodes. A node is a single computer
that participates in the cluster, so a reference to a cluster encompasses
all the individual computers that participate in it.
HDFS runs beside the local file system. Each computer still has a standard
file system available on it. For Linux, that may be ext4, and for a Windows
server, it's usually going to be New Technology File System (NTFS). HDFS
stores its data as files in the local file system, but not in a form that
you can interact with directly. This is similar to the way SQL Server
or other relational database management systems (RDBMSs) use physical
files on disk to store their data: while the files hold the data, you don't
manipulate them directly. Instead, you go through the interface that the
database provides. HDFS uses the native file system in the same way: it
stores its data there, but not in a directly usable form.
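The RDBMS analogy can be made concrete with SQLite (used here purely as an illustration; HDFS itself is not involved, and the file name is invented). The data lives in an ordinary file on the local disk, but its bytes are in the engine's internal format, so you work with it only through the database interface:

```python
import os
import sqlite3

# The database's data lives in an ordinary file on the local file system.
db_path = "example.db"  # hypothetical file name for this sketch
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS t (k TEXT, v TEXT)")
conn.execute("INSERT INTO t VALUES ('key', 'value')")
conn.commit()

# The file exists on disk, but its contents are the engine's internal
# storage format, not something you would edit by hand.
assert os.path.exists(db_path)
with open(db_path, "rb") as f:
    header = f.read(16)
print(header)  # SQLite's internal file header, not your row data

# You get the data back through the interface, not by parsing the file.
rows = conn.execute("SELECT v FROM t WHERE k = 'key'").fetchall()
print(rows)
conn.close()
os.remove(db_path)
```

HDFS follows the same pattern: the local file system holds the bytes, but all access goes through the HDFS interface.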
HDFS uses a write-once, read-many access model. This means that data,
once written to a file, cannot be updated. Files can be deleted and then
rewritten, but not updated. Although this might seem like a major
limitation, in practice, with large data sets, it is often much faster to delete
and replace than to perform in-place updates of data.
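The write-once, read-many model can be sketched with a toy in-memory store (all names here are invented for illustration; this is not the HDFS API): a file can be created once and read any number of times, an in-place update is rejected, and the supported path to new content is delete-then-rewrite.

```python
class WriteOnceStore:
    """Toy store mimicking HDFS's write-once, read-many access model."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        # Writing to an existing path is rejected: once written,
        # a file's contents cannot be updated in place.
        if path in self._files:
            raise PermissionError(f"{path}: file exists and cannot be updated")
        self._files[path] = data

    def read(self, path):
        # Reads are unrestricted: read-many.
        return self._files[path]

    def delete(self, path):
        del self._files[path]


store = WriteOnceStore()
store.create("/data/log.txt", b"v1")
print(store.read("/data/log.txt"))  # b'v1'

try:
    store.create("/data/log.txt", b"v2")  # in-place update: not allowed
except PermissionError as e:
    print(e)

# The supported pattern: delete the file, then rewrite it whole.
store.delete("/data/log.txt")
store.create("/data/log.txt", b"v2")
print(store.read("/data/log.txt"))  # b'v2'
```

The delete-and-rewrite step at the end mirrors how large data sets are refreshed in practice: replacing a whole file is simpler and often faster than supporting random in-place updates.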
Now that you have a better understanding of HDFS, in the next sections
you look at the architecture behind HDFS, learn about NameNodes and
DataNodes, and find out about HDFS support for replication.