Security
Early versions of Hadoop assumed that HDFS and MapReduce clusters would be used by a
group of cooperating users within a secure environment. The measures for restricting
access were designed to prevent accidental data loss, rather than to prevent unauthorized
access to data. For example, the file permissions system in HDFS prevents one user from
accidentally wiping out the whole filesystem because of a bug in a program, or by mistakenly
typing hadoop fs -rmr /, but it doesn't prevent a malicious user from assuming
root's identity to access or delete any data in the cluster.
In security parlance, what was missing was a secure authentication mechanism to assure
Hadoop that the user seeking to perform an operation on the cluster is who he claims to be
and therefore can be trusted. HDFS file permissions provide only a mechanism for
authorization, which controls what a particular user can do to a particular file. For example,
a file may be readable only by a certain group of users, so anyone not in that group is not
authorized to read it. However, authorization is not enough by itself, because the system is
still open to abuse via spoofing by a malicious user who can gain network access to the cluster.
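The gap between the two can be illustrated with a minimal Python sketch. This is not HDFS code: the FileEntry type, the can_read function, the group table, and the user names are all hypothetical. It models a POSIX-style read check that, like the early Hadoop permission system described above, trusts whatever username the caller supplies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a file's owner, group, and mode bits."""
    owner: str
    group: str
    mode: int  # POSIX-style permission bits, e.g. 0o640


# Hypothetical group membership table (an assumption for this sketch).
GROUPS = {"analysts": {"alice", "bob"}}


def can_read(claimed_user: str, entry: FileEntry) -> bool:
    """Authorization only: the decision rests on the *claimed* username.

    Nothing here verifies that the caller really is claimed_user.
    """
    if claimed_user == entry.owner:
        return bool(entry.mode & 0o400)  # owner read bit
    if claimed_user in GROUPS.get(entry.group, set()):
        return bool(entry.mode & 0o040)  # group read bit
    return bool(entry.mode & 0o004)      # other read bit


report = FileEntry(owner="alice", group="analysts", mode=0o640)

print(can_read("bob", report))      # member of the file's group
print(can_read("mallory", report))  # not in the group
print(can_read("alice", report))    # anyone who merely claims to be alice
```

The check correctly turns away a user who is outside the group, but only as long as that user is honest about her name. A caller who simply presents the owner's name passes the same check, which is the spoofing problem that authorization alone cannot solve.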
It's common to restrict access to data that contains personally identifiable information
(such as an end user's full name or IP address) to a small set of users (of the cluster) within
the organization who are authorized to access such information. Less sensitive (or
anonymized) data may be made available to a larger set of users. It is convenient to host a mix
of datasets with different security levels on the same cluster (not least because it means the
datasets with lower security levels can be shared). However, to meet regulatory
requirements for data protection, secure authentication must be in place for shared clusters.
This is the situation that Yahoo! faced in 2009, which led a team of engineers there to
implement secure authentication for Hadoop. In their design, Hadoop itself does not manage
user credentials; instead, it relies on Kerberos, a mature open-source network
authentication protocol, to authenticate the user. However, Kerberos doesn't manage permissions.
Kerberos establishes that a user is who she says she is; it's Hadoop's job to determine
whether that user has permission to perform a given action.
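This division of labor can be sketched in a few lines of Python. The sketch does not use Kerberos or any Hadoop API: an HMAC-signed token stands in for the proof of identity that Kerberos would supply, and every name in it (issue_token, authenticate, read, the READERS table, the secret) is a hypothetical chosen for illustration. It shows only the ordering of the two checks, under those assumptions.

```python
import hashlib
import hmac

# Hypothetical key known to the authentication service and the cluster
# service. It is a stand-in for Kerberos, not a model of how Kerberos works.
_AUTH_SECRET = b"demo-secret"

# Hypothetical permission table owned by the cluster (authorization data).
READERS = {"/data/pii": {"alice"}}


def issue_token(user: str) -> str:
    """Authentication service: vouch for an identity it has verified."""
    return hmac.new(_AUTH_SECRET, user.encode(), hashlib.sha256).hexdigest()


def authenticate(user: str, token: str) -> bool:
    """Check that the token really was issued for this user."""
    return hmac.compare_digest(issue_token(user), token)


def read(user: str, token: str, path: str) -> str:
    """Cluster side: authenticate first, then apply its own permissions."""
    if not authenticate(user, token):
        return "denied: authentication failed"
    if user not in READERS.get(path, set()):
        return "denied: not authorized"
    return "ok"


alice_token = issue_token("alice")
bob_token = issue_token("bob")

print(read("alice", alice_token, "/data/pii"))  # genuine and permitted
print(read("alice", "forged", "/data/pii"))     # spoofed identity
print(read("bob", bob_token, "/data/pii"))      # genuine but not permitted
```

The token check knows nothing about which paths a user may read, and the permission table knows nothing about how identity was proven. Keeping the two apart is what lets Hadoop delegate the first question to an external system while retaining control of the second.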
There's a lot to security in Hadoop, and this section only covers the highlights. For more,
readers are referred to Hadoop Security by Ben Spivey and Joey Echeverria (O'Reilly,
2014).
Kerberos and Hadoop
At a high level, there are three steps that a client must take to access a service when using
Kerberos, each of which involves a message exchange with a server: