Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually deleted but rather are moved to a trash folder, where they remain for a minimum period before being permanently deleted by the system. The minimum period in minutes that a file will remain in the trash is set using the fs.trash.interval configuration property in core-site.xml. By default, the trash interval is zero, which disables trash.
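For example, to keep deleted files in the trash for at least one day, you could set the interval to 1,440 minutes in core-site.xml (the value here is illustrative; choose one that fits your retention needs):

```xml
<property>
  <name>fs.trash.interval</name>
  <!-- Minimum time in minutes that deleted files stay in trash; 1440 = 1 day -->
  <value>1440</value>
</property>
```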
Like in many operating systems, Hadoop's trash facility is a user-level feature, meaning that only files that are deleted using the filesystem shell are put in the trash. Files deleted programmatically are deleted immediately. It is possible to use the trash programmatically, however, by constructing a Trash instance, then calling its moveToTrash() method with the Path of the file intended for deletion. The method returns a value indicating success; a value of false means either that trash is not enabled or that the file is already in the trash.
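A minimal sketch of deleting a file via the trash programmatically might look like the following (the path is illustrative, and this assumes a configured Hadoop client on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDeleteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative path; in practice this would be the file you want to delete
    Path path = new Path("/user/alice/old-data");

    Trash trash = new Trash(fs, conf);
    boolean moved = trash.moveToTrash(path);
    if (!moved) {
      // Either trash is disabled (fs.trash.interval is 0)
      // or the file is already in the trash
      System.err.println("File was not moved to trash: " + path);
    }
  }
}
```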
When trash is enabled, users each have their own trash directories called .Trash in their home directories. File recovery is simple: you look for the file in a subdirectory of .Trash and move it out of the trash subtree.
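For instance, restoring a trashed file on HDFS might look like this (the paths are illustrative; the Current subdirectory holds the most recently deleted files):

% hadoop fs -mv /user/alice/.Trash/Current/user/alice/old-data /user/alice/old-data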
HDFS will automatically delete files in trash folders, but other filesystems will not, so you have to arrange for this to be done periodically. You can expunge the trash, which will delete files that have been in the trash longer than their minimum period, using the filesystem shell:

% hadoop fs -expunge

The Trash class exposes an expunge() method that has the same effect.
Job scheduler
Particularly in a multiuser setting, consider updating the job scheduler queue configuration to reflect your organizational needs. For example, you can set up a queue for each group using the cluster. See Scheduling in YARN.
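As a sketch, with the Capacity Scheduler you might define one queue per group in capacity-scheduler.xml (queue names and capacities here are illustrative):

```xml
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>eng,analytics</value>
</property>
<property>
  <!-- Percentage of cluster capacity guaranteed to the eng queue -->
  <name>yarn.scheduler.capacity.root.eng.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value>
</property>
```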
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before scheduling reduce tasks for the same job. For large jobs, this can cause problems with cluster utilization, since they take up reduce containers while waiting for the map tasks to complete. Setting mapreduce.job.reduce.slowstart.completedmaps to a higher value, such as 0.80 (80%), can help improve throughput.
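For example, in mapred-site.xml (the 0.80 value is illustrative):

```xml
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <!-- Fraction of map tasks that must complete before reducers are scheduled -->
  <value>0.80</value>
</property>
```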