Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually deleted but rather are moved to a trash folder, where they remain for a minimum period before being permanently deleted by the system. The minimum period in minutes that a file will remain in the trash is set using the fs.trash.interval configuration property in core-site.xml. By default, the trash interval is zero, which disables trash.
As in many operating systems, Hadoop's trash facility is a user-level feature, meaning that only files that are deleted using the filesystem shell are put in the trash. Files deleted programmatically are deleted immediately. It is possible to use the trash programmatically, however, by constructing a Trash instance, then calling its moveToTrash() method with the Path of the file intended for deletion. The method returns a value indicating success; a value of false means either that trash is not enabled or that the file is already in the trash.
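For instance, a minimal sketch of deleting a file via the trash from a Java program might look like this (the class name TrashDelete is illustrative, and the program assumes the path to delete is passed as its first argument):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDelete {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]); // file to delete

    Trash trash = new Trash(fs, conf);
    boolean moved = trash.moveToTrash(path);
    if (!moved) {
      // false: trash is disabled, or the file is already in the trash
      System.err.println(path + " was not moved to trash");
    }
  }
}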
When trash is enabled, each user has their own trash directory called .Trash in their home directory. File recovery is simple: you look for the file in a subdirectory of .Trash and move it out of the trash subtree.
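For example, if user tom had deleted /user/tom/quangle.txt from the shell, it could be recovered from the current trash checkpoint with something like the following (the username and filename here are illustrative):
% hadoop fs -mv .Trash/Current/user/tom/quangle.txt quangle.txt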
HDFS will automatically delete files in trash folders, but other filesystems will not, so you have to arrange for this to be done periodically. You can expunge the trash, which will delete files that have been in the trash longer than their minimum period, using the filesystem shell:
% hadoop fs -expunge
The Trash class exposes an expunge() method that has the same effect.
Job scheduler
Particularly in a multiuser setting, consider updating the job scheduler queue configuration to reflect your organizational needs. For example, you can set up a queue for each group using the cluster. See Scheduling in YARN.
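As a sketch, if the Capacity Scheduler is in use, per-group queues might be declared in capacity-scheduler.xml along these lines (the queue names and capacities are illustrative):

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>analytics,etl</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>40</value>
</property>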
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before scheduling reduce tasks for the same job. For large jobs, this can cause problems with cluster utilization, since the reduce tasks occupy containers while waiting for the map tasks to complete. Setting mapreduce.job.reduce.slowstart.completedmaps to a higher value, such as 0.80 (80%), can help improve throughput.
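For example, the property can be overridden for a single job on the command line, assuming the job's driver uses ToolRunner so that generic options are parsed (job.jar and MyDriver are illustrative names):
% hadoop jar job.jar MyDriver \
    -D mapreduce.job.reduce.slowstart.completedmaps=0.80 input output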