Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually deleted but rather are moved to a trash folder, where they remain for a minimum period before being permanently deleted by the system. The minimum period in minutes that a file will remain in the trash is set using the fs.trash.interval configuration property in core-site.xml. By default, the trash interval is zero, which disables trash.
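For example, to keep deleted files in the trash for at least one day, you could set the interval to 1,440 minutes in core-site.xml (the value here is illustrative; choose one that fits your retention needs):

```xml
<property>
  <name>fs.trash.interval</name>
  <!-- Minimum time in minutes that deleted files stay in trash; 1440 = 1 day -->
  <value>1440</value>
</property>
```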
Like in many operating systems, Hadoop's trash facility is a user-level feature, meaning that only files that are deleted using the filesystem shell are put in the trash. Files deleted programmatically are deleted immediately. It is possible to use the trash programmatically, however, by constructing a Trash instance, then calling its moveToTrash() method with the Path of the file intended for deletion. The method returns a value indicating success; a value of false means either that trash is not enabled or that the file is already in the trash.
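A minimal sketch of deleting a file via the trash programmatically might look like the following (the path is illustrative, and this assumes a configured Hadoop client on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDeleteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative path; in practice this would be the file you want to delete
    Path path = new Path("/user/alice/old-data");

    Trash trash = new Trash(fs, conf);
    boolean moved = trash.moveToTrash(path);
    if (!moved) {
      // Either trash is disabled (fs.trash.interval is 0)
      // or the file is already in the trash
      System.err.println("File was not moved to trash: " + path);
    }
  }
}
```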
When trash is enabled, users each have their own trash directories called .Trash in their home directories. File recovery is simple: you look for the file in a subdirectory of .Trash and move it out of the trash subtree.
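For instance, restoring a trashed file on HDFS might look like this (the paths are illustrative; the Current subdirectory holds the most recently deleted files):

% hadoop fs -mv /user/alice/.Trash/Current/user/alice/old-data /user/alice/old-data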
HDFS will automatically delete files in trash folders, but other filesystems will not, so you have to arrange for this to be done periodically. You can expunge the trash, which will delete files that have been in the trash longer than their minimum period, using the filesystem shell:

% hadoop fs -expunge

The Trash class exposes an expunge() method that has the same effect.
Job scheduler
Particularly in a multiuser setting, consider updating the job scheduler queue configuration to reflect your organizational needs. For example, you can set up a queue for each group using the cluster. See Scheduling in YARN.
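As a sketch, with the Capacity Scheduler you might define one queue per group in capacity-scheduler.xml (queue names and capacities here are illustrative):

```xml
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>eng,analytics</value>
</property>
<property>
  <!-- Percentage of cluster capacity guaranteed to the eng queue -->
  <name>yarn.scheduler.capacity.root.eng.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value>
</property>
```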
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before scheduling reduce tasks for the same job. For large jobs, this can cause problems with cluster utilization, since they take up reduce containers while waiting for the map tasks to complete. Setting mapreduce.job.reduce.slowstart.completedmaps to a higher value, such as 0.80 (80%), can help improve throughput.
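For example, in mapred-site.xml (the 0.80 value is illustrative):

```xml
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <!-- Fraction of map tasks that must complete before reducers are scheduled -->
  <value>0.80</value>
</property>
```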